long-standing data loss bug in initial sync of logical replication

Started by Tomas Vondra, about 2 years ago. 105 messages.
#1 Tomas Vondra
tomas.vondra@enterprisedb.com
5 attachment(s)

Hi,

It seems there's a long-standing data loss issue related to the initial
sync of tables in the built-in logical replication (publications etc.).
I can reproduce it fairly reliably, but I haven't figured out all the
details yet and I'm a bit out of ideas, so I'm sharing what I know with
the hope someone takes a look and either spots the issue or has some
other insight ...

On pgsql-bugs, Depesz reported [1] cases where tables are added to a
publication but end up missing rows on the subscriber. I didn't know
what might be the issue, but given his experience I decided to do some
blind attempts to reproduce it.

I'm not going to repeat all the details from the pgsql-bugs thread, but
I ended up writing a script that does a randomized stress test of
tablesync under concurrent load. Attached are two scripts:
crash-test.sh does the main work, while run.sh drives the test - it
executes crash-test.sh in a loop and generates random parameters for it.

run.sh generates the number of tables, the refresh interval (after how
many tables we refresh the subscription) and how long to sleep between
steps (to allow pgbench to do more work).

The crash-test.sh then does this:

1) initializes two clusters (expects $PATH to have pg_ctl etc.)

2) configures them for logical replication (wal_level, ...)

3) creates publication and subscription on the nodes

4) creates a bunch of tables

5) starts a pgbench that inserts data into the tables

6) adds the tables to the publication one by one, occasionally
refreshing the subscription (roughly the SQL sketched after this list)

7) waits for tablesync of all the tables to complete (so that the
tables get into the 'r' state, thus replicating normally)

8) stops the pgbench

9) waits for the subscriber to fully catch up

10) checks that the tables match on the publisher/subscriber nodes
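
For a single table, steps 6) and 7) boil down to roughly the following
SQL (the table, publication and subscription names are made up for this
example; the attached scripts generate their own):

-- on the publisher: add one table to the publication
ALTER PUBLICATION test_pub ADD TABLE t0001;

-- on the subscriber: pick up the newly added table
ALTER SUBSCRIPTION test_sub REFRESH PUBLICATION;

-- on the subscriber: poll until tablesync finishes, i.e. until the
-- table reaches the 'r' (ready) state
SELECT srsubstate
  FROM pg_subscription_rel
 WHERE srrelid = 't0001'::regclass;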

To run this, just make sure PATH includes pg, and do e.g.

./run.sh 10

which does 10 runs of crash-test.sh with random parameters. Each run can
take a couple minutes, depending on the parameters, hardware etc.

Obviously, we expect the tables to match on the two nodes, but the
script regularly detects cases where the subscriber is missing some of
the rows. The script dumps those tables, and the rows contain timestamps
and LSNs to allow "rough correlation" (imperfect thanks to concurrency).
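
For context, the tables in the test look roughly like this (a
hypothetical sketch - the real DDL is in the attached crash-test.sh), so
every row records when and at which publisher LSN it was inserted:

CREATE TABLE t0001 (
    id   serial PRIMARY KEY,
    ts   timestamptz DEFAULT clock_timestamp(),
    lsn  pg_lsn      DEFAULT pg_current_wal_lsn()
);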

Depesz reported "gaps" in the data, i.e. missing a chunk of data, but
then following rows seemingly replicated. I did see such cases too, but
most of the time I see a missing chunk of rows at the end (but maybe if
the test continued a bit longer, it'd replicate some rows).

The report talks about replication between pg12->pg14, but I don't think
the cross-version part is necessary - I'm able to reproduce the issue on
a single version (e.g. 12->12) for every release since 12 (I haven't
tried 11, but I'd be surprised if it wasn't affected too).

The rows include `pg_current_wal_lsn()` to roughly track the LSN where
the row was inserted, and the "gap" of missing rows for each table seems
to match pg_subscription_rel.srsublsn, i.e. the LSN up to which tablesync
copied the data and after which the table should be replicated as usual.
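
That correlation can be eyeballed with something like this on the
subscriber (the "lsn" column is the one filled via pg_current_wal_lsn()
in the sketch above):

-- highest publisher LSN present in the subscriber's copy of the table
SELECT lsn FROM t0001 ORDER BY lsn DESC LIMIT 1;

-- LSN up to which tablesync copied the table, per the subscriber catalog
SELECT srsublsn FROM pg_subscription_rel
 WHERE srrelid = 't0001'::regclass;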

Another interesting observation is that the issue only happens for "bulk
insert" transactions, i.e.

BEGIN;
... INSERT into all tables ...
COMMIT;

but not when each insert is a separate transaction. A bit strange.

After quite a bit of debugging, I came to the conclusion this happens
because we fail to invalidate caches on the publisher, so it does not
realize it should start sending rows for that table.

In particular, we initially build the RelationSyncEntry when the table is
not yet included in the publication, so we end up with pubinsert=false,
thus not replicating the inserts. Which makes sense, but we then seem to
fail to invalidate the entry after the table is added to the publication.

The other problem is that even if we do invalidate the entry, we call
GetRelationPublications(), and even when that happens long after the
table got added to the publication (both in time and LSN terms), it
still returns NIL as if the table had no publications. So we end up
with pubinsert=false, skipping the inserts again.
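
To be clear, at that point the catalog on the publisher already shows the
table as published - a check like this (names as in the earlier sketch)
does return the publication - yet the decoding side keeps computing the
pubactions from stale data:

SELECT p.pubname
  FROM pg_publication_rel pr
  JOIN pg_publication p ON p.oid = pr.prpubid
 WHERE pr.prrelid = 't0001'::regclass;

-- or, using the convenience view
SELECT pubname
  FROM pg_publication_tables
 WHERE schemaname = 'public' AND tablename = 't0001';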

Attached are three patches against master. 0001 adds some debug logging
that I found useful when investigating the issue. 0002 illustrates the
issue by forcefully invalidating the entry for each change, and
implementing a non-syscache variant of GetRelationPublications().
This makes the code unbearably slow, but with both changes in place I
can no longer reproduce the issue. Undoing either of the two changes
makes it reproducible again. (I'll talk about 0003 later.)

I suppose timing matters, so it's possible it gets "fixed" simply
because of that, but I find that unlikely given the number of runs I did
without observing any failure.

Overall, this looks, walks and quacks like a cache invalidation issue,
likely a missing invalidation somewhere in the ALTER PUBLICATION code.
If we fail to invalidate the pg_publication_rel syscache somewhere, that
would obviously explain why GetRelationPublications() returns stale data,
it would also explain why the RelationSyncEntry is not invalidated, as
that happens in a syscache callback.

But I tried to do various crazy things in the ALTER PUBLICATION code,
and none of that worked, so I'm a bit confused/lost.

However, while randomly poking at different things, I realized that if I
change the lock obtained on the relation in OpenTableList() from
ShareUpdateExclusiveLock to ShareRowExclusiveLock, the issue goes away.
I don't know why it works, and I don't even recall what exactly led me
to the idea of changing it.

This is what 0003 does - it reverts 0002 and changes the lock level.

AFAIK the logical decoding code doesn't actually acquire locks on the
decoded tables, so why would this change matter? The only place that
does lock the relation is the tablesync, which gets RowExclusiveLock on
it. And it's interesting that RowExclusiveLock does not conflict with
ShareUpdateExclusiveLock, but does with ShareRowExclusiveLock. But why
would this even matter, when the tablesync can only touch the table
after it gets added to the publication?
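
To spell out the conflict-matrix difference with a standalone example
(using explicit LOCK TABLE as a stand-in for the lock OpenTableList()
takes; table name as in the earlier sketch):

-- session 1: an open "bulk insert" transaction holds RowExclusiveLock
BEGIN;
INSERT INTO t0001 DEFAULT VALUES;

-- session 2: ShareUpdateExclusiveLock does not conflict with that,
-- so this acquires the lock immediately
BEGIN;
LOCK TABLE t0001 IN SHARE UPDATE EXCLUSIVE MODE;
ROLLBACK;

-- session 2: ShareRowExclusiveLock does conflict, so this blocks
-- until session 1 commits or aborts
BEGIN;
LOCK TABLE t0001 IN SHARE ROW EXCLUSIVE MODE;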

regards

[1]: /messages/by-id/ZTu8GTDajCkZVjMs@depesz.com

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

0001-debug-logging.patch (text/x-patch)
From 4253c364838daf26c056c56d693bc00b1e3e8f73 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 16 Nov 2023 17:46:14 +0100
Subject: [PATCH 1/3] debug logging

---
 src/backend/catalog/pg_subscription.c       |  8 ++++++
 src/backend/replication/logical/worker.c    | 11 ++++++++
 src/backend/replication/pgoutput/pgoutput.c | 29 ++++++++++++++++++++-
 3 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index d6a978f1362..0a2a644c293 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -238,6 +238,8 @@ AddSubscriptionRelState(Oid subid, Oid relid, char state,
 	bool		nulls[Natts_pg_subscription_rel];
 	Datum		values[Natts_pg_subscription_rel];
 
+	elog(LOG, "AddSubscriptionRelState relid %d state %c LSN %X/%X", relid, state, LSN_FORMAT_ARGS(sublsn));
+
 	LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
 
 	rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
@@ -285,6 +287,8 @@ UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
 	Datum		values[Natts_pg_subscription_rel];
 	bool		replaces[Natts_pg_subscription_rel];
 
+	elog(LOG, "UpdateSubscriptionRelState relid %d state %c LSN %X/%X", relid, state, LSN_FORMAT_ARGS(sublsn));
+
 	LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
 
 	rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
@@ -369,6 +373,8 @@ GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn)
 
 	table_close(rel, AccessShareLock);
 
+	elog(LOG, "GetSubscriptionRelState relid %d state %c LSN %X/%X", relid, substate, LSN_FORMAT_ARGS(*sublsn));
+
 	return substate;
 }
 
@@ -531,6 +537,8 @@ GetSubscriptionRelations(Oid subid, bool not_ready)
 		else
 			relstate->lsn = DatumGetLSN(d);
 
+		elog(LOG, "GetSubscriptionRelations relid %d state %c LSN %X/%X", relstate->relid, relstate->state, LSN_FORMAT_ARGS(relstate->lsn));
+
 		res = lappend(res, relstate);
 	}
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 52a9f136ab9..42e5423172b 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -503,6 +503,10 @@ should_apply_changes_for_rel(LogicalRepRelMapEntry *rel)
 			return rel->state == SUBREL_STATE_READY;
 
 		case WORKERTYPE_APPLY:
+			elog(LOG, "should_apply_changes_for_rel relid %d return %d", RelationGetRelid(rel->localrel), (rel->state == SUBREL_STATE_READY ||
+					(rel->state == SUBREL_STATE_SYNCDONE &&
+					 rel->statelsn <= remote_final_lsn)));
+
 			return (rel->state == SUBREL_STATE_READY ||
 					(rel->state == SUBREL_STATE_SYNCDONE &&
 					 rel->statelsn <= remote_final_lsn));
@@ -2398,20 +2402,27 @@ apply_handle_insert(StringInfo s)
 	MemoryContext oldctx;
 	bool		run_as_owner;
 
+	elog(LOG, "apply_handle_insert");
+
 	/*
 	 * Quick return if we are skipping data modification changes or handling
 	 * streamed transactions.
 	 */
 	if (is_skipping_changes() ||
 		handle_streamed_transaction(LOGICAL_REP_MSG_INSERT, s))
+	{
+		elog(LOG, "apply_handle_insert / skipping changes or streaming");
 		return;
+	}
 
 	begin_replication_step();
 
 	relid = logicalrep_read_insert(s, &newtup);
 	rel = logicalrep_rel_open(relid, RowExclusiveLock);
+	elog(LOG, "apply_handle_insert relid %d state %c lsn %X/%X final %X/%X", RelationGetRelid(rel->localrel), rel->state, LSN_FORMAT_ARGS(rel->statelsn), LSN_FORMAT_ARGS(remote_final_lsn));
 	if (!should_apply_changes_for_rel(rel))
 	{
+		elog(LOG, "apply_handle_insert relid %d skipping", RelationGetRelid(rel->localrel));
 		/*
 		 * The relation can't become interesting in the middle of the
 		 * transaction so it's safe to unlock it.
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index e8add5ee5d9..09f53bea0a0 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -1409,9 +1409,12 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	ReorderBufferChangeType action = change->action;
 	TupleTableSlot *old_slot = NULL;
 	TupleTableSlot *new_slot = NULL;
-
+elog(LOG, "pgoutput_change relid %d LSN %X/%X", RelationGetRelid(relation), LSN_FORMAT_ARGS(change->lsn));
 	if (!is_publishable_relation(relation))
+	{
+		elog(LOG, "pgoutput_change relid %d LSN %X/%X not publishable", RelationGetRelid(relation), LSN_FORMAT_ARGS(change->lsn));
 		return;
+	}
 
 	/*
 	 * Remember the xid for the change in streaming mode. We need to send xid
@@ -1429,7 +1432,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	{
 		case REORDER_BUFFER_CHANGE_INSERT:
 			if (!relentry->pubactions.pubinsert)
+			{
+				elog(LOG, "pgoutput_change relid %d LSN %X/%X pubinsert=false", RelationGetRelid(relation), LSN_FORMAT_ARGS(change->lsn));
 				return;
+			}
 			break;
 		case REORDER_BUFFER_CHANGE_UPDATE:
 			if (!relentry->pubactions.pubupdate)
@@ -1502,7 +1508,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	 * of the row filter for old and new tuple.
 	 */
 	if (!pgoutput_row_filter(targetrel, old_slot, &new_slot, relentry, &action))
+	{
+		elog(LOG, "pgoutput_change relid %d LSN %X/%X pgoutput_row_filter", RelationGetRelid(relation), LSN_FORMAT_ARGS(change->lsn));
 		goto cleanup;
+	}
 
 	/*
 	 * Send BEGIN if we haven't yet.
@@ -1522,10 +1531,13 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	OutputPluginPrepareWrite(ctx, true);
 
+	elog(LOG, "pgoutput_change relid %d LSN %X/%X sending data", RelationGetRelid(relation), LSN_FORMAT_ARGS(change->lsn));
+
 	/* Send the data */
 	switch (action)
 	{
 		case REORDER_BUFFER_CHANGE_INSERT:
+			elog(LOG, "pgoutput_change relid %d LSN %X/%X write insert message", RelationGetRelid(relation), LSN_FORMAT_ARGS(change->lsn));
 			logicalrep_write_insert(ctx->out, xid, targetrel, new_slot,
 									data->binary, relentry->columns);
 			break;
@@ -1974,6 +1986,7 @@ get_rel_sync_entry(PGOutputData *data, Relation relation)
 	/* initialize entry, if it's new */
 	if (!found)
 	{
+		elog(LOG, "get_rel_sync_entry relation %d not found", RelationGetRelid(relation));
 		entry->replicate_valid = false;
 		entry->schema_sent = false;
 		entry->streamed_txns = NIL;
@@ -1988,6 +2001,8 @@ get_rel_sync_entry(PGOutputData *data, Relation relation)
 		entry->attrmap = NULL;
 	}
 
+	elog(LOG, "get_rel_sync_entry relation %d replicate_valid %d", RelationGetRelid(relation), entry->replicate_valid);
+
 	/* Validate the entry */
 	if (!entry->replicate_valid)
 	{
@@ -2007,6 +2022,8 @@ get_rel_sync_entry(PGOutputData *data, Relation relation)
 		char		relkind = get_rel_relkind(relid);
 		List	   *rel_publications = NIL;
 
+		elog(LOG, "GetRelationPublications relation %d publications %p %d", relid, pubids, list_length(pubids));
+
 		/* Reload publications if needed before use. */
 		if (!publications_valid)
 		{
@@ -2021,6 +2038,8 @@ get_rel_sync_entry(PGOutputData *data, Relation relation)
 			publications_valid = true;
 		}
 
+		elog(LOG, "get_rel_sync_entry relation %d publications %d", RelationGetRelid(relation), list_length(data->publications));
+
 		/*
 		 * Reset schema_sent status as the relation definition may have
 		 * changed.  Also reset pubactions to empty in case rel was dropped
@@ -2097,6 +2116,8 @@ get_rel_sync_entry(PGOutputData *data, Relation relation)
 				}
 			}
 
+			elog(LOG, "get_rel_sync_entry relation %d publish %d (A)", RelationGetRelid(relation), publish);
+
 			if (!publish)
 			{
 				bool		ancestor_published = false;
@@ -2128,12 +2149,16 @@ get_rel_sync_entry(PGOutputData *data, Relation relation)
 					}
 				}
 
+				elog(LOG, "get_rel_sync_entry relation %d relation pubs %d subscription pub %d", RelationGetRelid(relation), list_length(pubids), pub->oid);
+
 				if (list_member_oid(pubids, pub->oid) ||
 					list_member_oid(schemaPubids, pub->oid) ||
 					ancestor_published)
 					publish = true;
 			}
 
+			elog(LOG, "get_rel_sync_entry relation %d publish %d (B)", RelationGetRelid(relation), publish);
+
 			/*
 			 * If the relation is to be published, determine actions to
 			 * publish, and list of columns, if appropriate.
@@ -2210,6 +2235,8 @@ get_rel_sync_entry(PGOutputData *data, Relation relation)
 		entry->replicate_valid = true;
 	}
 
+	elog(LOG, "get_rel_sync_entry relation %d pubinsert %d", RelationGetRelid(relation), entry->pubactions.pubinsert);
+
 	return entry;
 }
 
-- 
2.41.0

0002-experimental-fix.patch (text/x-patch)
From bc400e7c9c8ad2ac4a0c431dfab9ec2f78bd5870 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 16 Nov 2023 19:57:33 +0100
Subject: [PATCH 2/3] experimental fix

---
 src/backend/replication/pgoutput/pgoutput.c | 54 ++++++++++++++++++++-
 1 file changed, 53 insertions(+), 1 deletion(-)

diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 09f53bea0a0..adcd5688045 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -12,6 +12,8 @@
  */
 #include "postgres.h"
 
+#include "access/genam.h"
+#include "access/table.h"
 #include "access/tupconvert.h"
 #include "catalog/partition.h"
 #include "catalog/pg_publication.h"
@@ -29,6 +31,7 @@
 #include "replication/origin.h"
 #include "replication/pgoutput.h"
 #include "utils/builtins.h"
+#include "utils/fmgroids.h"
 #include "utils/inval.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -1958,6 +1961,51 @@ set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
 	MemoryContextSwitchTo(oldctx);
 }
 
+static List*
+GetRelationPublicationsRaw(Oid relid)
+{
+	List *result = NIL;
+	SysScanDesc scandesc;
+	Relation relation;
+	ScanKeyData key[1];
+	HeapTuple	tup;
+
+	elog(LOG, "GetRelationPublicationsRaw relid %d / start", relid);
+
+	relation = table_open(PublicationRelRelationId, AccessShareLock);
+
+	ScanKeyInit(&key[0],
+				Anum_pg_publication_rel_prrelid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(relid));
+
+	scandesc = systable_beginscan(relation,
+								  PublicationRelPrrelidPrpubidIndexId,
+								  true,
+								  NULL,
+								  1,
+								  key);
+
+	while (HeapTupleIsValid(tup = systable_getnext(scandesc)))
+	{
+		Form_pg_publication_rel form;
+
+		form = (Form_pg_publication_rel) GETSTRUCT(tup);
+
+		elog(LOG, "GetRelationPublicationsRaw relid %d / tuple pub %d", relid, form->prpubid);
+
+		result = lappend_oid(result, form->prpubid);
+	}
+
+	systable_endscan(scandesc);
+
+	table_close(relation, AccessShareLock);
+
+	elog(LOG, "GetRelationPublicationsRaw relid %d / end", relid);
+
+	return result;
+}
+
 /*
  * Find or create entry in the relation schema cache.
  *
@@ -2003,11 +2051,15 @@ get_rel_sync_entry(PGOutputData *data, Relation relation)
 
 	elog(LOG, "get_rel_sync_entry relation %d replicate_valid %d", RelationGetRelid(relation), entry->replicate_valid);
 
+	/* force refresh of the entry for each change */
+	entry->replicate_valid = false;
+
 	/* Validate the entry */
 	if (!entry->replicate_valid)
 	{
 		Oid			schemaId = get_rel_namespace(relid);
-		List	   *pubids = GetRelationPublications(relid);
+		List	   *pubids = GetRelationPublicationsRaw(relid);
+		// List	   *pubids = GetRelationPublications(relid);
 
 		/*
 		 * We don't acquire a lock on the namespace system table as we build
-- 
2.41.0

0003-alternative-experimental-fix-lock.patch (text/x-patch)
From a95b5fe1d8243e16fdf5e2db5a773495f4641829 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Fri, 17 Nov 2023 11:20:50 +0100
Subject: [PATCH 3/3] alternative experimental fix - lock

---
 src/backend/commands/publicationcmds.c      |  2 +-
 src/backend/replication/pgoutput/pgoutput.c | 54 +--------------------
 2 files changed, 2 insertions(+), 54 deletions(-)

diff --git a/src/backend/commands/publicationcmds.c b/src/backend/commands/publicationcmds.c
index f4ba572697a..7a4c315183a 100644
--- a/src/backend/commands/publicationcmds.c
+++ b/src/backend/commands/publicationcmds.c
@@ -1575,7 +1575,7 @@ OpenTableList(List *tables)
 		/* Allow query cancel in case this takes a long time */
 		CHECK_FOR_INTERRUPTS();
 
-		rel = table_openrv(t->relation, ShareUpdateExclusiveLock);
+		rel = table_openrv(t->relation, ShareRowExclusiveLock);
 		myrelid = RelationGetRelid(rel);
 
 		/*
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index adcd5688045..09f53bea0a0 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -12,8 +12,6 @@
  */
 #include "postgres.h"
 
-#include "access/genam.h"
-#include "access/table.h"
 #include "access/tupconvert.h"
 #include "catalog/partition.h"
 #include "catalog/pg_publication.h"
@@ -31,7 +29,6 @@
 #include "replication/origin.h"
 #include "replication/pgoutput.h"
 #include "utils/builtins.h"
-#include "utils/fmgroids.h"
 #include "utils/inval.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -1961,51 +1958,6 @@ set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
 	MemoryContextSwitchTo(oldctx);
 }
 
-static List*
-GetRelationPublicationsRaw(Oid relid)
-{
-	List *result = NIL;
-	SysScanDesc scandesc;
-	Relation relation;
-	ScanKeyData key[1];
-	HeapTuple	tup;
-
-	elog(LOG, "GetRelationPublicationsRaw relid %d / start", relid);
-
-	relation = table_open(PublicationRelRelationId, AccessShareLock);
-
-	ScanKeyInit(&key[0],
-				Anum_pg_publication_rel_prrelid,
-				BTEqualStrategyNumber, F_OIDEQ,
-				ObjectIdGetDatum(relid));
-
-	scandesc = systable_beginscan(relation,
-								  PublicationRelPrrelidPrpubidIndexId,
-								  true,
-								  NULL,
-								  1,
-								  key);
-
-	while (HeapTupleIsValid(tup = systable_getnext(scandesc)))
-	{
-		Form_pg_publication_rel form;
-
-		form = (Form_pg_publication_rel) GETSTRUCT(tup);
-
-		elog(LOG, "GetRelationPublicationsRaw relid %d / tuple pub %d", relid, form->prpubid);
-
-		result = lappend_oid(result, form->prpubid);
-	}
-
-	systable_endscan(scandesc);
-
-	table_close(relation, AccessShareLock);
-
-	elog(LOG, "GetRelationPublicationsRaw relid %d / end", relid);
-
-	return result;
-}
-
 /*
  * Find or create entry in the relation schema cache.
  *
@@ -2051,15 +2003,11 @@ get_rel_sync_entry(PGOutputData *data, Relation relation)
 
 	elog(LOG, "get_rel_sync_entry relation %d replicate_valid %d", RelationGetRelid(relation), entry->replicate_valid);
 
-	/* force refresh of the entry for each change */
-	entry->replicate_valid = false;
-
 	/* Validate the entry */
 	if (!entry->replicate_valid)
 	{
 		Oid			schemaId = get_rel_namespace(relid);
-		List	   *pubids = GetRelationPublicationsRaw(relid);
-		// List	   *pubids = GetRelationPublications(relid);
+		List	   *pubids = GetRelationPublications(relid);
 
 		/*
 		 * We don't acquire a lock on the namespace system table as we build
-- 
2.41.0

crash-test.sh (application/x-shellscript)
run.sh (application/x-shellscript)
#2 Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#1)
Re: long-standing data loss bug in initial sync of logical replication

Hi,

On 2023-11-17 15:36:25 +0100, Tomas Vondra wrote:

It seems there's a long-standing data loss issue related to the initial
sync of tables in the built-in logical replication (publications etc.).

:(

Overall, this looks, walks and quacks like a cache invalidation issue,
likely a missing invalidation somewhere in the ALTER PUBLICATION code.

It could also be that pgoutput doesn't have sufficient invalidation
handling.

One thing that looks bogus on the DDL side is how the invalidation handling
interacts with locking.

For tables etc the invalidation handling works because we hold a lock on the
relation before modifying the catalog and don't release that lock until
transaction end. That part is crucial: We queue shared invalidations at
transaction commit, *after* the transaction is marked as visible, but *before*
locks are released. That guarantees that any backend processing invalidations
will see the new contents. However, if the lock on the modified object is
released before transaction commit, other backends can build and use a cache
entry that hasn't processed invalidations (invalidations are processed when
acquiring locks).

While there is such an object for publications, it seems to be acquired too
late to actually do much good in a number of paths. And not at all in others.

E.g.:

    pubform = (Form_pg_publication) GETSTRUCT(tup);

    /*
     * If the publication doesn't publish changes via the root partitioned
     * table, the partition's row filter and column list will be used. So
     * disallow using WHERE clause and column lists on partitioned table in
     * this case.
     */
    if (!pubform->puballtables && publish_via_partition_root_given &&
        !publish_via_partition_root)
    {
        /*
         * Lock the publication so nobody else can do anything with it. This
         * prevents concurrent alter to add partitioned table(s) with WHERE
         * clause(s) and/or column lists which we don't allow when not
         * publishing via root.
         */
        LockDatabaseObject(PublicationRelationId, pubform->oid, 0,
                           AccessShareLock);

a) Another session could have modified the publication and made puballtables out-of-date
b) The LockDatabaseObject() uses AccessShareLock, so others can get past this
point as well

b) seems like a copy-paste bug or such?

I don't see any locking of the publication around RemovePublicationRelById(),
for example.

I might just be misunderstanding things the way publication locking is
intended to work.

However, while randomly poking at different things, I realized that if I
change the lock obtained on the relation in OpenTableList() from
ShareUpdateExclusiveLock to ShareRowExclusiveLock, the issue goes away.

That's odd. There's cases where changing the lock level can cause invalidation
processing to happen because there is no pre-existing lock for the "new" lock
level, but there was for the old. But OpenTableList() is used when altering
the publications, so I don't see how that connects.

Greetings,

Andres Freund

#3 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#2)
Re: long-standing data loss bug in initial sync of logical replication

Hi,

On 2023-11-17 17:54:43 -0800, Andres Freund wrote:

On 2023-11-17 15:36:25 +0100, Tomas Vondra wrote:

Overall, this looks, walks and quacks like a cache invalidation issue,
likely a missing invalidation somewhere in the ALTER PUBLICATION code.

I can confirm that something is broken with invalidation handling.

To test this I just used pg_recvlogical to stdout. It's just interesting
whether something arrives, that's easy to discern even with binary output.

CREATE PUBLICATION pb;
src/bin/pg_basebackup/pg_recvlogical --plugin=pgoutput --start --slot test -d postgres -o proto_version=4 -o publication_names=pb -o messages=true -f -

S1: CREATE TABLE d(data text not null);
S1: INSERT INTO d VALUES('d1');
S2: BEGIN; INSERT INTO d VALUES('d2');
S1: ALTER PUBLICATION pb ADD TABLE d;
S2: COMMIT
S2: INSERT INTO d VALUES('d3');
S1: INSERT INTO d VALUES('d4');
RL: <nothing>

Without the 'd2' insert in an in-progress transaction, pgoutput *does* react
to the ALTER PUBLICATION.

I think the problem here is insufficient locking. The ALTER PUBLICATION pb ADD
TABLE d basically modifies the catalog state of 'd', without a lock preventing
other sessions from having a valid cache entry that they could continue to
use. Due to this, decoding S2's transaction, which started before S1's
ALTER committed, will populate the cache entry with the state as of the
time of S1's last action, i.e. no need to output the change.

The reason this can happen is because OpenTableList() uses
ShareUpdateExclusiveLock. That allows the ALTER PUBLICATION to happen while
there's an ongoing INSERT.

I think this isn't just a logical decoding issue. S2's cache state just after
the ALTER PUBLICATION is going to be wrong - the table is already locked,
therefore further operations on the table don't trigger cache invalidation
processing - but the catalog state *has* changed. It's a bigger problem for
logical decoding though, as it's a bit more lazy about invalidation processing
than normal transactions, allowing the problem to persist for longer.

I guess it's not really feasible to just increase the lock level here though
:(. The use of ShareUpdateExclusiveLock isn't new, and suddenly using AEL
would perhaps lead to new deadlocks and such? But it also seems quite wrong.

We could brute force this in the logical decoding infrastructure, by
distributing invalidations from catalog modifying transactions to all
concurrent in-progress transactions (like already done for historic catalog
snapshot, c.f. SnapBuildDistributeNewCatalogSnapshot()). But I think that'd
be a fairly significant increase in overhead.

Greetings,

Andres Freund

#4 Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Andres Freund (#2)
Re: long-standing data loss bug in initial sync of logical replication

On 11/18/23 02:54, Andres Freund wrote:

Hi,

On 2023-11-17 15:36:25 +0100, Tomas Vondra wrote:

It seems there's a long-standing data loss issue related to the initial
sync of tables in the built-in logical replication (publications etc.).

:(

Yeah :-(

Overall, this looks, walks and quacks like a cache invalidation issue,
likely a missing invalidation somewhere in the ALTER PUBLICATION code.

It could also be that pgoutput doesn't have sufficient invalidation
handling.

I'm not sure about the details, but it can't be just about pgoutput
failing to react to some syscache invalidation. As described, just
resetting the RelationSyncEntry doesn't fix the issue - it's the
syscache that's not invalidated, IMO. But maybe that's what you mean.

One thing that looks bogus on the DDL side is how the invalidation handling
interacts with locking.

For tables etc the invalidation handling works because we hold a lock on the
relation before modifying the catalog and don't release that lock until
transaction end. That part is crucial: We queue shared invalidations at
transaction commit, *after* the transaction is marked as visible, but *before*
locks are released. That guarantees that any backend processing invalidations
will see the new contents. However, if the lock on the modified object is
released before transaction commit, other backends can build and use a cache
entry that hasn't processed invalidations (invalidations are processed when
acquiring locks).

Right.

While there is such an object for publications, it seems to be acquired too
late to actually do much good in a number of paths. And not at all in others.

E.g.:

    pubform = (Form_pg_publication) GETSTRUCT(tup);

    /*
     * If the publication doesn't publish changes via the root partitioned
     * table, the partition's row filter and column list will be used. So
     * disallow using WHERE clause and column lists on partitioned table in
     * this case.
     */
    if (!pubform->puballtables && publish_via_partition_root_given &&
        !publish_via_partition_root)
    {
        /*
         * Lock the publication so nobody else can do anything with it. This
         * prevents concurrent alter to add partitioned table(s) with WHERE
         * clause(s) and/or column lists which we don't allow when not
         * publishing via root.
         */
        LockDatabaseObject(PublicationRelationId, pubform->oid, 0,
                           AccessShareLock);

a) Another session could have modified the publication and made puballtables out-of-date
b) The LockDatabaseObject() uses AccessShareLock, so others can get past this
point as well

b) seems like a copy-paste bug or such?

I don't see any locking of the publication around RemovePublicationRelById(),
for example.

I might just be misunderstanding things the way publication locking is
intended to work.

I've been asking similar questions while investigating this, but the
interactions with logical decoding (which kinda happens concurrently in
terms of WAL, but not concurrently in terms of time), historical
snapshots etc. make my head spin.

However, while randomly poking at different things, I realized that if I
change the lock obtained on the relation in OpenTableList() from
ShareUpdateExclusiveLock to ShareRowExclusiveLock, the issue goes away.

That's odd. There's cases where changing the lock level can cause invalidation
processing to happen because there is no pre-existing lock for the "new" lock
level, but there was for the old. But OpenTableList() is used when altering
the publications, so I don't see how that connects.

Yeah, I had the idea that maybe the transaction already holds the lock
on the table, and changing this to ShareRowExclusiveLock makes it
different, possibly triggering a new invalidation or something. But I
did check with gdb, and if I set a breakpoint at OpenTableList, there
are no locks on the table.

But the effect is hard to deny - if I run the test 100 times, with the
ShareUpdateExclusiveLock I get maybe 80 failures. After changing it to
ShareRowExclusiveLock I get 0. Sure, there's some randomness for cases
like this, but this is pretty unlikely.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#5 Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Andres Freund (#3)
Re: long-standing data loss bug in initial sync of logical replication

On 11/18/23 03:54, Andres Freund wrote:

Hi,

On 2023-11-17 17:54:43 -0800, Andres Freund wrote:

On 2023-11-17 15:36:25 +0100, Tomas Vondra wrote:

Overall, this looks, walks and quacks like a cache invalidation issue,
likely a missing invalidation somewhere in the ALTER PUBLICATION code.

I can confirm that something is broken with invalidation handling.

To test this I just used pg_recvlogical to stdout. It's just interesting
whether something arrives, that's easy to discern even with binary output.

CREATE PUBLICATION pb;
src/bin/pg_basebackup/pg_recvlogical --plugin=pgoutput --start --slot test -d postgres -o proto_version=4 -o publication_names=pb -o messages=true -f -

S1: CREATE TABLE d(data text not null);
S1: INSERT INTO d VALUES('d1');
S2: BEGIN; INSERT INTO d VALUES('d2');
S1: ALTER PUBLICATION pb ADD TABLE d;
S2: COMMIT
S2: INSERT INTO d VALUES('d3');
S1: INSERT INTO d VALUES('d4');
RL: <nothing>

Without the 'd2' insert in an in-progress transaction, pgoutput *does* react
to the ALTER PUBLICATION.

I think the problem here is insufficient locking. The ALTER PUBLICATION pb ADD
TABLE d basically modifies the catalog state of 'd', without a lock preventing
other sessions from having a valid cache entry that they could continue to
use. Due to this, decoding S2's transaction, which started before S1's
ALTER committed, will populate the cache entry with the state as of the
time of S1's last action, i.e. no need to output the change.

The reason this can happen is because OpenTableList() uses
ShareUpdateExclusiveLock. That allows the ALTER PUBLICATION to happen while
there's an ongoing INSERT.

I guess this would also explain why changing the lock mode from
ShareUpdateExclusiveLock to ShareRowExclusiveLock changes the behavior.
INSERT acquires RowExclusiveLock, which conflicts with the latter but
not with the former.

I think this isn't just a logical decoding issue. S2's cache state just after
the ALTER PUBLICATION is going to be wrong - the table is already locked,
therefore further operations on the table don't trigger cache invalidation
processing - but the catalog state *has* changed. It's a bigger problem for
logical decoding though, as it's a bit more lazy about invalidation processing
than normal transactions, allowing the problem to persist for longer.

Yeah. I'm wondering if there's some other operation acquiring a lock
weaker than RowExclusiveLock that might be affected by this. Because
then we'd need to get an even stronger lock ...

I guess it's not really feasible to just increase the lock level here though
:(. The use of ShareUpdateExclusiveLock isn't new, and suddenly using AEL
would perhaps lead to new deadlocks and such? But it also seems quite wrong.

If this really is about the lock being too weak, then I don't see why
would it be wrong? If it's required for correctness, it's not really
wrong, IMO. Sure, stronger locks are not great ...

I'm not sure about the risk of deadlocks. If you do

ALTER PUBLICATION ... ADD TABLE

it's not holding many other locks. It essentially just gets a lock on
the pg_publication catalog, and then on the publication row. That's it.
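
(One way to check is to leave the ALTER uncommitted and look at pg_locks
from another session - just a sketch, with placeholder names, and the
exact set of rows depends on the version:)

-- session 1
BEGIN;
ALTER PUBLICATION p ADD TABLE t;

-- session 2: what is session 1 holding?
SELECT locktype, relation::regclass, classid::regclass, objid, mode
  FROM pg_locks
 WHERE pid <> pg_backend_pid();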

If we increase the locks from ShareUpdateExclusive to ShareRowExclusive,
we're making it conflict with RowExclusive. Which is just DML, and I
think we need to do that.

So maybe that's fine? For me, a detected deadlock is better than
silently missing some of the data.

We could brute force this in the logical decoding infrastructure, by
distributing invalidations from catalog modifying transactions to all
concurrent in-progress transactions (like already done for historic catalog
snapshot, c.f. SnapBuildDistributeNewCatalogSnapshot()). But I think that'd
be a fairly significant increase in overhead.

I have no idea what the overhead would be - perhaps not too bad,
considering catalog changes are not too common (I'm sure there are
extreme cases). And maybe we could even restrict this only to
"interesting" catalogs, or something like that? (However I hate those
weird differences in behavior, it can easily lead to bugs.)

But it feels more like a band-aid than actually fixing the issue.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#6 Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#5)
Re: long-standing data loss bug in initial sync of logical replication

Hi,

On 2023-11-18 11:56:47 +0100, Tomas Vondra wrote:

I guess it's not really feasible to just increase the lock level here though
:(. The use of ShareUpdateExclusiveLock isn't new, and suddenly using AEL
would perhaps lead to new deadlocks and such? But it also seems quite wrong.

If this really is about the lock being too weak, then I don't see why
would it be wrong?

Sorry, that was badly formulated. The wrong bit is the use of
ShareUpdateExclusiveLock.

If it's required for correctness, it's not really wrong, IMO. Sure, stronger
locks are not great ...

I'm not sure about the risk of deadlocks. If you do

ALTER PUBLICATION ... ADD TABLE

it's not holding many other locks. It essentially just gets a lock on
the pg_publication catalog, and then on the publication row. That's it.

If we increase the locks from ShareUpdateExclusive to ShareRowExclusive,
we're making it conflict with RowExclusive. Which is just DML, and I
think we need to do that.

From what I can tell it needs to be an AccessExclusiveLock. Completely
independent of logical decoding. The way the cache stays coherent is catalog
modifications conflicting with anything that builds cache entries. We have a
few cases where we do use lower level locks, but for those we have explicit
analysis for why that's ok (see e.g. reloptions.c) or we block until nobody
could have an old view of the catalog (various CONCURRENTLY operations).

So maybe that's fine? For me, a detected deadlock is better than
silently missing some of the data.

That certainly is true.

We could brute force this in the logical decoding infrastructure, by
distributing invalidations from catalog modifying transactions to all
concurrent in-progress transactions (like already done for historic catalog
snapshot, c.f. SnapBuildDistributeNewCatalogSnapshot()). But I think that'd
be a fairly significant increase in overhead.

I have no idea what the overhead would be - perhaps not too bad,
considering catalog changes are not too common (I'm sure there are
extreme cases). And maybe we could even restrict this only to
"interesting" catalogs, or something like that? (However I hate those
weird differences in behavior, it can easily lead to bugs.)

But it feels more like a band-aid than actually fixing the issue.

Agreed.

Greetings,

Andres Freund

#7 Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Andres Freund (#6)
Re: long-standing data loss bug in initial sync of logical replication

On 11/18/23 19:12, Andres Freund wrote:

Hi,

On 2023-11-18 11:56:47 +0100, Tomas Vondra wrote:

I guess it's not really feasible to just increase the lock level here though
:(. The use of ShareUpdateExclusiveLock isn't new, and suddenly using AEL
would perhaps lead to new deadlocks and such? But it also seems quite wrong.

If this really is about the lock being too weak, then I don't see why
would it be wrong?

Sorry, that was badly formulated. The wrong bit is the use of
ShareUpdateExclusiveLock.

Ah, you meant the current lock mode seems wrong, not that changing the
locks seems wrong. Yeah, true.

If it's required for correctness, it's not really wrong, IMO. Sure, stronger
locks are not great ...

I'm not sure about the risk of deadlocks. If you do

ALTER PUBLICATION ... ADD TABLE

it's not holding many other locks. It essentially just gets a lock on
the pg_publication catalog, and then on the publication row. That's it.

If we increase the locks from ShareUpdateExclusive to ShareRowExclusive,
we're making it conflict with RowExclusive. Which is just DML, and I
think we need to do that.

From what I can tell it needs to be an AccessExclusiveLock. Completely
independent of logical decoding. The way the cache stays coherent is catalog
modifications conflicting with anything that builds cache entries. We have a
few cases where we do use lower level locks, but for those we have explicit
analysis for why that's ok (see e.g. reloptions.c) or we block until nobody
could have an old view of the catalog (various CONCURRENTLY operations).

Yeah, I got too focused on the issue I triggered, which seems to be
fixed by using SRE (still don't understand why ...). But you're probably
right there may be other cases where SRE would not be sufficient, I
certainly can't prove it'd be safe.

So maybe that's fine? For me, a detected deadlock is better than
silently missing some of the data.

That certainly is true.

We could brute force this in the logical decoding infrastructure, by
distributing invalidations from catalog modifying transactions to all
concurrent in-progress transactions (like already done for historic catalog
snapshot, c.f. SnapBuildDistributeNewCatalogSnapshot()). But I think that'd
be a fairly significant increase in overhead.

I have no idea what the overhead would be - perhaps not too bad,
considering catalog changes are not too common (I'm sure there are
extreme cases). And maybe we could even restrict this only to
"interesting" catalogs, or something like that? (However I hate those
weird differences in behavior, it can easily lead to bugs.)

But it feels more like a band-aid than actually fixing the issue.

Agreed.

... and it would not fix the other places outside logical decoding.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#8 Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#7)
Re: long-standing data loss bug in initial sync of logical replication

Hi,

On 2023-11-18 21:45:35 +0100, Tomas Vondra wrote:

On 11/18/23 19:12, Andres Freund wrote:

If we increase the locks from ShareUpdateExclusive to ShareRowExclusive,
we're making it conflict with RowExclusive. Which is just DML, and I
think we need to do that.

From what I can tell it needs to be an AccessExclusiveLock. Completely
independent of logical decoding. The way the cache stays coherent is catalog
modifications conflicting with anything that builds cache entries. We have a
few cases where we do use lower level locks, but for those we have explicit
analysis for why that's ok (see e.g. reloptions.c) or we block until nobody
could have an old view of the catalog (various CONCURRENTLY operations).

Yeah, I got too focused on the issue I triggered, which seems to be
fixed by using SRE (still don't understand why ...). But you're probably
right there may be other cases where SRE would not be sufficient, I
certainly can't prove it'd be safe.

I think it makes sense here: SRE prevents the problematic "scheduling" in your
test - with SRE no DML started before ALTER PUB ... ADD can commit after.

I'm not sure there are any cases where using SRE instead of AE would cause
problems for logical decoding, but it seems very hard to prove. I'd be very
surprised if just using SRE would not lead to corrupted cache contents in some
situations. The cases where a lower lock level is ok are ones where we just
don't care that the cache is coherent in that moment.

In a way, the logical decoding cache-invalidation situation is a lot more
atomic than the "normal" situation. During normal operation locking is
strictly required to prevent incoherent states when building a cache entry
after a transaction committed, but before the sinval entries have been
queued. But in the logical decoding case that window doesn't exist.

Greetings,

Andres Freund

#9 Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Andres Freund (#8)
Re: long-standing data loss bug in initial sync of logical replication

On 11/18/23 22:05, Andres Freund wrote:

Hi,

On 2023-11-18 21:45:35 +0100, Tomas Vondra wrote:

On 11/18/23 19:12, Andres Freund wrote:

If we increase the locks from ShareUpdateExclusive to ShareRowExclusive,
we're making it conflict with RowExclusive. Which is just DML, and I
think we need to do that.

From what I can tell it needs to be an AccessExclusiveLock. Completely
independent of logical decoding. The way the cache stays coherent is catalog
modifications conflicting with anything that builds cache entries. We have a
few cases where we do use lower level locks, but for those we have explicit
analysis for why that's ok (see e.g. reloptions.c) or we block until nobody
could have an old view of the catalog (various CONCURRENTLY operations).

Yeah, I got too focused on the issue I triggered, which seems to be
fixed by using SRE (still don't understand why ...). But you're probably
right there may be other cases where SRE would not be sufficient, I
certainly can't prove it'd be safe.

I think it makes sense here: SRE prevents the problematic "scheduling" in your
test - with SRE no DML started before ALTER PUB ... ADD can commit after.

If I understand correctly, with the current code (which only gets
ShareUpdateExclusiveLock), we may end up in a situation like this
(sessions A and B):

A: starts "ALTER PUBLICATION p ADD TABLE t" and gets the SUE lock
A: writes the invalidation message(s) into WAL
B: inserts into table "t"
B: commit
A: commit

With the stronger SRE lock, the commits would have to happen in the
opposite order, because as you say it prevents the bad ordering.

But why would this matter for logical decoding? We accumulate the
invalidations and execute them at transaction commit, or did I miss
something?

So what I think should happen is we get to apply B first, which won't
see the table as part of the publication. It might even build the cache
entries (syscache+relsync), reflecting that. But then we get to execute
A, along with all the invalidations, and that should invalidate them.

I'm clearly missing something, because the SRE does change the behavior,
so there has to be a difference (and by my reasoning it shouldn't be).

Or maybe it's the other way around? Won't B get the invalidation, but
use a historical snapshot that doesn't yet see the table in publication?

I'm not sure there are any cases where using SRE instead of AE would cause
problems for logical decoding, but it seems very hard to prove. I'd be very
surprised if just using SRE would not lead to corrupted cache contents in some
situations. The cases where a lower lock level is ok are ones where we just
don't care that the cache is coherent in that moment.

Are you saying it might break cases that are not corrupted now? How
could obtaining a stronger lock have such effect?

In a way, the logical decoding cache-invalidation situation is a lot more
atomic than the "normal" situation. During normal operation locking is
strictly required to prevent incoherent states when building a cache entry
after a transaction committed, but before the sinval entries have been
queued. But in the logical decoding case that window doesn't exist.

Because we apply the invalidations at commit time, so it happens as a
single operation that can't interleave with other sessions?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#10 Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#9)
Re: long-standing data loss bug in initial sync of logical replication

On 2023-11-19 02:15:33 +0100, Tomas Vondra wrote:

On 11/18/23 22:05, Andres Freund wrote:

Hi,

On 2023-11-18 21:45:35 +0100, Tomas Vondra wrote:

On 11/18/23 19:12, Andres Freund wrote:

If we increase the locks from ShareUpdateExclusive to ShareRowExclusive,
we're making it conflict with RowExclusive. Which is just DML, and I
think we need to do that.

From what I can tell it needs to be an AccessExclusiveLock. Completely
independent of logical decoding. The way the cache stays coherent is catalog
modifications conflicting with anything that builds cache entries. We have a
few cases where we do use lower level locks, but for those we have explicit
analysis for why that's ok (see e.g. reloptions.c) or we block until nobody
could have an old view of the catalog (various CONCURRENTLY operations).

Yeah, I got too focused on the issue I triggered, which seems to be
fixed by using SRE (still don't understand why ...). But you're probably
right there may be other cases where SRE would not be sufficient, I
certainly can't prove it'd be safe.

I think it makes sense here: SRE prevents the problematic "scheduling" in your
test - with SRE no DML started before ALTER PUB ... ADD can commit after.

If I understand correctly, with the current code (which only gets
ShareUpdateExclusiveLock), we may end up in a situation like this
(sessions A and B):

A: starts "ALTER PUBLICATION p ADD TABLE t" and gets the SUE lock
A: writes the invalidation message(s) into WAL
B: inserts into table "t"
B: commit
A: commit

I don't think this is the problematic sequence - at least it's not what I had
reproed in
/messages/by-id/20231118025445.crhaeeuvoe2g5dv6@awork3.anarazel.de

Adding line numbers:

1) S1: CREATE TABLE d(data text not null);
2) S1: INSERT INTO d VALUES('d1');
3) S2: BEGIN; INSERT INTO d VALUES('d2');
4) S1: ALTER PUBLICATION pb ADD TABLE d;
5) S2: COMMIT
6) S2: INSERT INTO d VALUES('d3');
7) S1: INSERT INTO d VALUES('d4');
8) RL: <nothing>

The problem with the sequence is that the insert from 3) is decoded *after* 4)
and that to decode the insert (which happened before the ALTER) the catalog
snapshot and cache state is from *before* the ALTER PUBLICATION. Because the
transaction started in 3) doesn't actually modify any catalogs, no
invalidations are executed after decoding it. The result is that the cache
looks like it did at 3), not like after 4). Undesirable timetravel...

It's worth noting that here the cache state is briefly correct, after 4), it's
just that after 5) it stays the old state.

If 4) instead uses a SRE lock, then S1 will be blocked until S2 commits, and
everything is fine.

I'm not sure there are any cases where using SRE instead of AE would cause
problems for logical decoding, but it seems very hard to prove. I'd be very
surprised if just using SRE would not lead to corrupted cache contents in some
situations. The cases where a lower lock level is ok are ones where we just
don't care that the cache is coherent in that moment.

Are you saying it might break cases that are not corrupted now? How
could obtaining a stronger lock have such effect?

No, I mean that I don't know if using SRE instead of AE would have negative
consequences for logical decoding. I.e. whether, from a logical decoding POV,
it'd suffice to increase the lock level to just SRE instead of AE.

Since I don't see how it'd be correct otherwise, it's kind of a moot question.

In a way, the logical decoding cache-invalidation situation is a lot more
atomic than the "normal" situation. During normal operation locking is
strictly required to prevent incoherent states when building a cache entry
after a transaction committed, but before the sinval entries have been
queued. But in the logical decoding case that window doesn't exist.

Because we apply the invalidations at commit time, so it happens as a
single operation that can't interleave with other sessions?

Yea, the situation is much simpler during logical decoding than "originally" -
there's no concurrency.

Greetings,

Andres Freund

#11 Vadim Lakt
vadim.lakt@gmail.com
In reply to: Andres Freund (#10)
1 attachment(s)
Re: long-standing data loss bug in initial sync of logical replication

Hi,

On 19.11.2023 09:18, Andres Freund wrote:

Yea, the situation is much simpler during logical decoding than "originally" -
there's no concurrency.

Greetings,

Andres Freund

We've encountered a similar error on one of our production servers.

The case: after adding a table to logical replication, the table
initialization proceeds normally, but new data from the publisher's
table does not appear on the subscriber. Right after we added the
table, we checked and saw that the data was present on the subscriber
and everything looked normal; we only discovered the error some time
later. I have attached scripts to the email.

The patch from the first message also solves this problem.

--
Best regards,
Vadim Lakt

Attachments:

err_logical_replication.zip (application/zip)
#12 Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#10)
Re: long-standing data loss bug in initial sync of logical replication

On Sun, Nov 19, 2023 at 7:48 AM Andres Freund <andres@anarazel.de> wrote:

On 2023-11-19 02:15:33 +0100, Tomas Vondra wrote:

If I understand correctly, with the current code (which only gets
ShareUpdateExclusiveLock), we may end up in a situation like this
(sessions A and B):

A: starts "ALTER PUBLICATION p ADD TABLE t" and gets the SUE lock
A: writes the invalidation message(s) into WAL
B: inserts into table "t"
B: commit
A: commit

I don't think this the problematic sequence - at least it's not what I had
reproed in
/messages/by-id/20231118025445.crhaeeuvoe2g5dv6@awork3.anarazel.de

Adding line numbers:

1) S1: CREATE TABLE d(data text not null);
2) S1: INSERT INTO d VALUES('d1');
3) S2: BEGIN; INSERT INTO d VALUES('d2');
4) S1: ALTER PUBLICATION pb ADD TABLE d;
5) S2: COMMIT
6) S2: INSERT INTO d VALUES('d3');
7) S1: INSERT INTO d VALUES('d4');
8) RL: <nothing>

The problem with the sequence is that the insert from 3) is decoded *after* 4)
and that to decode the insert (which happened before the ALTER) the catalog
snapshot and cache state is from *before* the ALTER PUBLICATION. Because the
transaction started in 3) doesn't actually modify any catalogs, no
invalidations are executed after decoding it. The result is that the cache
looks like it did at 3), not like after 4). Undesirable timetravel...

It's worth noting that here the cache state is briefly correct, after 4), it's
just that after 5) it stays the old state.

If 4) instead uses a SRE lock, then S1 will be blocked until S2 commits, and
everything is fine.

I agree, your analysis looks right to me.

I'm not sure there are any cases where using SRE instead of AE would cause
problems for logical decoding, but it seems very hard to prove. I'd be very
surprised if just using SRE would not lead to corrupted cache contents in some
situations. The cases where a lower lock level is ok are ones where we just
don't care that the cache is coherent in that moment.

Are you saying it might break cases that are not corrupted now? How
could obtaining a stronger lock have such effect?

No, I mean that I don't know if using SRE instead of AE would have negative
consequences for logical decoding. I.e. whether, from a logical decoding POV,
it'd suffice to increase the lock level to just SRE instead of AE.

Since I don't see how it'd be correct otherwise, it's kind of a moot question.

We lost track of this thread and the bug is still open. IIUC, the
conclusion is to use SRE in OpenTableList() to fix the reported issue.
Andres, Tomas, please let me know if my understanding is wrong,
otherwise, let's proceed and fix this issue.

--
With Regards,
Amit Kapila.

#13 Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Amit Kapila (#12)
Re: long-standing data loss bug in initial sync of logical replication

On 6/24/24 12:54, Amit Kapila wrote:

...

I'm not sure there are any cases where using SRE instead of AE would cause
problems for logical decoding, but it seems very hard to prove. I'd be very
surprised if just using SRE would not lead to corrupted cache contents in some
situations. The cases where a lower lock level is ok are ones where we just
don't care that the cache is coherent in that moment.

Are you saying it might break cases that are not corrupted now? How
could obtaining a stronger lock have such effect?

No, I mean that I don't know if using SRE instead of AE would have negative
consequences for logical decoding. I.e. whether, from a logical decoding POV,
it'd suffice to increase the lock level to just SRE instead of AE.

Since I don't see how it'd be correct otherwise, it's kind of a moot question.

We lost track of this thread and the bug is still open. IIUC, the
conclusion is to use SRE in OpenTableList() to fix the reported issue.
Andres, Tomas, please let me know if my understanding is wrong,
otherwise, let's proceed and fix this issue.

It's in the commitfest [https://commitfest.postgresql.org/48/4766/] so I
don't think we 'lost track' of it, but it's true we haven't made much
progress recently.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#14 Amit Kapila
amit.kapila16@gmail.com
In reply to: Tomas Vondra (#13)
Re: long-standing data loss bug in initial sync of logical replication

On Mon, Jun 24, 2024 at 8:06 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 6/24/24 12:54, Amit Kapila wrote:

...

I'm not sure there are any cases where using SRE instead of AE would cause
problems for logical decoding, but it seems very hard to prove. I'd be very
surprised if just using SRE would not lead to corrupted cache contents in some
situations. The cases where a lower lock level is ok are ones where we just
don't care that the cache is coherent in that moment.

Are you saying it might break cases that are not corrupted now? How
could obtaining a stronger lock have such effect?

No, I mean that I don't know if using SRE instead of AE would have negative
consequences for logical decoding. I.e. whether, from a logical decoding POV,
it'd suffice to increase the lock level to just SRE instead of AE.

Since I don't see how it'd be correct otherwise, it's kind of a moot question.

We lost track of this thread and the bug is still open. IIUC, the
conclusion is to use SRE in OpenTableList() to fix the reported issue.
Andres, Tomas, please let me know if my understanding is wrong,
otherwise, let's proceed and fix this issue.

It's in the commitfest [https://commitfest.postgresql.org/48/4766/] so I
don't think we 'lost track' of it, but it's true we haven't made much
progress recently.

Okay, thanks for pointing to the CF entry. Would you like to take care
of this? Are you seeing anything more than the simple fix to use SRE
in OpenTableList()?

--
With Regards,
Amit Kapila.

#15Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Amit Kapila (#14)
Re: long-standing data loss bug in initial sync of logical replication

On 6/25/24 07:04, Amit Kapila wrote:

On Mon, Jun 24, 2024 at 8:06 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 6/24/24 12:54, Amit Kapila wrote:

...

I'm not sure there are any cases where using SRE instead of AE would cause
problems for logical decoding, but it seems very hard to prove. I'd be very
surprised if just using SRE would not lead to corrupted cache contents in some
situations. The cases where a lower lock level is ok are ones where we just
don't care that the cache is coherent in that moment.

Are you saying it might break cases that are not corrupted now? How
could obtaining a stronger lock have such effect?

No, I mean that I don't know if using SRE instead of AE would have negative
consequences for logical decoding. I.e. whether, from a logical decoding POV,
it'd suffice to increase the lock level to just SRE instead of AE.

Since I don't see how it'd be correct otherwise, it's kind of a moot question.

We lost track of this thread and the bug is still open. IIUC, the
conclusion is to use SRE in OpenTableList() to fix the reported issue.
Andres, Tomas, please let me know if my understanding is wrong,
otherwise, let's proceed and fix this issue.

It's in the commitfest [https://commitfest.postgresql.org/48/4766/] so I
don't think we 'lost track' of it, but it's true we haven't made much
progress recently.

Okay, thanks for pointing to the CF entry. Would you like to take care
of this? Are you seeing anything more than the simple fix to use SRE
in OpenTableList()?

I did not find a simpler fix than adding the SRE, and I think pretty
much any other fix is guaranteed to be more complex. I don't remember
all the details without relearning them, but IIRC the main
challenge for me was to convince myself it's a sufficient and reliable
fix (and not working simply by chance).

I won't have time to look into this anytime soon, so feel free to take
care of this and push the fix.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#16Amit Kapila
amit.kapila16@gmail.com
In reply to: Tomas Vondra (#15)
Re: long-standing data loss bug in initial sync of logical replication

On Wed, Jun 26, 2024 at 4:57 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 6/25/24 07:04, Amit Kapila wrote:

On Mon, Jun 24, 2024 at 8:06 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 6/24/24 12:54, Amit Kapila wrote:

...

I'm not sure there are any cases where using SRE instead of AE would cause
problems for logical decoding, but it seems very hard to prove. I'd be very
surprised if just using SRE would not lead to corrupted cache contents in some
situations. The cases where a lower lock level is ok are ones where we just
don't care that the cache is coherent in that moment.

Are you saying it might break cases that are not corrupted now? How
could obtaining a stronger lock have such effect?

No, I mean that I don't know if using SRE instead of AE would have negative
consequences for logical decoding. I.e. whether, from a logical decoding POV,
it'd suffice to increase the lock level to just SRE instead of AE.

Since I don't see how it'd be correct otherwise, it's kind of a moot question.

We lost track of this thread and the bug is still open. IIUC, the
conclusion is to use SRE in OpenTableList() to fix the reported issue.
Andres, Tomas, please let me know if my understanding is wrong,
otherwise, let's proceed and fix this issue.

It's in the commitfest [https://commitfest.postgresql.org/48/4766/] so I
don't think we 'lost track' of it, but it's true we haven't made much
progress recently.

Okay, thanks for pointing to the CF entry. Would you like to take care
of this? Are you seeing anything more than the simple fix to use SRE
in OpenTableList()?

I did not find a simpler fix than adding the SRE, and I think pretty
much any other fix is guaranteed to be more complex. I don't remember
all the details without relearning them, but IIRC the main
challenge for me was to convince myself it's a sufficient and reliable
fix (and not working simply by chance).

I won't have time to look into this anytime soon, so feel free to take
care of this and push the fix.

Okay, I'll take care of this.

--
With Regards,
Amit Kapila.

#17vignesh C
vignesh21@gmail.com
In reply to: Amit Kapila (#16)
3 attachment(s)
Re: long-standing data loss bug in initial sync of logical replication

On Thu, 27 Jun 2024 at 08:38, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jun 26, 2024 at 4:57 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 6/25/24 07:04, Amit Kapila wrote:

On Mon, Jun 24, 2024 at 8:06 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 6/24/24 12:54, Amit Kapila wrote:

...

I'm not sure there are any cases where using SRE instead of AE would cause
problems for logical decoding, but it seems very hard to prove. I'd be very
surprised if just using SRE would not lead to corrupted cache contents in some
situations. The cases where a lower lock level is ok are ones where we just
don't care that the cache is coherent in that moment.

Are you saying it might break cases that are not corrupted now? How
could obtaining a stronger lock have such effect?

No, I mean that I don't know if using SRE instead of AE would have negative
consequences for logical decoding. I.e. whether, from a logical decoding POV,
it'd suffice to increase the lock level to just SRE instead of AE.

Since I don't see how it'd be correct otherwise, it's kind of a moot question.

We lost track of this thread and the bug is still open. IIUC, the
conclusion is to use SRE in OpenTableList() to fix the reported issue.
Andres, Tomas, please let me know if my understanding is wrong,
otherwise, let's proceed and fix this issue.

It's in the commitfest [https://commitfest.postgresql.org/48/4766/] so I
don't think we 'lost track' of it, but it's true we haven't made much
progress recently.

Okay, thanks for pointing to the CF entry. Would you like to take care
of this? Are you seeing anything more than the simple fix to use SRE
in OpenTableList()?

I did not find a simpler fix than adding the SRE, and I think pretty
much any other fix is guaranteed to be more complex. I don't remember
all the details without relearning them, but IIRC the main
challenge for me was to convince myself it's a sufficient and reliable
fix (and not working simply by chance).

I won't have time to look into this anytime soon, so feel free to take
care of this and push the fix.

Okay, I'll take care of this.

This issue is present in all supported versions. I was able to
reproduce it using the steps recommended by Andres and Tomas's
scripts. I also conducted a small test through TAP tests to verify the
problem. Attached is the alternate_lock_HEAD.patch, which includes the
lock modification (Tomas's change) and the TAP test.
To reproduce the issue in the HEAD version, we cannot use the same
test as in the alternate_lock_HEAD patch because the behavior changes
slightly after the fix to wait for the lock until the open transaction
completes. The attached issue_reproduce_testcase_head.patch can be
used to reproduce the issue through TAP test in HEAD.
The changes made in the HEAD version do not directly apply to older
branches. For PG14, PG13, and PG12 branches, you can use the
alternate_lock_PG14.patch.

Regards,
Vignesh

Attachments:

alternate_lock_HEAD.patch (text/x-patch; charset=US-ASCII)
diff --git a/src/backend/commands/publicationcmds.c b/src/backend/commands/publicationcmds.c
index 6ea709988e..f4e5745909 100644
--- a/src/backend/commands/publicationcmds.c
+++ b/src/backend/commands/publicationcmds.c
@@ -1568,7 +1568,7 @@ OpenTableList(List *tables)
 		/* Allow query cancel in case this takes a long time */
 		CHECK_FOR_INTERRUPTS();
 
-		rel = table_openrv(t->relation, ShareUpdateExclusiveLock);
+		rel = table_openrv(t->relation, ShareRowExclusiveLock);
 		myrelid = RelationGetRelid(rel);
 
 		/*
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index cb36ca7b16..3316e57ff5 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -487,6 +487,78 @@ $result =
 is( $result, qq(2|f
 3|t), 'check replicated update on subscriber');
 
+# Incremental data synchronization skipped when a new table is added, if
+# there is a concurrent active transaction involving the same table.
+
+# Create table in publisher and subscriber.
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_conc(a int)");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_conc(a int)");
+
+# Bump the query timeout to avoid false negatives on slow test systems.
+my $psql_timeout_secs = 4 * $PostgreSQL::Test::Utils::timeout_default;
+
+# Initiate a background session that keeps a transaction active.
+my $background_psql1 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+
+# Maintain an active transaction with the table.
+$background_psql1->set_query_timer_restart();
+$background_psql1->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO tab_conc VALUES (1);
+]);
+
+# Add the table to the publication using background_psql, as the alter
+# publication operation will wait for the lock and can only be completed after
+# the previous open transaction is committed.
+my $background_psql2 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+
+$background_psql2->set_query_timer_restart();
+
+# This operation will wait because there is an open transaction holding a lock.
+$background_psql2->query_until(qr//,
+	"ALTER PUBLICATION pub1 ADD TABLE tab_conc;\n");
+
+# Verify that the table addition is waiting to acquire a ShareRowExclusiveLock.
+$node_publisher->poll_query_until('postgres',
+	"SELECT COUNT(1) = 1 FROM pg_locks WHERE relation = 'tab_conc'::regclass AND mode = 'ShareRowExclusiveLock' AND waitstart IS NOT NULL;"
+  )
+  or die
+  "Timed out while waiting for alter publication tries to wait on ShareRowExclusiveLock";
+
+# Complete the old transaction.
+$background_psql1->query_safe(qq[COMMIT]);
+$background_psql1->quit;
+
+$background_psql1->query_safe(qq[INSERT INTO tab_conc VALUES (2)]);
+$background_psql1->quit;
+
+# Refresh the publication.
+$node_subscriber->safe_psql('postgres',
+	'ALTER SUBSCRIPTION sub1 REFRESH PUBLICATION');
+
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'sub1');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2), 'Ensure that the data from the tab_conc table is synchronized to the subscriber after the subscription is refreshed');
+
+# Perform an insert.
+$node_publisher->safe_psql('postgres', "INSERT INTO tab_conc values(3)");
+$node_publisher->wait_for_catchup('sub1');
+
+# Verify that the insert is replicated to the subscriber.
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3), 'Verify that the incremental data added after table synchronization is replicated to the subscriber');
+
 $node_publisher->stop('fast');
 $node_subscriber->stop('fast');
 
issue_reproduce_testcase_head.patch (text/x-patch; charset=US-ASCII)
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index cb36ca7b16..9fde78e1b9 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -487,6 +487,72 @@ $result =
 is( $result, qq(2|f
 3|t), 'check replicated update on subscriber');
 
+# Incremental data synchronization skipped when a new table is added, if
+# there is a concurrent active transaction involving the same table.
+
+# Create table in publisher and subscriber.
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_conc(a int)");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_conc(a int)");
+
+# Bump the query timeout to avoid false negatives on slow test systems.
+my $psql_timeout_secs = 4 * $PostgreSQL::Test::Utils::timeout_default;
+
+# Initiate a background session that keeps a transaction active.
+my $background_psql1 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+
+# Maintain an active transaction with the table.
+$background_psql1->set_query_timer_restart();
+$background_psql1->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO tab_conc VALUES (1);
+]);
+
+# Add the table to the publication from background_psql
+my $background_psql2 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+
+$background_psql2->set_query_timer_restart();
+
+# This will wait as the open transaction holding a lock.
+$background_psql2->query_until(qr//, "ALTER PUBLICATION pub1 ADD TABLE tab_conc;\n");
+
+$node_publisher->poll_query_until('postgres',
+"SELECT COUNT(1) = 1 FROM pg_publication_rel WHERE prrelid = 'tab_conc'::regclass;"
+  )
+  or die
+  "Timed out while waiting for the table tab_conc is added to pg_publication_rel";
+
+# Complete the old transaction.
+$background_psql1->query_safe(qq[COMMIT]);
+$background_psql1->quit;
+
+$background_psql1->query_safe(qq[INSERT INTO tab_conc VALUES (2)]);
+$background_psql1->quit;
+
+# Refresh the publication
+$node_subscriber->safe_psql('postgres',
+	'ALTER SUBSCRIPTION sub1 REFRESH PUBLICATION');
+
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'sub1');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2), 'Ensure that the data from the tab_conc table is synchronized to the subscriber after the subscription is refreshed');
+
+$node_publisher->safe_psql('postgres', "INSERT INTO tab_conc values(3)");
+$node_publisher->wait_for_catchup('sub1');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3), 'Verify that the incremental data added after table synchronization is replicated to the subscriber');
+
 $node_publisher->stop('fast');
 $node_subscriber->stop('fast');
 
alternate_lock_PG14.patch (text/x-patch; charset=US-ASCII)
diff --git a/src/backend/commands/publicationcmds.c b/src/backend/commands/publicationcmds.c
index 7ee8825522..55e8cbfdc9 100644
--- a/src/backend/commands/publicationcmds.c
+++ b/src/backend/commands/publicationcmds.c
@@ -571,7 +571,7 @@ OpenTableList(List *tables)
 		/* Allow query cancel in case this takes a long time */
 		CHECK_FOR_INTERRUPTS();
 
-		rel = table_openrv(rv, ShareUpdateExclusiveLock);
+		rel = table_openrv(rv, ShareRowExclusiveLock);
 		myrelid = RelationGetRelid(rel);
 
 		/*
#18Amit Kapila
amit.kapila16@gmail.com
In reply to: vignesh C (#17)
Re: long-standing data loss bug in initial sync of logical replication

On Mon, Jul 1, 2024 at 10:51 AM vignesh C <vignesh21@gmail.com> wrote:

This issue is present in all supported versions. I was able to
reproduce it using the steps recommended by Andres and Tomas's
scripts. I also conducted a small test through TAP tests to verify the
problem. Attached is the alternate_lock_HEAD.patch, which includes the
lock modification (Tomas's change) and the TAP test.

@@ -1568,7 +1568,7 @@ OpenTableList(List *tables)
/* Allow query cancel in case this takes a long time */
CHECK_FOR_INTERRUPTS();

- rel = table_openrv(t->relation, ShareUpdateExclusiveLock);
+ rel = table_openrv(t->relation, ShareRowExclusiveLock);

The comment just above this code ("Open, share-lock, and check all the
explicitly-specified relations") needs modification. It would be
better to explain the reason why we would need an SRE lock here.

To reproduce the issue in the HEAD version, we cannot use the same
test as in the alternate_lock_HEAD patch because the behavior changes
slightly after the fix to wait for the lock until the open transaction
completes.

But won't the test that reproduces the problem in HEAD be successful
after the code change? If so, can't we use the same test instead of
a slight modification to verify the lock mode?

The attached issue_reproduce_testcase_head.patch can be
used to reproduce the issue through TAP test in HEAD.
The changes made in the HEAD version do not directly apply to older
branches. For PG14, PG13, and PG12 branches, you can use the
alternate_lock_PG14.patch.

Why didn't you include the test in the back branches? If it is due to
background psql stuff, then won't commit
(https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=187b8991f70fc3d2a13dc709edd408a8df0be055)
address it?

--
With Regards,
Amit Kapila.

#19vignesh C
vignesh21@gmail.com
In reply to: Amit Kapila (#18)
1 attachment(s)
Re: long-standing data loss bug in initial sync of logical replication

On Tue, 9 Jul 2024 at 17:05, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jul 1, 2024 at 10:51 AM vignesh C <vignesh21@gmail.com> wrote:

This issue is present in all supported versions. I was able to
reproduce it using the steps recommended by Andres and Tomas's
scripts. I also conducted a small test through TAP tests to verify the
problem. Attached is the alternate_lock_HEAD.patch, which includes the
lock modification (Tomas's change) and the TAP test.

@@ -1568,7 +1568,7 @@ OpenTableList(List *tables)
/* Allow query cancel in case this takes a long time */
CHECK_FOR_INTERRUPTS();

- rel = table_openrv(t->relation, ShareUpdateExclusiveLock);
+ rel = table_openrv(t->relation, ShareRowExclusiveLock);

The comment just above this code ("Open, share-lock, and check all the
explicitly-specified relations") needs modification. It would be
better to explain the reason why we would need an SRE lock here.

Updated comments for the same.

To reproduce the issue in the HEAD version, we cannot use the same
test as in the alternate_lock_HEAD patch because the behavior changes
slightly after the fix to wait for the lock until the open transaction
completes.

But won't the test that reproduces the problem in HEAD be successful
after the code change? If so, can't we use the same test instead of
a slight modification to verify the lock mode?

Before the fix, the ALTER PUBLICATION command would succeed
immediately. Now, the ALTER PUBLICATION command waits until it
acquires the ShareRowExclusiveLock. For the test this means that
previously we waited until the table was added to the publication,
whereas now, after applying the patch, we wait until the ALTER
PUBLICATION command is actively waiting for the ShareRowExclusiveLock.
This waiting step keeps the execution and sequencing of the test
consistent on every run.
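
To spell out the difference, these are roughly the two polling conditions
involved (a sketch of the queries the tests rely on; note that
pg_locks.waitstart exists only in PG14 and later):

-- pre-fix test: ALTER PUBLICATION returns immediately, so wait for the
-- catalog row to show up
SELECT count(*) = 1
  FROM pg_publication_rel
 WHERE prrelid = 'tab_conc'::regclass;

-- post-fix test: ALTER PUBLICATION is blocked, so wait for the ungranted
-- ShareRowExclusiveLock request instead
SELECT count(*) = 1
  FROM pg_locks
 WHERE relation = 'tab_conc'::regclass
   AND mode = 'ShareRowExclusiveLock'
   AND waitstart IS NOT NULL;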

The attached issue_reproduce_testcase_head.patch can be
used to reproduce the issue through TAP test in HEAD.
The changes made in the HEAD version do not directly apply to older
branches. For PG14, PG13, and PG12 branches, you can use the
alternate_lock_PG14.patch.

Why didn't you include the test in the back branches? If it is due to
background psql stuff, then won't commit
(https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=187b8991f70fc3d2a13dc709edd408a8df0be055)
address it?

Indeed, I initially believed it wasn't available. Currently, I haven't
incorporated the back branch patch, but I plan to include it in a
subsequent version once there are no review comments on the HEAD
patch.

The updated v2 patch has the fix for the comments.

Regards,
Vignesh

Attachments:

v2-0001-Fix-random-data-loss-during-logical-replication.patch (text/x-patch; charset=US-ASCII)
From be5eb6701e232dd475cdc2639c423bcbe638b068 Mon Sep 17 00:00:00 2001
From: Vignesh C <vignesh21@gmail.com>
Date: Tue, 9 Jul 2024 19:23:10 +0530
Subject: [PATCH v2] Fix random data loss during logical replication.

Previously, when adding tables to a publication in PostgreSQL, they were
locked using ShareUpdateExclusiveLock mode. This mode allowed the lock to
succeed even if there were ongoing transactions on the table. As a
consequence, the ALTER PUBLICATION command could complete before these
transactions, leading to a scenario where the catalog snapshot used for
replication did not include changes from transactions initiated before
the alteration.

To fix this issue, the locking mechanism has been revised. Tables are
now locked using ShareRowExclusiveLock mode during the addition to a
publication. This adjustment ensures that the ALTER PUBLICATION command
waits for any ongoing transactions on these tables to complete before
proceeding. As a result, transactions initiated before the publication
alteration are correctly included in the replication process, maintaining
data consistency across replicas.
---
 src/backend/commands/publicationcmds.c | 12 +++--
 src/test/subscription/t/100_bugs.pl    | 72 ++++++++++++++++++++++++++
 2 files changed, 81 insertions(+), 3 deletions(-)

diff --git a/src/backend/commands/publicationcmds.c b/src/backend/commands/publicationcmds.c
index 6ea709988e..be1fcf4f58 100644
--- a/src/backend/commands/publicationcmds.c
+++ b/src/backend/commands/publicationcmds.c
@@ -1542,8 +1542,14 @@ RemovePublicationSchemaById(Oid psoid)
 
 /*
  * Open relations specified by a PublicationTable list.
- * The returned tables are locked in ShareUpdateExclusiveLock mode in order to
- * add them to a publication.
+ *
+ * The returned tables are locked in ShareRowExclusiveLock mode in order to
+ * add them to a publication. The table needs to be locked in
+ * ShareRowExclusiveLock mode to ensure that any ongoing transactions involving
+ * the table are completed before adding it to the publication. Failing to do
+ * so means that transactions initiated before the alteration of the
+ * publication will continue to use a catalog snapshot predating the
+ * publication change, leading to non-replication of these transaction changes.
  */
 static List *
 OpenTableList(List *tables)
@@ -1568,7 +1574,7 @@ OpenTableList(List *tables)
 		/* Allow query cancel in case this takes a long time */
 		CHECK_FOR_INTERRUPTS();
 
-		rel = table_openrv(t->relation, ShareUpdateExclusiveLock);
+		rel = table_openrv(t->relation, ShareRowExclusiveLock);
 		myrelid = RelationGetRelid(rel);
 
 		/*
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index cb36ca7b16..3316e57ff5 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -487,6 +487,78 @@ $result =
 is( $result, qq(2|f
 3|t), 'check replicated update on subscriber');
 
+# Incremental data synchronization skipped when a new table is added, if
+# there is a concurrent active transaction involving the same table.
+
+# Create table in publisher and subscriber.
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_conc(a int)");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_conc(a int)");
+
+# Bump the query timeout to avoid false negatives on slow test systems.
+my $psql_timeout_secs = 4 * $PostgreSQL::Test::Utils::timeout_default;
+
+# Initiate a background session that keeps a transaction active.
+my $background_psql1 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+
+# Maintain an active transaction with the table.
+$background_psql1->set_query_timer_restart();
+$background_psql1->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO tab_conc VALUES (1);
+]);
+
+# Add the table to the publication using background_psql, as the alter
+# publication operation will wait for the lock and can only be completed after
+# the previous open transaction is committed.
+my $background_psql2 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+
+$background_psql2->set_query_timer_restart();
+
+# This operation will wait because there is an open transaction holding a lock.
+$background_psql2->query_until(qr//,
+	"ALTER PUBLICATION pub1 ADD TABLE tab_conc;\n");
+
+# Verify that the table addition is waiting to acquire a ShareRowExclusiveLock.
+$node_publisher->poll_query_until('postgres',
+	"SELECT COUNT(1) = 1 FROM pg_locks WHERE relation = 'tab_conc'::regclass AND mode = 'ShareRowExclusiveLock' AND waitstart IS NOT NULL;"
+  )
+  or die
+  "Timed out while waiting for alter publication tries to wait on ShareRowExclusiveLock";
+
+# Complete the old transaction.
+$background_psql1->query_safe(qq[COMMIT]);
+$background_psql1->quit;
+
+$background_psql1->query_safe(qq[INSERT INTO tab_conc VALUES (2)]);
+$background_psql1->quit;
+
+# Refresh the publication.
+$node_subscriber->safe_psql('postgres',
+	'ALTER SUBSCRIPTION sub1 REFRESH PUBLICATION');
+
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'sub1');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2), 'Ensure that the data from the tab_conc table is synchronized to the subscriber after the subscription is refreshed');
+
+# Perform an insert.
+$node_publisher->safe_psql('postgres', "INSERT INTO tab_conc values(3)");
+$node_publisher->wait_for_catchup('sub1');
+
+# Verify that the insert is replicated to the subscriber.
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3), 'Verify that the incremental data added after table synchronization is replicated to the subscriber');
+
 $node_publisher->stop('fast');
 $node_subscriber->stop('fast');
 
-- 
2.34.1

#20Amit Kapila
amit.kapila16@gmail.com
In reply to: vignesh C (#19)
1 attachment(s)
Re: long-standing data loss bug in initial sync of logical replication

On Tue, Jul 9, 2024 at 8:14 PM vignesh C <vignesh21@gmail.com> wrote:

On Tue, 9 Jul 2024 at 17:05, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jul 1, 2024 at 10:51 AM vignesh C <vignesh21@gmail.com> wrote:

This issue is present in all supported versions. I was able to
reproduce it using the steps recommended by Andres and Tomas's
scripts. I also conducted a small test through TAP tests to verify the
problem. Attached is the alternate_lock_HEAD.patch, which includes the
lock modification (Tomas's change) and the TAP test.

@@ -1568,7 +1568,7 @@ OpenTableList(List *tables)
/* Allow query cancel in case this takes a long time */
CHECK_FOR_INTERRUPTS();

- rel = table_openrv(t->relation, ShareUpdateExclusiveLock);
+ rel = table_openrv(t->relation, ShareRowExclusiveLock);

The comment just above this code ("Open, share-lock, and check all the
explicitly-specified relations") needs modification. It would be
better to explain the reason why we would need an SRE lock here.

Updated comments for the same.

The patch missed using the ShareRowExclusiveLock for partitions; see
attached. I haven't tested it, but they should also face the same
problem. Apart from that, I have changed the comments in a few places
in the patch.
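
As a hypothetical sketch of the concern (names are made up; plain
inheritance is shown, since child tables are picked up via the same
find_all_inheritors() call in OpenTableList()):

CREATE TABLE parent (a int);
CREATE TABLE child () INHERITS (parent);

-- S2: open transaction that only touches the child
BEGIN;
INSERT INTO child VALUES (1);

-- S1: adding the parent should also take ShareRowExclusiveLock on child,
-- so with the complete fix this blocks until S2 commits
ALTER PUBLICATION pub1 ADD TABLE parent;

If the children were still locked with ShareUpdateExclusiveLock, the ALTER
would not wait for S2, and the child's changes could be decoded with a stale
catalog snapshot, just like in the single-table case.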

--
With Regards,
Amit Kapila.

Attachments:

v3-0001-Fix-data-loss-during-initial-sync-in-logical-repl.patch (application/octet-stream)
From 92422b204345ea8078be1c444d3014debe583b65 Mon Sep 17 00:00:00 2001
From: Vignesh C <vignesh21@gmail.com>
Date: Tue, 9 Jul 2024 19:23:10 +0530
Subject: [PATCH v3] Fix data loss during initial sync in logical replication.

Previously, when adding tables to a publication in PostgreSQL, they were
locked using ShareUpdateExclusiveLock mode. This mode allowed the lock to
succeed even if there were ongoing DML transactions on that table. As a
consequence, the ALTER PUBLICATION command could be completed before these
transactions, leading to a scenario where the catalog snapshot used for
replication did not include changes from transactions initiated before the
alteration.

To fix this issue, tables are now locked using ShareRowExclusiveLock mode
during the addition to a publication. This change ensures that the
ALTER PUBLICATION command waits for any ongoing transactions on the tables
(to be added to the publication) to be completed before proceeding. As a
result, transactions initiated before the publication alteration are
correctly included in the replication process.

Reported-by: Tomas Vondra
Diagnosed-by: Andres Freund
Author: Vignesh C, Tomas Vondra
Reviewed-by: Amit Kapila
Backpatch-through: 12
Discussion: https://postgr.es/m/de52b282-1166-1180-45a2-8d8917ca74c6@enterprisedb.com
---
 src/backend/commands/publicationcmds.c | 16 ++++--
 src/test/subscription/t/100_bugs.pl    | 73 ++++++++++++++++++++++++++
 2 files changed, 84 insertions(+), 5 deletions(-)

diff --git a/src/backend/commands/publicationcmds.c b/src/backend/commands/publicationcmds.c
index 6ea709988e..341ea0318d 100644
--- a/src/backend/commands/publicationcmds.c
+++ b/src/backend/commands/publicationcmds.c
@@ -1542,8 +1542,14 @@ RemovePublicationSchemaById(Oid psoid)
 
 /*
  * Open relations specified by a PublicationTable list.
- * The returned tables are locked in ShareUpdateExclusiveLock mode in order to
- * add them to a publication.
+ *
+ * The returned tables are locked in ShareRowExclusiveLock mode to add them
+ * to a publication. The table needs to be locked in ShareRowExclusiveLock
+ * mode to ensure that any ongoing transactions involving that table are
+ * completed before adding it to the publication. Otherwise, the transaction
+ * initiated before the alteration of the publication will continue to use a
+ * catalog snapshot predating the publication change, leading to
+ * non-replication of these transaction changes.
  */
 static List *
 OpenTableList(List *tables)
@@ -1568,7 +1574,7 @@ OpenTableList(List *tables)
 		/* Allow query cancel in case this takes a long time */
 		CHECK_FOR_INTERRUPTS();
 
-		rel = table_openrv(t->relation, ShareUpdateExclusiveLock);
+		rel = table_openrv(t->relation, ShareRowExclusiveLock);
 		myrelid = RelationGetRelid(rel);
 
 		/*
@@ -1594,7 +1600,7 @@ OpenTableList(List *tables)
 						 errmsg("conflicting or redundant column lists for table \"%s\"",
 								RelationGetRelationName(rel))));
 
-			table_close(rel, ShareUpdateExclusiveLock);
+			table_close(rel, ShareRowExclusiveLock);
 			continue;
 		}
 
@@ -1622,7 +1628,7 @@ OpenTableList(List *tables)
 			List	   *children;
 			ListCell   *child;
 
-			children = find_all_inheritors(myrelid, ShareUpdateExclusiveLock,
+			children = find_all_inheritors(myrelid, ShareRowExclusiveLock,
 										   NULL);
 
 			foreach(child, children)
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index cb36ca7b16..01ae1edfb2 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -487,6 +487,79 @@ $result =
 is( $result, qq(2|f
 3|t), 'check replicated update on subscriber');
 
+# The bug was that the incremental data synchronization was being skipped when
+# a new table is added to the publication in presence of a concurrent active
+# transaction performing the DML on the same table.
+
+# Create table in publisher and subscriber.
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_conc(a int)");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_conc(a int)");
+
+# Bump the query timeout to avoid false negatives on slow test systems.
+my $psql_timeout_secs = 4 * $PostgreSQL::Test::Utils::timeout_default;
+
+# Initiate a background session that keeps a transaction active.
+my $background_psql1 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+
+# Maintain an active transaction with the table.
+$background_psql1->set_query_timer_restart();
+$background_psql1->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO tab_conc VALUES (1);
+]);
+
+# Add the table to the publication using background_psql, as the alter
+# publication operation will wait for the lock and can only be completed after
+# the previous open transaction is committed.
+my $background_psql2 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+
+$background_psql2->set_query_timer_restart();
+
+# This operation will wait because there is an open transaction holding a lock.
+$background_psql2->query_until(qr//,
+	"ALTER PUBLICATION pub1 ADD TABLE tab_conc;\n");
+
+# Verify that the table addition is waiting to acquire a ShareRowExclusiveLock.
+$node_publisher->poll_query_until('postgres',
+	"SELECT COUNT(1) = 1 FROM pg_locks WHERE relation = 'tab_conc'::regclass AND mode = 'ShareRowExclusiveLock' AND waitstart IS NOT NULL;"
+  )
+  or die
+  "Timed out while waiting for alter publication tries to wait on ShareRowExclusiveLock";
+
+# Complete the old transaction.
+$background_psql1->query_safe(qq[COMMIT]);
+$background_psql1->quit;
+
+$background_psql1->query_safe(qq[INSERT INTO tab_conc VALUES (2)]);
+$background_psql1->quit;
+
+# Refresh the publication.
+$node_subscriber->safe_psql('postgres',
+	'ALTER SUBSCRIPTION sub1 REFRESH PUBLICATION');
+
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'sub1');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2), 'Ensure that the data from the tab_conc table is synchronized to the subscriber after the subscription is refreshed');
+
+# Perform an insert.
+$node_publisher->safe_psql('postgres', "INSERT INTO tab_conc values(3)");
+$node_publisher->wait_for_catchup('sub1');
+
+# Verify that the insert is replicated to the subscriber.
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3), 'Verify that the incremental data added after table synchronization is replicated to the subscriber');
+
 $node_publisher->stop('fast');
 $node_subscriber->stop('fast');
 
-- 
2.28.0.windows.1

#21vignesh C
vignesh21@gmail.com
In reply to: Amit Kapila (#20)
6 attachment(s)
Re: long-standing data loss bug in initial sync of logical replication

On Wed, 10 Jul 2024 at 12:28, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jul 9, 2024 at 8:14 PM vignesh C <vignesh21@gmail.com> wrote:

On Tue, 9 Jul 2024 at 17:05, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jul 1, 2024 at 10:51 AM vignesh C <vignesh21@gmail.com> wrote:

This issue is present in all supported versions. I was able to
reproduce it using the steps recommended by Andres and Tomas's
scripts. I also conducted a small test through TAP tests to verify the
problem. Attached is the alternate_lock_HEAD.patch, which includes the
lock modification (Tomas's change) and the TAP test.

@@ -1568,7 +1568,7 @@ OpenTableList(List *tables)
/* Allow query cancel in case this takes a long time */
CHECK_FOR_INTERRUPTS();

- rel = table_openrv(t->relation, ShareUpdateExclusiveLock);
+ rel = table_openrv(t->relation, ShareRowExclusiveLock);

The comment just above this code ("Open, share-lock, and check all the
explicitly-specified relations") needs modification. It would be
better to explain the reason why we would need an SRE lock here.

Updated comments for the same.

The patch missed using the ShareRowExclusiveLock for partitions; see
attached. I haven't tested it, but they should also face the same
problem. Apart from that, I have changed the comments in a few places
in the patch.

I could not hit the updated ShareRowExclusiveLock changes through a
partitioned table; instead, I could verify them using an inheritance
table. I have added a test for that and am also attaching the
back-branch patch.

Regards,
Vignesh

Attachments:

v4-0001-Fix-data-loss-during-initial-sync-in-logical-repl_HEAD.patch (text/x-patch; charset=US-ASCII)
From d300868b61c65a6b575078c29c0d20994acae1fa Mon Sep 17 00:00:00 2001
From: Vignesh C <vignesh21@gmail.com>
Date: Tue, 9 Jul 2024 19:23:10 +0530
Subject: [PATCH v4] Fix data loss during initial sync in logical replication.

Previously, when adding tables to a publication in PostgreSQL, they were
locked using ShareUpdateExclusiveLock mode. This mode allowed the lock to
succeed even if there were ongoing DML transactions on that table. As a
consequence, the ALTER PUBLICATION command could be completed before these
transactions, leading to a scenario where the catalog snapshot used for
replication did not include changes from transactions initiated before the
alteration.

To fix this issue, tables are now locked using ShareRowExclusiveLock mode
during the addition to a publication. This change ensures that the
ALTER PUBLICATION command waits for any ongoing transactions on the tables
(to be added to the publication) to be completed before proceeding. As a
result, transactions initiated before the publication alteration are
correctly included in the replication process.

Reported-by: Tomas Vondra
Diagnosed-by: Andres Freund
Author: Vignesh C, Tomas Vondra
Reviewed-by: Amit Kapila
Backpatch-through: 12
Discussion: https://postgr.es/m/de52b282-1166-1180-45a2-8d8917ca74c6@enterprisedb.com
---
 src/backend/commands/publicationcmds.c |  16 ++-
 src/test/subscription/t/100_bugs.pl    | 142 +++++++++++++++++++++++++
 2 files changed, 153 insertions(+), 5 deletions(-)

diff --git a/src/backend/commands/publicationcmds.c b/src/backend/commands/publicationcmds.c
index 6ea709988e..341ea0318d 100644
--- a/src/backend/commands/publicationcmds.c
+++ b/src/backend/commands/publicationcmds.c
@@ -1542,8 +1542,14 @@ RemovePublicationSchemaById(Oid psoid)
 
 /*
  * Open relations specified by a PublicationTable list.
- * The returned tables are locked in ShareUpdateExclusiveLock mode in order to
- * add them to a publication.
+ *
+ * The returned tables are locked in ShareRowExclusiveLock mode to add them
+ * to a publication. The table needs to be locked in ShareRowExclusiveLock
+ * mode to ensure that any ongoing transactions involving that table are
+ * completed before adding it to the publication. Otherwise, the transaction
+ * initiated before the alteration of the publication will continue to use a
+ * catalog snapshot predating the publication change, leading to
+ * non-replication of these transaction changes.
  */
 static List *
 OpenTableList(List *tables)
@@ -1568,7 +1574,7 @@ OpenTableList(List *tables)
 		/* Allow query cancel in case this takes a long time */
 		CHECK_FOR_INTERRUPTS();
 
-		rel = table_openrv(t->relation, ShareUpdateExclusiveLock);
+		rel = table_openrv(t->relation, ShareRowExclusiveLock);
 		myrelid = RelationGetRelid(rel);
 
 		/*
@@ -1594,7 +1600,7 @@ OpenTableList(List *tables)
 						 errmsg("conflicting or redundant column lists for table \"%s\"",
 								RelationGetRelationName(rel))));
 
-			table_close(rel, ShareUpdateExclusiveLock);
+			table_close(rel, ShareRowExclusiveLock);
 			continue;
 		}
 
@@ -1622,7 +1628,7 @@ OpenTableList(List *tables)
 			List	   *children;
 			ListCell   *child;
 
-			children = find_all_inheritors(myrelid, ShareUpdateExclusiveLock,
+			children = find_all_inheritors(myrelid, ShareRowExclusiveLock,
 										   NULL);
 
 			foreach(child, children)
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index cb36ca7b16..670c574547 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -487,6 +487,148 @@ $result =
 is( $result, qq(2|f
 3|t), 'check replicated update on subscriber');
 
+# The bug was that the incremental data synchronization was being skipped when
+# a new table is added to the publication in presence of a concurrent active
+# transaction performing the DML on the same table.
+
+# Create tables in publisher and subscriber.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE TABLE tab1_conc(a int);
+	CREATE TABLE tab1_conc_child() inherits (tab1_conc);
+));
+
+$node_subscriber->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE TABLE tab1_conc(a int);
+	CREATE TABLE tab1_conc_child() inherits (tab1_conc);
+));
+
+# Bump the query timeout to avoid false negatives on slow test systems.
+my $psql_timeout_secs = 4 * $PostgreSQL::Test::Utils::timeout_default;
+
+# Initiate a background session that keeps a transaction active.
+my $background_psql1 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+
+# Maintain an active transaction with the table.
+$background_psql1->set_query_timer_restart();
+$background_psql1->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO tab_conc VALUES (1);
+]);
+
+# Add the table to the publication using background_psql, as the alter
+# publication operation will wait for the lock and can only be completed after
+# the previous open transaction is committed.
+my $background_psql2 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+
+$background_psql2->set_query_timer_restart();
+
+# This operation will wait because there is an open transaction holding a lock.
+$background_psql2->query_until(qr//,
+	"ALTER PUBLICATION pub1 ADD TABLE tab_conc;\n");
+
+# Verify that the table addition is waiting to acquire a ShareRowExclusiveLock.
+$node_publisher->poll_query_until('postgres',
+	"SELECT COUNT(1) = 1 FROM pg_locks WHERE relation = 'tab_conc'::regclass AND mode = 'ShareRowExclusiveLock' AND waitstart IS NOT NULL;"
+  )
+  or die
+  "Timed out while waiting for alter publication tries to wait on ShareRowExclusiveLock";
+
+# Complete the old transaction.
+$background_psql1->query_safe(qq[COMMIT]);
+
+# Maintain an active transaction with inheritance table.
+$background_psql1->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO tab1_conc_child VALUES (1);
+]);
+
+# Add an inheritance table to the publication, this operation will wait because
+# there is an open transaction holding a lock.
+$background_psql2->query_until(qr//,
+	"ALTER PUBLICATION pub1 ADD TABLE tab1_conc;\n");
+
+# Verify that the child table addition is waiting to acquire a
+# ShareRowExclusiveLock.
+$node_publisher->poll_query_until('postgres',
+	"SELECT COUNT(1) = 1 FROM pg_locks WHERE relation = 'tab1_conc_child'::regclass AND mode = 'ShareRowExclusiveLock' AND waitstart IS NOT NULL;"
+  )
+  or die
+  "Timed out while waiting for alter publication tries to wait on ShareRowExclusiveLock";
+
+# Complete the old transaction.
+$background_psql1->query_safe(qq[COMMIT]);
+$background_psql1->quit;
+
+# Wait till the tables are added to the publication.
+$node_publisher->poll_query_until('postgres',
+	"SELECT COUNT(1) = 2 FROM pg_publication_rel WHERE prrelid IN ('tab_conc'::regclass, 'tab1_conc'::regclass);"
+  )
+  or die
+  "Timed out while waiting for alter publication to add the table to the publication";
+
+$background_psql1->quit;
+
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (2);
+	INSERT INTO tab1_conc_child VALUES (2);
+));
+
+# Refresh the publication.
+$node_subscriber->safe_psql('postgres',
+	'ALTER SUBSCRIPTION sub1 REFRESH PUBLICATION');
+
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'sub1');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2),
+	'Ensure that the data from the tab_conc table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM tab1_conc_child");
+is( $result, qq(1
+2),
+	'Ensure that the data from the tab1_conc_child table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+# Perform an insert.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (3);
+	INSERT INTO tab1_conc_child VALUES (3);
+));
+$node_publisher->wait_for_catchup('sub1');
+
+# Verify that the insert is replicated to the subscriber.
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM tab1_conc_child");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table tab1_conc_child added after table synchronization is replicated to the subscriber'
+);
+
 $node_publisher->stop('fast');
 $node_subscriber->stop('fast');
 
-- 
2.34.1

v4-0001-Fix-data-loss-during-initial-sync-in-logical-repl_PG14.patch (text/x-patch; charset=US-ASCII)
From 654c8d3f6f599889c628090194aa28639bf3430d Mon Sep 17 00:00:00 2001
From: Vignesh C <vignesh21@gmail.com>
Date: Wed, 10 Jul 2024 21:43:43 +0530
Subject: [PATCH v5] Fix data loss during initial sync in logical replication.

Previously, when adding tables to a publication in PostgreSQL, they were
locked using ShareUpdateExclusiveLock mode. This mode allowed the lock to
succeed even if there were ongoing DML transactions on that table. As a
consequence, the ALTER PUBLICATION command could be completed before these
transactions, leading to a scenario where the catalog snapshot used for
replication did not include changes from transactions initiated before the
alteration.

To fix this issue, tables are now locked using ShareRowExclusiveLock mode
during the addition to a publication. This change ensures that the
ALTER PUBLICATION command waits for any ongoing transactions on the tables
(to be added to the publication) to be completed before proceeding. As a
result, transactions initiated before the publication alteration are
correctly included in the replication process.

Reported-by: Tomas Vondra
Diagnosed-by: Andres Freund
Author: Vignesh C, Tomas Vondra
Reviewed-by: Amit Kapila
Backpatch-through: 12
Discussion: https://postgr.es/m/de52b282-1166-1180-45a2-8d8917ca74c6@enterprisedb.com
---
 src/backend/commands/publicationcmds.c |  16 ++-
 src/test/subscription/t/100_bugs.pl    | 161 ++++++++++++++++++++++++-
 2 files changed, 171 insertions(+), 6 deletions(-)

diff --git a/src/backend/commands/publicationcmds.c b/src/backend/commands/publicationcmds.c
index e288dd41cd..0215f3999f 100644
--- a/src/backend/commands/publicationcmds.c
+++ b/src/backend/commands/publicationcmds.c
@@ -547,8 +547,14 @@ RemovePublicationById(Oid pubid)
 
 /*
  * Open relations specified by a RangeVar list.
- * The returned tables are locked in ShareUpdateExclusiveLock mode in order to
- * add them to a publication.
+ *
+ * The returned tables are locked in ShareRowExclusiveLock mode to add them
+ * to a publication. The table needs to be locked in ShareRowExclusiveLock
+ * mode to ensure that any ongoing transactions involving that table are
+ * completed before adding it to the publication. Otherwise, the transaction
+ * initiated before the alteration of the publication will continue to use a
+ * catalog snapshot predating the publication change, leading to
+ * non-replication of these transaction changes.
  */
 static List *
 OpenTableList(List *tables)
@@ -570,7 +576,7 @@ OpenTableList(List *tables)
 		/* Allow query cancel in case this takes a long time */
 		CHECK_FOR_INTERRUPTS();
 
-		rel = table_openrv(rv, ShareUpdateExclusiveLock);
+		rel = table_openrv(rv, ShareRowExclusiveLock);
 		myrelid = RelationGetRelid(rel);
 
 		/*
@@ -582,7 +588,7 @@ OpenTableList(List *tables)
 		 */
 		if (list_member_oid(relids, myrelid))
 		{
-			table_close(rel, ShareUpdateExclusiveLock);
+			table_close(rel, ShareRowExclusiveLock);
 			continue;
 		}
 
@@ -600,7 +606,7 @@ OpenTableList(List *tables)
 			List	   *children;
 			ListCell   *child;
 
-			children = find_all_inheritors(myrelid, ShareUpdateExclusiveLock,
+			children = find_all_inheritors(myrelid, ShareRowExclusiveLock,
 										   NULL);
 
 			foreach(child, children)
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index cce91891ab..501b94a332 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -6,7 +6,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 9;
+use Test::More tests => 13;
 
 # Bug #15114
 
@@ -362,3 +362,162 @@ is( $node_subscriber_d_cols->safe_psql(
 
 $node_publisher_d_cols->stop('fast');
 $node_subscriber_d_cols->stop('fast');
+
+# The bug was that the incremental data synchronization was being skipped when
+# a new table is added to the publication in presence of a concurrent active
+# transaction performing the DML on the same table.
+my $node_publisher1 = get_new_node('node_publisher1');
+$node_publisher1->init(allows_streaming => 'logical');
+$node_publisher1->start;
+
+my $node_subscriber1 = get_new_node('node_subscriber1');
+$node_subscriber1->init(allows_streaming => 'logical');
+$node_subscriber1->start;
+
+$publisher_connstr = $node_publisher1->connstr . ' dbname=postgres';
+$node_publisher1->safe_psql('postgres',
+	"CREATE PUBLICATION pub1");
+$node_subscriber1->safe_psql('postgres',
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+);
+
+# Create tables in publisher and subscriber.
+$node_publisher1->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE TABLE tab1_conc(a int);
+	CREATE TABLE tab1_conc_child() inherits (tab1_conc);
+));
+
+$node_subscriber1->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE TABLE tab1_conc(a int);
+	CREATE TABLE tab1_conc_child() inherits (tab1_conc);
+));
+
+# Bump the query timeout to avoid false negatives on slow test systems.
+my $psql_timeout_secs = 4 * $PostgreSQL::Test::Utils::timeout_default;
+
+# Initiate a background session that keeps a transaction active.
+my $background_psql1 = $node_publisher1->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+
+# Maintain an active transaction with the table.
+$background_psql1->set_query_timer_restart();
+$background_psql1->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO tab_conc VALUES (1);
+]);
+
+# Add the table to the publication using background_psql, as the alter
+# publication operation will wait for the lock and can only be completed after
+# the previous open transaction is committed.
+my $background_psql2 = $node_publisher1->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+
+$background_psql2->set_query_timer_restart();
+
+# This operation will wait because there is an open transaction holding a lock.
+$background_psql2->query_until(qr//,
+	"ALTER PUBLICATION pub1 ADD TABLE tab_conc;\n");
+
+# Verify that the table addition is waiting to acquire a ShareRowExclusiveLock.
+$node_publisher1->poll_query_until('postgres',
+	"SELECT COUNT(1) = 1 FROM pg_locks WHERE relation = 'tab_conc'::regclass AND mode = 'ShareRowExclusiveLock' AND waitstart IS NOT NULL;"
+  )
+  or die
+  "Timed out while waiting for alter publication tries to wait on ShareRowExclusiveLock";
+
+# Complete the old transaction.
+$background_psql1->query_safe(qq[COMMIT]);
+
+# Maintain an active transaction with inheritance table.
+$background_psql1->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO tab1_conc_child VALUES (1);
+]);
+
+# Add an inheritance table to the publication, this operation will wait because
+# there is an open transaction holding a lock.
+$background_psql2->query_until(qr//,
+	"ALTER PUBLICATION pub1 ADD TABLE tab1_conc;\n");
+
+# Verify that the child table addition is waiting to acquire a
+# ShareRowExclusiveLock.
+$node_publisher1->poll_query_until('postgres',
+	"SELECT COUNT(1) = 1 FROM pg_locks WHERE relation = 'tab1_conc_child'::regclass AND mode = 'ShareRowExclusiveLock' AND waitstart IS NOT NULL;"
+  )
+  or die
+  "Timed out while waiting for alter publication tries to wait on ShareRowExclusiveLock";
+
+# Complete the old transaction.
+$background_psql1->query_safe(qq[COMMIT]);
+$background_psql1->quit;
+
+# Wait till the tables are added to the publication.
+$node_publisher1->poll_query_until('postgres',
+	"SELECT COUNT(1) = 2 FROM pg_publication_rel WHERE prrelid IN ('tab_conc'::regclass, 'tab1_conc'::regclass);"
+  )
+  or die
+  "Timed out while waiting for alter publication to add the table to the publication";
+
+$background_psql1->quit;
+
+$node_publisher1->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (2);
+	INSERT INTO tab1_conc_child VALUES (2);
+));
+
+# Refresh the publication.
+$node_subscriber1->safe_psql('postgres',
+	'ALTER SUBSCRIPTION sub1 REFRESH PUBLICATION');
+
+$node_subscriber1->wait_for_subscription_sync($node_publisher1, 'sub1');
+
+my $result = $node_subscriber1->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2),
+	'Ensure that the data from the tab_conc table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+$result =
+  $node_subscriber1->safe_psql('postgres', "SELECT * FROM tab1_conc_child");
+is( $result, qq(1
+2),
+	'Ensure that the data from the tab1_conc_child table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+# Perform an insert.
+$node_publisher1->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (3);
+	INSERT INTO tab1_conc_child VALUES (3);
+));
+$node_publisher1->wait_for_catchup('sub1');
+
+# Verify that the insert is replicated to the subscriber.
+$result = $node_subscriber1->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+$result =
+  $node_subscriber1->safe_psql('postgres', "SELECT * FROM tab1_conc_child");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table tab1_conc_child added after table synchronization is replicated to the subscriber'
+);
+
+$node_publisher1->stop('fast');
+$node_subscriber1->stop('fast');
-- 
2.34.1

v4-0001-Fix-data-loss-during-initial-sync-in-logical-repl_PG12.patch (text/x-patch; charset=US-ASCII)
From 7b2d8e60eb2192c4a10408b40a49914b3e1b5019 Mon Sep 17 00:00:00 2001
From: Vignesh C <vignesh21@gmail.com>
Date: Wed, 10 Jul 2024 21:57:57 +0530
Subject: [PATCH v5] Fix data loss during initial sync in logical replication.

Previously, when adding tables to a publication in PostgreSQL, they were
locked using ShareUpdateExclusiveLock mode. This mode allowed the lock to
succeed even if there were ongoing DML transactions on that table. As a
consequence, the ALTER PUBLICATION command could be completed before these
transactions, leading to a scenario where the catalog snapshot used for
replication did not include changes from transactions initiated before the
alteration.

To fix this issue, tables are now locked using ShareRowExclusiveLock mode
during the addition to a publication. This change ensures that the
ALTER PUBLICATION command waits for any ongoing transactions on the tables
(to be added to the publication) to be completed before proceeding. As a
result, transactions initiated before the publication alteration are
correctly included in the replication process.

Reported-by: Tomas Vondra
Diagnosed-by: Andres Freund
Author: Vignesh C, Tomas Vondra
Reviewed-by: Amit Kapila
Backpatch-through: 12
Discussion: https://postgr.es/m/de52b282-1166-1180-45a2-8d8917ca74c6@enterprisedb.com
---
 src/backend/commands/publicationcmds.c |  15 ++-
 src/test/subscription/t/100_bugs.pl    | 161 ++++++++++++++++++++++++-
 2 files changed, 171 insertions(+), 5 deletions(-)

diff --git a/src/backend/commands/publicationcmds.c b/src/backend/commands/publicationcmds.c
index 4f70af07ba..bf9761779f 100644
--- a/src/backend/commands/publicationcmds.c
+++ b/src/backend/commands/publicationcmds.c
@@ -506,7 +506,14 @@ RemovePublicationRelById(Oid proid)
 
 /*
  * Open relations specified by a RangeVar list.
- * The returned tables are locked in ShareUpdateExclusiveLock mode.
+ *
+ * The returned tables are locked in ShareRowExclusiveLock mode to add them
+ * to a publication. The table needs to be locked in ShareRowExclusiveLock
+ * mode to ensure that any ongoing transactions involving that table are
+ * completed before adding it to the publication. Otherwise, the transaction
+ * initiated before the alteration of the publication will continue to use a
+ * catalog snapshot predating the publication change, leading to
+ * non-replication of these transaction changes.
  */
 static List *
 OpenTableList(List *tables)
@@ -528,7 +535,7 @@ OpenTableList(List *tables)
 		/* Allow query cancel in case this takes a long time */
 		CHECK_FOR_INTERRUPTS();
 
-		rel = table_openrv(rv, ShareUpdateExclusiveLock);
+		rel = table_openrv(rv, ShareRowExclusiveLock);
 		myrelid = RelationGetRelid(rel);
 
 		/*
@@ -540,7 +547,7 @@ OpenTableList(List *tables)
 		 */
 		if (list_member_oid(relids, myrelid))
 		{
-			table_close(rel, ShareUpdateExclusiveLock);
+			table_close(rel, ShareRowExclusiveLock);
 			continue;
 		}
 
@@ -553,7 +560,7 @@ OpenTableList(List *tables)
 			List	   *children;
 			ListCell   *child;
 
-			children = find_all_inheritors(myrelid, ShareUpdateExclusiveLock,
+			children = find_all_inheritors(myrelid, ShareRowExclusiveLock,
 										   NULL);
 
 			foreach(child, children)
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index f5968ffa97..ba11528750 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -3,7 +3,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 7;
+use Test::More tests => 11;
 
 # Bug #15114
 
@@ -244,3 +244,162 @@ is( $node_subscriber_d_cols->safe_psql(
 
 $node_publisher_d_cols->stop('fast');
 $node_subscriber_d_cols->stop('fast');
+
+# The bug was that incremental data synchronization was skipped when a new
+# table was added to the publication in the presence of a concurrent active
+# transaction performing DML on the same table.
+my $node_publisher1 = get_new_node('node_publisher1');
+$node_publisher1->init(allows_streaming => 'logical');
+$node_publisher1->start;
+
+my $node_subscriber1 = get_new_node('node_subscriber1');
+$node_subscriber1->init(allows_streaming => 'logical');
+$node_subscriber1->start;
+
+$publisher_connstr = $node_publisher1->connstr . ' dbname=postgres';
+$node_publisher1->safe_psql('postgres',
+	"CREATE PUBLICATION pub1");
+$node_subscriber1->safe_psql('postgres',
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+);
+
+# Create tables in publisher and subscriber.
+$node_publisher1->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE TABLE tab1_conc(a int);
+	CREATE TABLE tab1_conc_child() inherits (tab1_conc);
+));
+
+$node_subscriber1->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE TABLE tab1_conc(a int);
+	CREATE TABLE tab1_conc_child() inherits (tab1_conc);
+));
+
+# Bump the query timeout to avoid false negatives on slow test systems.
+my $psql_timeout_secs = 4 * $PostgreSQL::Test::Utils::timeout_default;
+
+# Initiate a background session that keeps a transaction active.
+my $background_psql1 = $node_publisher1->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+
+# Maintain an active transaction with the table.
+$background_psql1->set_query_timer_restart();
+$background_psql1->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO tab_conc VALUES (1);
+]);
+
+# Add the table to the publication using background_psql, as the alter
+# publication operation will wait for the lock and can only be completed after
+# the previous open transaction is committed.
+my $background_psql2 = $node_publisher1->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+
+$background_psql2->set_query_timer_restart();
+
+# This operation will wait because there is an open transaction holding a lock.
+$background_psql2->query_until(qr//,
+	"ALTER PUBLICATION pub1 ADD TABLE tab_conc;\n");
+
+# Verify that the table addition is waiting to acquire a ShareRowExclusiveLock.
+$node_publisher1->poll_query_until('postgres',
+	"SELECT COUNT(1) = 1 FROM pg_locks WHERE relation = 'tab_conc'::regclass AND mode = 'ShareRowExclusiveLock';"
+  )
+  or die
+  "Timed out while waiting for alter publication tries to wait on ShareRowExclusiveLock";
+
+# Complete the old transaction.
+$background_psql1->query_safe(qq[COMMIT]);
+
+# Maintain an active transaction with inheritance table.
+$background_psql1->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO tab1_conc_child VALUES (1);
+]);
+
+# Add an inheritance table to the publication; this operation will wait because
+# there is an open transaction holding a lock.
+$background_psql2->query_until(qr//,
+	"ALTER PUBLICATION pub1 ADD TABLE tab1_conc;\n");
+
+# Verify that the child table addition is waiting to acquire a
+# ShareRowExclusiveLock.
+$node_publisher1->poll_query_until('postgres',
+	"SELECT COUNT(1) = 1 FROM pg_locks WHERE relation = 'tab1_conc_child'::regclass AND mode = 'ShareRowExclusiveLock';"
+  )
+  or die
+  "Timed out while waiting for alter publication tries to wait on ShareRowExclusiveLock";
+
+# Complete the old transaction.
+$background_psql1->query_safe(qq[COMMIT]);
+$background_psql1->quit;
+
+# Wait till the tables are added to the publication.
+$node_publisher1->poll_query_until('postgres',
+	"SELECT COUNT(1) = 2 FROM pg_publication_rel WHERE prrelid IN ('tab_conc'::regclass, 'tab1_conc'::regclass);"
+  )
+  or die
+  "Timed out while waiting for alter publication to add the table to the publication";
+
+$background_psql1->quit;
+
+$node_publisher1->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (2);
+	INSERT INTO tab1_conc_child VALUES (2);
+));
+
+# Refresh the publication.
+$node_subscriber1->safe_psql('postgres',
+	'ALTER SUBSCRIPTION sub1 REFRESH PUBLICATION');
+
+$node_subscriber1->wait_for_subscription_sync($node_publisher1, 'sub1');
+
+my $result = $node_subscriber1->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2),
+	'Ensure that the data from the tab_conc table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+$result =
+  $node_subscriber1->safe_psql('postgres', "SELECT * FROM tab1_conc_child");
+is( $result, qq(1
+2),
+	'Ensure that the data from the tab1_conc_child table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+# Perform an insert.
+$node_publisher1->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (3);
+	INSERT INTO tab1_conc_child VALUES (3);
+));
+$node_publisher1->wait_for_catchup('sub1');
+
+# Verify that the insert is replicated to the subscriber.
+$result = $node_subscriber1->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+$result =
+  $node_subscriber1->safe_psql('postgres', "SELECT * FROM tab1_conc_child");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table tab1_conc_child added after table synchronization is replicated to the subscriber'
+);
+
+$node_publisher1->stop('fast');
+$node_subscriber1->stop('fast');
-- 
2.34.1

v4-0001-Fix-data-loss-during-initial-sync-in-logical-repl_PG13.patch (text/x-patch)
From 496463b1d4b4e7d0685f03144ed23c7c3b24a7a0 Mon Sep 17 00:00:00 2001
From: Vignesh C <vignesh21@gmail.com>
Date: Wed, 10 Jul 2024 21:43:43 +0530
Subject: [PATCH v5] Fix data loss during initial sync in logical replication.

Previously, when adding tables to a publication in PostgreSQL, they were
locked using ShareUpdateExclusiveLock mode. This mode allowed the lock to
succeed even if there were ongoing DML transactions on that table. As a
consequence, the ALTER PUBLICATION command could be completed before these
transactions, leading to a scenario where the catalog snapshot used for
replication did not include changes from transactions initiated before the
alteration.

To fix this issue, tables are now locked using ShareRowExclusiveLock mode
during the addition to a publication. This change ensures that the
ALTER PUBLICATION command waits for any ongoing transactions on the tables
(to be added to the publication) to be completed before proceeding. As a
result, transactions initiated before the publication alteration are
correctly included in the replication process.

Reported-by: Tomas Vondra
Diagnosed-by: Andres Freund
Author: Vignesh C, Tomas Vondra
Reviewed-by: Amit Kapila
Backpatch-through: 12
Discussion: https://postgr.es/m/de52b282-1166-1180-45a2-8d8917ca74c6@enterprisedb.com
---
 src/backend/commands/publicationcmds.c |  16 ++-
 src/test/subscription/t/100_bugs.pl    | 161 ++++++++++++++++++++++++-
 2 files changed, 171 insertions(+), 6 deletions(-)

diff --git a/src/backend/commands/publicationcmds.c b/src/backend/commands/publicationcmds.c
index 7ee8825522..8135db2cc0 100644
--- a/src/backend/commands/publicationcmds.c
+++ b/src/backend/commands/publicationcmds.c
@@ -548,8 +548,14 @@ RemovePublicationRelById(Oid proid)
 
 /*
  * Open relations specified by a RangeVar list.
- * The returned tables are locked in ShareUpdateExclusiveLock mode in order to
- * add them to a publication.
+ *
+ * The returned tables are locked in ShareRowExclusiveLock mode to add them
+ * to a publication. The table needs to be locked in ShareRowExclusiveLock
+ * mode to ensure that any ongoing transactions involving that table are
+ * completed before adding it to the publication. Otherwise, the transaction
+ * initiated before the alteration of the publication will continue to use a
+ * catalog snapshot predating the publication change, leading to
+ * non-replication of these transaction changes.
  */
 static List *
 OpenTableList(List *tables)
@@ -571,7 +577,7 @@ OpenTableList(List *tables)
 		/* Allow query cancel in case this takes a long time */
 		CHECK_FOR_INTERRUPTS();
 
-		rel = table_openrv(rv, ShareUpdateExclusiveLock);
+		rel = table_openrv(rv, ShareRowExclusiveLock);
 		myrelid = RelationGetRelid(rel);
 
 		/*
@@ -583,7 +589,7 @@ OpenTableList(List *tables)
 		 */
 		if (list_member_oid(relids, myrelid))
 		{
-			table_close(rel, ShareUpdateExclusiveLock);
+			table_close(rel, ShareRowExclusiveLock);
 			continue;
 		}
 
@@ -601,7 +607,7 @@ OpenTableList(List *tables)
 			List	   *children;
 			ListCell   *child;
 
-			children = find_all_inheritors(myrelid, ShareUpdateExclusiveLock,
+			children = find_all_inheritors(myrelid, ShareRowExclusiveLock,
 										   NULL);
 
 			foreach(child, children)
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 7ebb97bbcf..bdbacef33c 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -3,7 +3,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 9;
+use Test::More tests => 13;
 
 # Bug #15114
 
@@ -291,3 +291,162 @@ is( $node_subscriber_d_cols->safe_psql(
 
 $node_publisher_d_cols->stop('fast');
 $node_subscriber_d_cols->stop('fast');
+
+# The bug was that incremental data synchronization was skipped when a new
+# table was added to the publication in the presence of a concurrent active
+# transaction performing DML on the same table.
+my $node_publisher1 = get_new_node('node_publisher1');
+$node_publisher1->init(allows_streaming => 'logical');
+$node_publisher1->start;
+
+my $node_subscriber1 = get_new_node('node_subscriber1');
+$node_subscriber1->init(allows_streaming => 'logical');
+$node_subscriber1->start;
+
+$publisher_connstr = $node_publisher1->connstr . ' dbname=postgres';
+$node_publisher1->safe_psql('postgres',
+	"CREATE PUBLICATION pub1");
+$node_subscriber1->safe_psql('postgres',
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+);
+
+# Create tables in publisher and subscriber.
+$node_publisher1->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE TABLE tab1_conc(a int);
+	CREATE TABLE tab1_conc_child() inherits (tab1_conc);
+));
+
+$node_subscriber1->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE TABLE tab1_conc(a int);
+	CREATE TABLE tab1_conc_child() inherits (tab1_conc);
+));
+
+# Bump the query timeout to avoid false negatives on slow test systems.
+my $psql_timeout_secs = 4 * $PostgreSQL::Test::Utils::timeout_default;
+
+# Initiate a background session that keeps a transaction active.
+my $background_psql1 = $node_publisher1->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+
+# Maintain an active transaction with the table.
+$background_psql1->set_query_timer_restart();
+$background_psql1->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO tab_conc VALUES (1);
+]);
+
+# Add the table to the publication using background_psql, as the alter
+# publication operation will wait for the lock and can only be completed after
+# the previous open transaction is committed.
+my $background_psql2 = $node_publisher1->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+
+$background_psql2->set_query_timer_restart();
+
+# This operation will wait because there is an open transaction holding a lock.
+$background_psql2->query_until(qr//,
+	"ALTER PUBLICATION pub1 ADD TABLE tab_conc;\n");
+
+# Verify that the table addition is waiting to acquire a ShareRowExclusiveLock.
+$node_publisher1->poll_query_until('postgres',
+	"SELECT COUNT(1) = 1 FROM pg_locks WHERE relation = 'tab_conc'::regclass AND mode = 'ShareRowExclusiveLock';"
+  )
+  or die
+  "Timed out while waiting for alter publication tries to wait on ShareRowExclusiveLock";
+
+# Complete the old transaction.
+$background_psql1->query_safe(qq[COMMIT]);
+
+# Maintain an active transaction with inheritance table.
+$background_psql1->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO tab1_conc_child VALUES (1);
+]);
+
+# Add an inheritance table to the publication; this operation will wait because
+# there is an open transaction holding a lock.
+$background_psql2->query_until(qr//,
+	"ALTER PUBLICATION pub1 ADD TABLE tab1_conc;\n");
+
+# Verify that the child table addition is waiting to acquire a
+# ShareRowExclusiveLock.
+$node_publisher1->poll_query_until('postgres',
+	"SELECT COUNT(1) = 1 FROM pg_locks WHERE relation = 'tab1_conc_child'::regclass AND mode = 'ShareRowExclusiveLock';"
+  )
+  or die
+  "Timed out while waiting for alter publication tries to wait on ShareRowExclusiveLock";
+
+# Complete the old transaction.
+$background_psql1->query_safe(qq[COMMIT]);
+$background_psql1->quit;
+
+# Wait till the tables are added to the publication.
+$node_publisher1->poll_query_until('postgres',
+	"SELECT COUNT(1) = 2 FROM pg_publication_rel WHERE prrelid IN ('tab_conc'::regclass, 'tab1_conc'::regclass);"
+  )
+  or die
+  "Timed out while waiting for alter publication to add the table to the publication";
+
+$background_psql1->quit;
+
+$node_publisher1->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (2);
+	INSERT INTO tab1_conc_child VALUES (2);
+));
+
+# Refresh the publication.
+$node_subscriber1->safe_psql('postgres',
+	'ALTER SUBSCRIPTION sub1 REFRESH PUBLICATION');
+
+$node_subscriber1->wait_for_subscription_sync($node_publisher1, 'sub1');
+
+my $result = $node_subscriber1->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2),
+	'Ensure that the data from the tab_conc table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+$result =
+  $node_subscriber1->safe_psql('postgres', "SELECT * FROM tab1_conc_child");
+is( $result, qq(1
+2),
+	'Ensure that the data from the tab1_conc_child table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+# Perform an insert.
+$node_publisher1->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (3);
+	INSERT INTO tab1_conc_child VALUES (3);
+));
+$node_publisher1->wait_for_catchup('sub1');
+
+# Verify that the insert is replicated to the subscriber.
+$result = $node_subscriber1->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+$result =
+  $node_subscriber1->safe_psql('postgres', "SELECT * FROM tab1_conc_child");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table tab1_conc_child added after table synchronization is replicated to the subscriber'
+);
+
+$node_publisher1->stop('fast');
+$node_subscriber1->stop('fast');
-- 
2.34.1

v4-0001-Fix-data-loss-during-initial-sync-in-logical-repl_PG16.patch (text/x-patch)
From acd506f960cb1851e47ed7c3966b71618d6d2182 Mon Sep 17 00:00:00 2001
From: Vignesh C <vignesh21@gmail.com>
Date: Wed, 10 Jul 2024 20:58:04 +0530
Subject: [PATCH v5] Fix data loss during initial sync in logical replication.

Previously, when adding tables to a publication in PostgreSQL, they were
locked using ShareUpdateExclusiveLock mode. This mode allowed the lock to
succeed even if there were ongoing DML transactions on that table. As a
consequence, the ALTER PUBLICATION command could be completed before these
transactions, leading to a scenario where the catalog snapshot used for
replication did not include changes from transactions initiated before the
alteration.

To fix this issue, tables are now locked using ShareRowExclusiveLock mode
during the addition to a publication. This change ensures that the
ALTER PUBLICATION command waits for any ongoing transactions on the tables
(to be added to the publication) to be completed before proceeding. As a
result, transactions initiated before the publication alteration are
correctly included in the replication process.

Reported-by: Tomas Vondra
Diagnosed-by: Andres Freund
Author: Vignesh C, Tomas Vondra
Reviewed-by: Amit Kapila
Backpatch-through: 12
Discussion: https://postgr.es/m/de52b282-1166-1180-45a2-8d8917ca74c6@enterprisedb.com
---
 src/backend/commands/publicationcmds.c |  16 ++-
 src/test/subscription/t/100_bugs.pl    | 142 +++++++++++++++++++++++++
 2 files changed, 153 insertions(+), 5 deletions(-)

diff --git a/src/backend/commands/publicationcmds.c b/src/backend/commands/publicationcmds.c
index f4ba572697..4bd56edb9b 100644
--- a/src/backend/commands/publicationcmds.c
+++ b/src/backend/commands/publicationcmds.c
@@ -1549,8 +1549,14 @@ RemovePublicationSchemaById(Oid psoid)
 
 /*
  * Open relations specified by a PublicationTable list.
- * The returned tables are locked in ShareUpdateExclusiveLock mode in order to
- * add them to a publication.
+ *
+ * The returned tables are locked in ShareRowExclusiveLock mode to add them
+ * to a publication. The table needs to be locked in ShareRowExclusiveLock
+ * mode to ensure that any ongoing transactions involving that table are
+ * completed before adding it to the publication. Otherwise, the transaction
+ * initiated before the alteration of the publication will continue to use a
+ * catalog snapshot predating the publication change, leading to
+ * non-replication of these transaction changes.
  */
 static List *
 OpenTableList(List *tables)
@@ -1575,7 +1581,7 @@ OpenTableList(List *tables)
 		/* Allow query cancel in case this takes a long time */
 		CHECK_FOR_INTERRUPTS();
 
-		rel = table_openrv(t->relation, ShareUpdateExclusiveLock);
+		rel = table_openrv(t->relation, ShareRowExclusiveLock);
 		myrelid = RelationGetRelid(rel);
 
 		/*
@@ -1601,7 +1607,7 @@ OpenTableList(List *tables)
 						 errmsg("conflicting or redundant column lists for table \"%s\"",
 								RelationGetRelationName(rel))));
 
-			table_close(rel, ShareUpdateExclusiveLock);
+			table_close(rel, ShareRowExclusiveLock);
 			continue;
 		}
 
@@ -1629,7 +1635,7 @@ OpenTableList(List *tables)
 			List	   *children;
 			ListCell   *child;
 
-			children = find_all_inheritors(myrelid, ShareUpdateExclusiveLock,
+			children = find_all_inheritors(myrelid, ShareRowExclusiveLock,
 										   NULL);
 
 			foreach(child, children)
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 091da5a506..1d087a74c0 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -488,6 +488,148 @@ $result = $node_subscriber->safe_psql('postgres',
 is($result, qq(2|f
 3|t), 'check replicated update on subscriber');
 
+# The bug was that incremental data synchronization was skipped when a new
+# table was added to the publication in the presence of a concurrent active
+# transaction performing DML on the same table.
+
+# Create tables in publisher and subscriber.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE TABLE tab1_conc(a int);
+	CREATE TABLE tab1_conc_child() inherits (tab1_conc);
+));
+
+$node_subscriber->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE TABLE tab1_conc(a int);
+	CREATE TABLE tab1_conc_child() inherits (tab1_conc);
+));
+
+# Bump the query timeout to avoid false negatives on slow test systems.
+my $psql_timeout_secs = 4 * $PostgreSQL::Test::Utils::timeout_default;
+
+# Initiate a background session that keeps a transaction active.
+my $background_psql1 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+
+# Maintain an active transaction with the table.
+$background_psql1->set_query_timer_restart();
+$background_psql1->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO tab_conc VALUES (1);
+]);
+
+# Add the table to the publication using background_psql, as the alter
+# publication operation will wait for the lock and can only be completed after
+# the previous open transaction is committed.
+my $background_psql2 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+
+$background_psql2->set_query_timer_restart();
+
+# This operation will wait because there is an open transaction holding a lock.
+$background_psql2->query_until(qr//,
+	"ALTER PUBLICATION pub1 ADD TABLE tab_conc;\n");
+
+# Verify that the table addition is waiting to acquire a ShareRowExclusiveLock.
+$node_publisher->poll_query_until('postgres',
+	"SELECT COUNT(1) = 1 FROM pg_locks WHERE relation = 'tab_conc'::regclass AND mode = 'ShareRowExclusiveLock' AND waitstart IS NOT NULL;"
+  )
+  or die
+  "Timed out while waiting for alter publication tries to wait on ShareRowExclusiveLock";
+
+# Complete the old transaction.
+$background_psql1->query_safe(qq[COMMIT]);
+
+# Maintain an active transaction with inheritance table.
+$background_psql1->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO tab1_conc_child VALUES (1);
+]);
+
+# Add an inheritance table to the publication; this operation will wait because
+# there is an open transaction holding a lock.
+$background_psql2->query_until(qr//,
+	"ALTER PUBLICATION pub1 ADD TABLE tab1_conc;\n");
+
+# Verify that the child table addition is waiting to acquire a
+# ShareRowExclusiveLock.
+$node_publisher->poll_query_until('postgres',
+	"SELECT COUNT(1) = 1 FROM pg_locks WHERE relation = 'tab1_conc_child'::regclass AND mode = 'ShareRowExclusiveLock' AND waitstart IS NOT NULL;"
+  )
+  or die
+  "Timed out while waiting for alter publication tries to wait on ShareRowExclusiveLock";
+
+# Complete the old transaction.
+$background_psql1->query_safe(qq[COMMIT]);
+$background_psql1->quit;
+
+# Wait till the tables are added to the publication.
+$node_publisher->poll_query_until('postgres',
+	"SELECT COUNT(1) = 2 FROM pg_publication_rel WHERE prrelid IN ('tab_conc'::regclass, 'tab1_conc'::regclass);"
+  )
+  or die
+  "Timed out while waiting for alter publication to add the table to the publication";
+
+$background_psql1->quit;
+
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (2);
+	INSERT INTO tab1_conc_child VALUES (2);
+));
+
+# Refresh the publication.
+$node_subscriber->safe_psql('postgres',
+	'ALTER SUBSCRIPTION sub1 REFRESH PUBLICATION');
+
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'sub1');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2),
+	'Ensure that the data from the tab_conc table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM tab1_conc_child");
+is( $result, qq(1
+2),
+	'Ensure that the data from the tab1_conc_child table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+# Perform an insert.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (3);
+	INSERT INTO tab1_conc_child VALUES (3);
+));
+$node_publisher->wait_for_catchup('sub1');
+
+# Verify that the insert is replicated to the subscriber.
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM tab1_conc_child");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table tab1_conc_child added after table synchronization is replicated to the subscriber'
+);
+
 $node_publisher->stop('fast');
 $node_subscriber->stop('fast');
 
-- 
2.34.1

v2_issue_reproduce_testcase_head.patch (text/x-patch)
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index cb36ca7b16..bd8c305c7d 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -487,6 +487,95 @@ $result =
 is( $result, qq(2|f
 3|t), 'check replicated update on subscriber');
 
+# Incremental data synchronization is skipped when a new table is added while
+# there is a concurrent active transaction involving the same table.
+
+# Create table in publisher and subscriber.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE TABLE tab1_conc (a int);
+	CREATE TABLE tab1_conc_child () inherits (tab1_conc);
+));
+$node_subscriber->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE TABLE tab1_conc (a int);
+	CREATE TABLE tab1_conc_child () inherits (tab1_conc);
+));
+
+# Bump the query timeout to avoid false negatives on slow test systems.
+my $psql_timeout_secs = 4 * $PostgreSQL::Test::Utils::timeout_default;
+
+# Initiate a background session that keeps a transaction active.
+my $background_psql1 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+
+# Maintain an active transaction with the table.
+$background_psql1->set_query_timer_restart();
+$background_psql1->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO tab_conc VALUES (1);
+	INSERT INTO tab1_conc_child VALUES (1);
+]);
+
+# Add the table to the publication from background_psql
+my $background_psql2 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+
+$background_psql2->set_query_timer_restart();
+
+# This will wait, as there is an open transaction holding a lock.
+$background_psql2->query_until(qr//, "ALTER PUBLICATION pub1 ADD TABLE tab_conc, tab1_conc;\n");
+
+$node_publisher->poll_query_until('postgres',
+"SELECT COUNT(1) = 2 FROM pg_publication_rel WHERE prrelid = 'tab_conc'::regclass OR prrelid = 'tab1_conc'::regclass;"
+  )
+  or die
+  "Timed out while waiting for the table tab_conc is added to pg_publication_rel";
+
+# Complete the old transaction.
+$background_psql1->query_safe(qq[COMMIT]);
+
+$background_psql1->query_safe(qq[INSERT INTO tab_conc VALUES (2)]);
+$background_psql1->query_safe(qq[INSERT INTO tab1_conc_child VALUES (2)]);
+
+$background_psql1->quit;
+
+# Refresh the publication
+$node_subscriber->safe_psql('postgres',
+	'ALTER SUBSCRIPTION sub1 REFRESH PUBLICATION');
+
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'sub1');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2), 'Ensure that the data from the tab_conc table is synchronized to the subscriber after the subscription is refreshed');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab1_conc_child");
+is( $result, qq(1
+2), 'Ensure that the data from the tab1_conc_child table is synchronized to the subscriber after the subscription is refreshed');
+
+$node_publisher->safe_psql('postgres', "INSERT INTO tab_conc values(3)");
+$node_publisher->safe_psql('postgres', "INSERT INTO tab1_conc_child values(3)");
+
+$node_publisher->wait_for_catchup('sub1');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3), 'Verify that the incremental data added after table synchronization is replicated to the subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab1_conc_child");
+is( $result, qq(1
+2
+3), 'Verify that the incremental data added after table synchronization is replicated to the subscriber');
+
 $node_publisher->stop('fast');
 $node_subscriber->stop('fast');
 
#22Nitin Motiani
nitinmotiani@google.com
In reply to: vignesh C (#21)
4 attachment(s)
Re: long-standing data loss bug in initial sync of logical replication

On Wed, Jul 10, 2024 at 10:39 PM vignesh C <vignesh21@gmail.com> wrote:

On Wed, 10 Jul 2024 at 12:28, Amit Kapila <amit.kapila16@gmail.com> wrote:

The patch missed to use the ShareRowExclusiveLock for partitions, see
attached. I haven't tested it but they should also face the same
problem. Apart from that, I have changed the comments in a few places
in the patch.

I could not hit the updated ShareRowExclusiveLock changes through the
partition table, instead I could verify it using the inheritance
table. Added a test for the same and also attaching the backbranch
patch.

Hi,

I tested alternative-experimental-fix-lock.patch provided by Tomas
(replaces SUE with SRE in OpenTableList). I believe there are a couple
of scenarios the patch does not cover.

1. It doesn't handle the case of "ALTER PUBLICATION <pub> ADD TABLES
IN SCHEMA <schema>".

I took crash-test.sh provided by Tomas and modified it to add all
tables in the schema to the publication using the following command:

ALTER PUBLICATION p ADD TABLES IN SCHEMA public

The modified script is attached (crash-test-with-schema.sh). With this
script, I can reproduce the issue even with the patch applied. This is
because the code path to add a schema to the publication doesn't go
through OpenTableList.
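
To make the race concrete, here is a minimal sketch of the interleaving
(publication p, subscription s and table public.t are illustrative; the
script just automates hitting this timing under load):

-- session 1 on the publisher: keep a transaction open
BEGIN;
INSERT INTO public.t VALUES (1);   -- takes RowExclusiveLock on public.t

-- session 2 on the publisher: returns immediately, because this code path
-- does not lock the individual tables in the schema
ALTER PUBLICATION p ADD TABLES IN SCHEMA public;

-- session 1: decoded under the old catalog snapshot, so this row can end
-- up missing on the subscriber
COMMIT;

-- subscriber:
ALTER SUBSCRIPTION s REFRESH PUBLICATION;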

I have also attached a script run-test-with-schema.sh to run
crash-test-with-schema.sh in a loop with randomly generated parameters
(modified from run.sh provided by Tomas).

2. The second issue is a deadlock which happens when the alter
publication command is run for a comma-separated list of tables.

I created another script, crash-test-tables-order-reverse.sh. This
script runs a command like the following:

ALTER PUBLICATION p ADD TABLE test_2,test_1

Running the above script, I was able to get a deadlock error (the
output is attached in deadlock.txt). In the alter publication command,
I added the tables in the reverse order to increase the probability of
the deadlock. But it should happen with any order of tables.
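
For illustration, the interleaving that can produce the deadlock once the
SRE lock is taken looks roughly like this (test_1/test_2 as in the script;
the concurrent transaction stands in for what pgbench does):

-- session 1: a pgbench-style transaction
BEGIN;
INSERT INTO test_1 VALUES (1);     -- RowExclusiveLock on test_1

-- session 2: takes ShareRowExclusiveLock on test_2 first, then blocks
-- waiting for the lock on test_1
ALTER PUBLICATION p ADD TABLE test_2, test_1;

-- session 1: now blocks on session 2's lock on test_2, and the deadlock
-- detector aborts one of the two sessions
INSERT INTO test_2 VALUES (1);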

I am not sure if the deadlock is a major issue because detecting the
deadlock is better than data loss. The schema issue is probably more
important. I didn't test it out with the latest patches sent by
Vignesh but since the code changes in that patch are also in
OpenTableList, I think the schema scenario won't be covered by those.

Thanks & Regards,
Nitin Motiani
Google

Attachments:

deadlock.txt (text/plain)
crash-test-with-schema.sh (application/x-sh)
run-test-with-schema.sh (application/x-sh)
crash-test-tables-order-reverse.sh (application/x-sh)
#23Nitin Motiani
nitinmotiani@google.com
In reply to: Nitin Motiani (#22)
3 attachment(s)
Re: long-standing data loss bug in initial sync of logical replication

On Wed, Jul 10, 2024 at 11:22 PM Nitin Motiani <nitinmotiani@google.com> wrote:

On Wed, Jul 10, 2024 at 10:39 PM vignesh C <vignesh21@gmail.com> wrote:

On Wed, 10 Jul 2024 at 12:28, Amit Kapila <amit.kapila16@gmail.com> wrote:

The patch missed to use the ShareRowExclusiveLock for partitions, see
attached. I haven't tested it but they should also face the same
problem. Apart from that, I have changed the comments in a few places
in the patch.

I could not hit the updated ShareRowExclusiveLock changes through the
partition table, instead I could verify it using the inheritance
table. Added a test for the same and also attaching the backbranch
patch.

Hi,

I tested alternative-experimental-fix-lock.patch provided by Tomas
(replaces SUE with SRE in OpenTableList). I believe there are a couple
of scenarios the patch does not cover.

1. It doesn't handle the case of "ALTER PUBLICATION <pub> ADD TABLES
IN SCHEMA <schema>".

I took crash-test.sh provided by Tomas and modified it to add all
tables in the schema to publication using the following command :

ALTER PUBLICATION p ADD TABLES IN SCHEMA public

The modified script is attached (crash-test-with-schema.sh). With this
script, I can reproduce the issue even with the patch applied. This is
because the code path to add a schema to the publication doesn't go
through OpenTableList.

I have also attached a script run-test-with-schema.sh to run
crash-test-with-schema.sh in a loop with randomly generated parameters
(modified from run.sh provided by Tomas).

2. The second issue is a deadlock which happens when the alter
publication command is run for a comma separated list of tables.

I created another script create-test-tables-order-reverse.sh. This
script runs a command like the following :

ALTER PUBLICATION p ADD TABLE test_2,test_1

Running the above script, I was able to get a deadlock error (the
output is attached in deadlock.txt). In the alter publication command,
I added the tables in the reverse order to increase the probability of
the deadlock. But it should happen with any order of tables.

I am not sure if the deadlock is a major issue because detecting the
deadlock is better than data loss. The schema issue is probably more
important. I didn't test it out with the latest patches sent by
Vignesh but since the code changes in that patch are also in
OpenTableList, I think the schema scenario won't be covered by those.

Hi,

I looked further into the scenario of adding the tables in a schema to
the publication. Since in that case, the entry is added to
pg_publication_namespace instead of pg_publication_rel, the codepaths
for 'add table' and 'add tables in schema' are different. And in the
'add tables in schema' scenario, the OpenTableList function is not
called to get the relation ids. Therefore even with the proposed
patch, the data loss issue still persists in that case.

To validate this idea, I tried locking all the affected tables in the
schema just before the invalidation for those relations (in
ShareRowExclusiveLock mode). I am attaching the small patch for that
(alter_pub_for_schema.patch) where the change is made in the function
publication_add_schema in pg_publication.c. I am not sure if this is
the best place to make this change or if it is the right fix. It is
conceptually similar to the proposed change in OpenTableList but here
we are not just changing the lockmode but taking locks which were not
taken before. But with this change, the data loss errors went away in
my test script.

Another issue which persists with this change is the deadlock. Since
multiple table locks are acquired, the test script detects deadlock a
few times. Therefore I'm also attaching another modified script which
does a few retries in case of deadlock. The script is
crash-test-with-retries-for-schema.sh. It runs the following command
in a retry loop :

ALTER PUBLICATION p ADD TABLES IN SCHEMA public

If the command fails, it sleeps for a random amount of time (bounded
by a MAXWAIT parameter) and then retries the command. If it
fails to run the command in the max number of retries, the final
return value from the script is DEADLOCK as we can't do a consistency
check in this scenario. Also attached is another script
run-with-deadlock-detection.sh which can run the above script for
multiple iterations.

I tried the test scripts with and without alter_pub_for_schema.patch.
Without the patch, I get the final output ERROR the majority of the time,
which means that the publication was altered successfully but the data
was lost on the subscriber. When I run it with the patch, I get a mix
of OK (no data loss) and DEADLOCK (the publication was not altered)
but no ERROR. I think by changing the parameters of sleep time and
number of retries we can get different fractions of OK and DEADLOCK.

I am not sure if this is the right or a clean way to fix the issue but
I think conceptually this might be the right direction. Please let me
know if my understanding is wrong or if I'm missing something.

Thanks & Regards,
Nitin Motiani
Google

Attachments:

alter_pub_for_schema.patch (application/octet-stream)
diff --git a/src/backend/catalog/pg_publication.c b/src/backend/catalog/pg_publication.c
index 0602398a54..067336d02c 100644
--- a/src/backend/catalog/pg_publication.c
+++ b/src/backend/catalog/pg_publication.c
@@ -32,6 +32,7 @@
 #include "catalog/pg_type.h"
 #include "commands/publicationcmds.h"
 #include "funcapi.h"
+#include "storage/lmgr.h"
 #include "utils/array.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
@@ -614,6 +615,7 @@ publication_add_schema(Oid pubid, Oid schemaid, bool if_not_exists)
 	List	   *schemaRels = NIL;
 	ObjectAddress myself,
 				referenced;
+	ListCell* lc;
 
 	rel = table_open(PublicationNamespaceRelationId, RowExclusiveLock);
 
@@ -677,6 +679,9 @@ publication_add_schema(Oid pubid, Oid schemaid, bool if_not_exists)
 	 */
 	schemaRels = GetSchemaPublicationRelations(schemaid,
 											   PUBLICATION_PART_ALL);
+	foreach(lc, schemaRels) {
+		LockRelationOid(lfirst_oid(lc), ShareRowExclusiveLock);
+	}
 	InvalidatePublicationRels(schemaRels);
 
 	return myself;
run-with-deadlock-detection.sh (text/x-sh)
crash-test-with-retries-for-schema.sh (text/x-sh)
#24Amit Kapila
amit.kapila16@gmail.com
In reply to: Nitin Motiani (#23)
Re: long-standing data loss bug in initial sync of logical replication

On Thu, Jul 11, 2024 at 6:19 PM Nitin Motiani <nitinmotiani@google.com> wrote:

On Wed, Jul 10, 2024 at 11:22 PM Nitin Motiani <nitinmotiani@google.com> wrote:

On Wed, Jul 10, 2024 at 10:39 PM vignesh C <vignesh21@gmail.com> wrote:

On Wed, 10 Jul 2024 at 12:28, Amit Kapila <amit.kapila16@gmail.com> wrote:

The patch missed to use the ShareRowExclusiveLock for partitions, see
attached. I haven't tested it but they should also face the same
problem. Apart from that, I have changed the comments in a few places
in the patch.

I could not hit the updated ShareRowExclusiveLock changes through the
partition table, instead I could verify it using the inheritance
table. Added a test for the same and also attaching the backbranch
patch.

Hi,

I tested alternative-experimental-fix-lock.patch provided by Tomas
(replaces SUE with SRE in OpenTableList). I believe there are a couple
of scenarios the patch does not cover.

1. It doesn't handle the case of "ALTER PUBLICATION <pub> ADD TABLES
IN SCHEMA <schema>".

I took crash-test.sh provided by Tomas and modified it to add all
tables in the schema to publication using the following command :

ALTER PUBLICATION p ADD TABLES IN SCHEMA public

The modified script is attached (crash-test-with-schema.sh). With this
script, I can reproduce the issue even with the patch applied. This is
because the code path to add a schema to the publication doesn't go
through OpenTableList.

I have also attached a script run-test-with-schema.sh to run
crash-test-with-schema.sh in a loop with randomly generated parameters
(modified from run.sh provided by Tomas).

2. The second issue is a deadlock which happens when the alter
publication command is run for a comma separated list of tables.

I created another script create-test-tables-order-reverse.sh. This
script runs a command like the following :

ALTER PUBLICATION p ADD TABLE test_2,test_1

Running the above script, I was able to get a deadlock error (the
output is attached in deadlock.txt). In the alter publication command,
I added the tables in the reverse order to increase the probability of
the deadlock. But it should happen with any order of tables.

I am not sure if the deadlock is a major issue because detecting the
deadlock is better than data loss.

The deadlock reported in this case is expected behavior. This is no
different than locking tables or rows in reverse order.

I looked further into the scenario of adding the tables in schema to
the publication. Since in that case, the entry is added to
pg_publication_namespace instead of pg_publication_rel, the codepaths
for 'add table' and 'add tables in schema' are different. And in the
'add tables in schema' scenario, the OpenTableList function is not
called to get the relation ids. Therefore even with the proposed
patch, the data loss issue still persists in that case.

To validate this idea, I tried locking all the affected tables in the
schema just before the invalidation for those relations (in
ShareRowExclusiveLock mode).

This sounds like a reasonable approach to fix the issue. However, we
should check SET publication_object as well, especially the drop part
in it. It should not happen that we miss sending the data for ADD but
for DROP, we send data when we shouldn't have sent it.
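
To spell out the DROP side of that concern (names are illustrative, and
the exact behavior depends on where the invalidation lands relative to the
commit):

-- publication p currently publishes t1 and t2

-- session 1 on the publisher:
BEGIN;
INSERT INTO t2 VALUES (1);

-- session 2 on the publisher: removes t2 from the publication; without a
-- conflicting lock on t2 this returns immediately
ALTER PUBLICATION p SET TABLE t1;

-- session 1: decoded under the old catalog snapshot, in which t2 is still
-- part of p, so the row can be sent even though t2 has already been
-- dropped from the publication
COMMIT;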

--
With Regards,
Amit Kapila.

#25vignesh C
vignesh21@gmail.com
In reply to: Amit Kapila (#24)
1 attachment(s)
Re: long-standing data loss bug in initial sync of logical replication

On Mon, 15 Jul 2024 at 15:31, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jul 11, 2024 at 6:19 PM Nitin Motiani <nitinmotiani@google.com> wrote:

On Wed, Jul 10, 2024 at 11:22 PM Nitin Motiani <nitinmotiani@google.com> wrote:

On Wed, Jul 10, 2024 at 10:39 PM vignesh C <vignesh21@gmail.com> wrote:

On Wed, 10 Jul 2024 at 12:28, Amit Kapila <amit.kapila16@gmail.com> wrote:

The patch missed to use the ShareRowExclusiveLock for partitions, see
attached. I haven't tested it but they should also face the same
problem. Apart from that, I have changed the comments in a few places
in the patch.

I could not hit the updated ShareRowExclusiveLock changes through the
partition table, instead I could verify it using the inheritance
table. Added a test for the same and also attaching the backbranch
patch.

Hi,

I tested alternative-experimental-fix-lock.patch provided by Tomas
(replaces SUE with SRE in OpenTableList). I believe there are a couple
of scenarios the patch does not cover.

1. It doesn't handle the case of "ALTER PUBLICATION <pub> ADD TABLES
IN SCHEMA <schema>".

I took crash-test.sh provided by Tomas and modified it to add all
tables in the schema to publication using the following command :

ALTER PUBLICATION p ADD TABLES IN SCHEMA public

The modified script is attached (crash-test-with-schema.sh). With this
script, I can reproduce the issue even with the patch applied. This is
because the code path to add a schema to the publication doesn't go
through OpenTableList.

I have also attached a script run-test-with-schema.sh to run
crash-test-with-schema.sh in a loop with randomly generated parameters
(modified from run.sh provided by Tomas).

2. The second issue is a deadlock which happens when the alter
publication command is run for a comma separated list of tables.

I created another script create-test-tables-order-reverse.sh. This
script runs a command like the following :

ALTER PUBLICATION p ADD TABLE test_2,test_1

Running the above script, I was able to get a deadlock error (the
output is attached in deadlock.txt). In the alter publication command,
I added the tables in the reverse order to increase the probability of
the deadlock. But it should happen with any order of tables.

I am not sure if the deadlock is a major issue because detecting the
deadlock is better than data loss.

The deadlock reported in this case is expected behavior. This is no
different than locking tables or rows in reverse order.

I looked further into the scenario of adding the tables in schema to
the publication. Since in that case, the entry is added to
pg_publication_namespace instead of pg_publication_rel, the codepaths
for 'add table' and 'add tables in schema' are different. And in the
'add tables in schema' scenario, the OpenTableList function is not
called to get the relation ids. Therefore even with the proposed
patch, the data loss issue still persists in that case.

To validate this idea, I tried locking all the affected tables in the
schema just before the invalidation for those relations (in
ShareRowExclusiveLock mode).

This sounds like a reasonable approach to fix the issue. However, we
should check SET publication_object as well, especially the drop part
in it. It should not happen that we miss sending the data for ADD but
for DROP, we send data when we shouldn't have sent it.

There were a few other scenarios, similar to the one you mentioned,
where the issue occurred. For example: a) When specifying a subset of
the existing tables in the ALTER PUBLICATION ... SET TABLE command, the
tables that were supposed to be removed from the publication were not
locked in ShareRowExclusiveLock mode. b) The ALTER PUBLICATION ...
DROP TABLES IN SCHEMA command did not lock the relations that would be
removed from the publication in ShareRowExclusiveLock mode. Both of
these scenarios resulted in data inconsistency due to inadequate
locking. The attached patch addresses these issues.
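
For example, scenario (b) can be reduced to something like this (names are
illustrative):

-- publication p currently includes TABLES IN SCHEMA sch1

-- session 1 on the publisher:
BEGIN;
INSERT INTO sch1.t VALUES (1);

-- session 2 on the publisher: previously returned without locking the
-- tables in sch1
ALTER PUBLICATION p DROP TABLES IN SCHEMA sch1;

-- session 1: still decoded as if sch1.t were published, so its change can
-- be replicated even though the table was dropped from the publication
COMMIT;

With the attached patch, session 2 instead waits for session 1 to finish,
so the publication change and the concurrent transaction are serialized.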

Regards,
Vignesh

Attachments:

v5-0001-Fix-data-loss-during-initial-sync-in-logical-repl.patch (text/x-patch)
From e1e79bcf24cacf4f8291692f7815dd323e7b4ab5 Mon Sep 17 00:00:00 2001
From: Vignesh C <vignesh21@gmail.com>
Date: Tue, 9 Jul 2024 19:23:10 +0530
Subject: [PATCH v5] Fix data loss during initial sync in logical replication.

Previously, when adding tables to a publication in PostgreSQL, they were
locked using ShareUpdateExclusiveLock mode. This mode allowed the lock to
succeed even if there were ongoing DML transactions on that table. As a
consequence, the ALTER PUBLICATION command could be completed before these
transactions, leading to a scenario where the catalog snapshot used for
replication did not include changes from transactions initiated before the
alteration.

To fix this issue, tables are now locked using ShareRowExclusiveLock mode
during the addition to a publication. This change ensures that the
ALTER PUBLICATION command waits for any ongoing transactions on the tables
(to be added to the publication) to be completed before proceeding. As a
result, transactions initiated before the publication alteration are
correctly included in the replication process.

A similar problem occurred when adding all the tables in a schema to a
publication while an ongoing DML transaction involved tables in that
schema, as those tables were not locked during ALTER PUBLICATION.

This is now resolved by locking all the tables in the schema in
ShareRowExclusiveLock mode when the schema is added to the publication,
so that adding tables in a schema waits for ongoing transactions in the
same way as adding individual tables does.

Reported-by: Tomas Vondra
Diagnosed-by: Andres Freund
Author: Vignesh C, Tomas Vondra
Reviewed-by: Amit Kapila
Backpatch-through: 12
Discussion: https://postgr.es/m/de52b282-1166-1180-45a2-8d8917ca74c6@enterprisedb.com
---
 src/backend/catalog/pg_publication.c   |   9 +
 src/backend/commands/publicationcmds.c |  34 +-
 src/test/subscription/t/100_bugs.pl    | 481 +++++++++++++++++++++++++
 3 files changed, 518 insertions(+), 6 deletions(-)

diff --git a/src/backend/catalog/pg_publication.c b/src/backend/catalog/pg_publication.c
index 0602398a54..f078b705d6 100644
--- a/src/backend/catalog/pg_publication.c
+++ b/src/backend/catalog/pg_publication.c
@@ -32,6 +32,7 @@
 #include "catalog/pg_type.h"
 #include "commands/publicationcmds.h"
 #include "funcapi.h"
+#include "storage/lmgr.h"
 #include "utils/array.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
@@ -677,6 +678,14 @@ publication_add_schema(Oid pubid, Oid schemaid, bool if_not_exists)
 	 */
 	schemaRels = GetSchemaPublicationRelations(schemaid,
 											   PUBLICATION_PART_ALL);
+
+	/*
+	 * Data loss due to concurrency issues is avoided by locking the
+	 * relation in ShareRowExclusiveLock mode, as described atop OpenTableList.
+	 */
+	foreach_oid(schrelid, schemaRels)
+		LockRelationOid(schrelid, ShareRowExclusiveLock);
+
 	InvalidatePublicationRels(schemaRels);
 
 	return myself;
diff --git a/src/backend/commands/publicationcmds.c b/src/backend/commands/publicationcmds.c
index 6ea709988e..d5cd9e3820 100644
--- a/src/backend/commands/publicationcmds.c
+++ b/src/backend/commands/publicationcmds.c
@@ -1219,8 +1219,14 @@ AlterPublicationTables(AlterPublicationStmt *stmt, HeapTuple tup,
 				oldrel = palloc(sizeof(PublicationRelInfo));
 				oldrel->whereClause = NULL;
 				oldrel->columns = NIL;
+
+				/*
+				 * Data loss due to concurrency issues is avoided by locking
+				 * the relation in ShareRowExclusiveLock mode, as described atop
+				 * OpenTableList.
+				 */
 				oldrel->relation = table_open(oldrelid,
-											  ShareUpdateExclusiveLock);
+											  ShareRowExclusiveLock);
 				delrels = lappend(delrels, oldrel);
 			}
 		}
@@ -1542,8 +1548,14 @@ RemovePublicationSchemaById(Oid psoid)
 
 /*
  * Open relations specified by a PublicationTable list.
- * The returned tables are locked in ShareUpdateExclusiveLock mode in order to
- * add them to a publication.
+ *
+ * The returned tables are locked in ShareRowExclusiveLock mode to add them
+ * to a publication. The table needs to be locked in ShareRowExclusiveLock
+ * mode to ensure that any ongoing transactions involving that table are
+ * completed before adding it to the publication. Otherwise, the transaction
+ * initiated before the alteration of the publication will continue to use a
+ * catalog snapshot predating the publication change, leading to
+ * non-replication of these transaction changes.
  */
 static List *
 OpenTableList(List *tables)
@@ -1568,7 +1580,7 @@ OpenTableList(List *tables)
 		/* Allow query cancel in case this takes a long time */
 		CHECK_FOR_INTERRUPTS();
 
-		rel = table_openrv(t->relation, ShareUpdateExclusiveLock);
+		rel = table_openrv(t->relation, ShareRowExclusiveLock);
 		myrelid = RelationGetRelid(rel);
 
 		/*
@@ -1594,7 +1606,7 @@ OpenTableList(List *tables)
 						 errmsg("conflicting or redundant column lists for table \"%s\"",
 								RelationGetRelationName(rel))));
 
-			table_close(rel, ShareUpdateExclusiveLock);
+			table_close(rel, ShareRowExclusiveLock);
 			continue;
 		}
 
@@ -1622,7 +1634,7 @@ OpenTableList(List *tables)
 			List	   *children;
 			ListCell   *child;
 
-			children = find_all_inheritors(myrelid, ShareUpdateExclusiveLock,
+			children = find_all_inheritors(myrelid, ShareRowExclusiveLock,
 										   NULL);
 
 			foreach(child, children)
@@ -1860,6 +1872,16 @@ PublicationDropSchemas(Oid pubid, List *schemas, bool missing_ok)
 	foreach(lc, schemas)
 	{
 		Oid			schemaid = lfirst_oid(lc);
+		List 		*schemaRels;
+
+		schemaRels = GetSchemaPublicationRelations(schemaid, PUBLICATION_PART_ALL);
+
+		/*
+		 * Data loss due to concurrency issues is avoided by locking the
+		 * relation in ShareRowExclusiveLock mode, as described atop OpenTableList.
+		 */
+		foreach_oid(schrelid, schemaRels)
+			LockRelationOid(schrelid, ShareRowExclusiveLock);
 
 		psid = GetSysCacheOid2(PUBLICATIONNAMESPACEMAP,
 							   Anum_pg_publication_namespace_oid,
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index cb36ca7b16..a087bc2d08 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -487,6 +487,487 @@ $result =
 is( $result, qq(2|f
 3|t), 'check replicated update on subscriber');
 
+# cleanup
+$node_publisher->safe_psql('postgres', qq(DROP PUBLICATION pub1;));
+$node_subscriber->safe_psql('postgres', qq(DROP SUBSCRIPTION sub1;));
+
+# =============================================================================
+# The bug was that incremental data synchronization was skipped when a new
+# table was added to the publication in the presence of a concurrent active
+# transaction performing DML on the same table.
+# =============================================================================
+
+# Initial setup.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE TABLE tab1_conc(a int);
+	CREATE TABLE tab1_conc_child() inherits (tab1_conc);
+	CREATE SCHEMA sch3;
+	CREATE TABLE sch3.tab_conc(a int);
+	CREATE SCHEMA sch4;
+	CREATE TABLE sch4.tab_conc(a int);
+	CREATE PUBLICATION regress_pub1;
+));
+
+$node_subscriber->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE TABLE tab1_conc(a int);
+	CREATE TABLE tab1_conc_child() inherits (tab1_conc);
+	CREATE SCHEMA sch3;
+	CREATE TABLE sch3.tab_conc(a int);
+	CREATE SCHEMA sch4;
+	CREATE TABLE sch4.tab_conc(a int);
+	CREATE SUBSCRIPTION regress_sub1 CONNECTION '$publisher_connstr' PUBLICATION regress_pub1;
+
+));
+
+# Bump the query timeout to avoid false negatives on slow test systems.
+my $psql_timeout_secs = 4 * $PostgreSQL::Test::Utils::timeout_default;
+
+# Initiate a background session that keeps a transaction active.
+my $background_psql1 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+
+# Maintain an active transaction with the table.
+$background_psql1->set_query_timer_restart();
+$background_psql1->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO tab_conc VALUES (1);
+]);
+
+# Add the table to the publication using background_psql, as the alter
+# publication operation will wait for the lock and can only be completed after
+# the previous open transaction is committed.
+my $background_psql2 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+
+$background_psql2->set_query_timer_restart();
+
+# This operation will wait because there is an open transaction holding a lock.
+$background_psql2->query_until(qr//,
+	"ALTER PUBLICATION regress_pub1 ADD TABLE tab_conc;\n");
+
+# Verify that the table addition is waiting to acquire a ShareRowExclusiveLock.
+$node_publisher->poll_query_until('postgres',
+	"SELECT COUNT(1) = 1 FROM pg_locks WHERE relation = 'tab_conc'::regclass AND mode = 'ShareRowExclusiveLock' AND waitstart IS NOT NULL;"
+  )
+  or die
+  "Timed out while waiting for alter publication tries to wait on ShareRowExclusiveLock";
+
+# Complete the old transaction.
+$background_psql1->query_safe(qq[COMMIT]);
+
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (2);
+));
+
+# Refresh the publication.
+$node_subscriber->safe_psql('postgres',
+	'ALTER SUBSCRIPTION regress_sub1 REFRESH PUBLICATION');
+
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'regress_sub1');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2),
+	'Ensure that the data from the tab_conc table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+# Perform an insert.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (3);
+));
+$node_publisher->wait_for_catchup('regress_sub1');
+
+# Verify that the insert is replicated to the subscriber.
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+# =============================================================================
+# This bug is present with inheritance tables as well.
+# =============================================================================
+
+# Maintain an active transaction with inheritance table.
+$background_psql1->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO tab1_conc_child VALUES (1);
+]);
+
+# Add an inheritance table to the publication; this operation will wait because
+# there is an open transaction holding a lock.
+$background_psql2->query_until(qr//,
+	"ALTER PUBLICATION regress_pub1 ADD TABLE tab1_conc;\n");
+
+# Verify that the child table addition is waiting to acquire a
+# ShareRowExclusiveLock.
+$node_publisher->poll_query_until('postgres',
+	"SELECT COUNT(1) = 1 FROM pg_locks WHERE relation = 'tab1_conc_child'::regclass AND mode = 'ShareRowExclusiveLock' AND waitstart IS NOT NULL;"
+  )
+  or die
+  "Timed out while waiting for alter publication tries to wait on ShareRowExclusiveLock";
+
+# Complete the old transaction.
+$background_psql1->query_safe(qq[COMMIT]);
+$background_psql1->quit;
+
+# Wait till the tables are added to the publication.
+$node_publisher->poll_query_until('postgres',
+	"SELECT COUNT(1) = 1 FROM pg_publication_rel WHERE prrelid IN ('tab1_conc'::regclass);"
+  )
+  or die
+  "Timed out while waiting for alter publication to add the table to the publication";
+
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab1_conc_child VALUES (2);
+));
+
+# Refresh the publication.
+$node_subscriber->safe_psql('postgres',
+	'ALTER SUBSCRIPTION regress_sub1 REFRESH PUBLICATION');
+
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'regress_sub1');
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM tab1_conc_child");
+is( $result, qq(1
+2),
+	'Ensure that the data from the tab1_conc_child table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+# Perform an insert.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab1_conc_child VALUES (3);
+));
+$node_publisher->wait_for_catchup('regress_sub1');
+
+# Verify that the insert is replicated to the subscriber.
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM tab1_conc_child");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table tab1_conc_child added after table synchronization is replicated to the subscriber'
+);
+
+# =============================================================================
+# This bug is present with ALTER PUBLICATION ... SET TABLE as well. When only
+# a subset of the tables in the publication is specified, ShareRowExclusiveLock
+# was not taken on the tables being dropped as part of the SET TABLE operation.
+# =============================================================================
+$background_psql1->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO tab1_conc_child VALUES (4);
+]);
+
+# This operation will wait because an open transaction is holding a lock on the
+# publication's relation.
+$background_psql2->query_until(qr//,
+	"ALTER PUBLICATION regress_pub1 SET TABLE tab_conc;\n");
+
+# Check that the tab1_conc_child table, which is set to be removed from the
+# publication, is waiting to acquire a ShareRowExclusiveLock due to the open
+# transaction.
+$node_publisher->poll_query_until('postgres',
+	"SELECT COUNT(1) = 1 FROM pg_locks WHERE relation IN ('tab1_conc_child'::regclass) AND mode = 'ShareRowExclusiveLock' AND waitstart IS NOT NULL;"
+  )
+  or die
+  "Timed out while waiting for alter publication tries to wait on ShareRowExclusiveLock";
+
+# Verify that a ShareRowExclusiveLock is acquired on tab_conc.
+$node_publisher->poll_query_until('postgres',
+	"SELECT COUNT(1) = 1 FROM pg_locks WHERE relation IN ('tab_conc'::regclass) AND mode = 'ShareRowExclusiveLock' AND waitstart IS NULL;"
+  )
+  or die
+  "Timed out while waiting for alter publication tries to wait on ShareRowExclusiveLock";
+
+# Complete the old transaction.
+$background_psql1->query_safe(qq[COMMIT]);
+$background_psql1->quit;
+
+# Wait till the table is removed from the publication.
+$node_publisher->poll_query_until('postgres',
+	"SELECT COUNT(1) = 0 FROM pg_publication_rel WHERE prrelid IN ('tab1_conc_child'::regclass);"
+  )
+  or die
+  "Timed out while waiting for alter publication to add the table to the publication";
+
+$node_publisher->wait_for_catchup('regress_sub1');
+
+# Verify that the insert before SET TABLE is replicated to the subscriber.
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM tab1_conc_child");
+is( $result, qq(1
+2
+3
+4),
+	'Verify that the incremental data for table tab1_conc_child before removing table from publication is replicated to the subscriber'
+);
+
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab1_conc_child VALUES (5);
+));
+
+$node_publisher->wait_for_catchup('regress_sub1');
+
+# Confirm that the insertion following SET TABLE, which removes the
+# relation from the publication, will not be replicated to the subscriber.
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM tab1_conc_child");
+is( $result, qq(1
+2
+3
+4),
+	'Verify that the incremental data for table tab1_conc_child after removing table from publication is not replicated to the subscriber'
+);
+
+# =============================================================================
+# This bug is present with ALTER PUBLICATION ... DROP TABLE.
+# =============================================================================
+$background_psql1->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO tab_conc VALUES (4);
+]);
+
+# This operation will wait because there is an open transaction holding a lock.
+$background_psql2->query_until(qr//,
+	"ALTER PUBLICATION regress_pub1 DROP TABLE tab_conc;\n");
+
+# Verify that the table removal is waiting to acquire a
+# ShareRowExclusiveLock.
+$node_publisher->poll_query_until('postgres',
+	"SELECT COUNT(1) = 1 FROM pg_locks WHERE relation = 'tab_conc'::regclass AND mode = 'ShareRowExclusiveLock' AND waitstart IS NOT NULL;"
+  )
+  or die
+  "Timed out while waiting for alter publication tries to wait on ShareRowExclusiveLock";
+
+# Complete the old transaction.
+$background_psql1->query_safe(qq[COMMIT]);
+$background_psql1->quit;
+
+# Wait till the tables are dropped from the publication.
+$node_publisher->poll_query_until('postgres',
+	"SELECT COUNT(1) = 0 FROM pg_publication_rel WHERE prrelid IN ('tab_conc'::regclass);"
+  )
+  or die
+  "Timed out while waiting for alter publication to add the table to the publication";
+
+$node_publisher->wait_for_catchup('regress_sub1');
+
+# Verify that the insert before drop table is replicated to the subscriber.
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3
+4),
+	'Verify that the incremental data for table tab_conc before removing table from publication is replicated to the subscriber'
+);
+
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (5);
+));
+
+$node_publisher->wait_for_catchup('regress_sub1');
+
+# Verify that the insert after drop table is not replicated to the subscriber.
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3
+4),
+	'Verify that the incremental data for table tab_conc after removing table from publication is not replicated to the subscriber'
+);
+
+# =============================================================================
+# This bug is present with ADD TABLES IN SCHEMA too.
+# =============================================================================
+
+# Maintain an active transaction with a schema table that will be added to the
+# publication.
+$background_psql1->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (1);
+]);
+
+# Add these schemas to the publication; this operation will wait because
+# there is an open transaction holding a lock.
+$background_psql2->query_until(qr//,
+	"ALTER PUBLICATION regress_pub1 ADD TABLES IN SCHEMA sch3, sch4;\n");
+
+# Verify that the schema addition is waiting to acquire a ShareRowExclusiveLock.
+$node_publisher->poll_query_until('postgres',
+	"SELECT COUNT(1) = 1 FROM pg_locks WHERE relation = (SELECT oid FROM pg_class WHERE relname = 'tab_conc' AND relnamespace = 'sch3'::regnamespace) AND mode = 'ShareRowExclusiveLock' AND waitstart IS NOT NULL;"
+  )
+  or die
+  "Timed out while waiting for alter publication tries to wait on ShareRowExclusiveLock";
+
+$background_psql1->query_safe(qq[COMMIT]);
+$background_psql1->quit;
+
+$node_publisher->safe_psql('postgres',
+	qq(INSERT INTO sch3.tab_conc VALUES (2);));
+
+# Refresh the publication.
+$node_subscriber->safe_psql('postgres',
+	'ALTER SUBSCRIPTION regress_sub1 REFRESH PUBLICATION');
+
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'regress_sub1');
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2),
+	'Ensure that the data from the sch3.tab_conc table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+# Perform an insert.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO sch3.tab_conc VALUES (3);
+));
+$node_publisher->wait_for_catchup('regress_sub1');
+
+# Verify that the insert is replicated to the subscriber.
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table sch3.tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+# =============================================================================
+# This bug is present with SET TABLES IN SCHEMA too.
+# =============================================================================
+
+# Maintain an active transaction with a schema table that will be removed from
+# the publication.
+$background_psql1->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO sch4.tab_conc VALUES (1);
+]);
+
+# Set a subset of schema to the publication.
+$background_psql2->query_until(qr//,
+	"ALTER PUBLICATION regress_pub1 SET TABLES IN SCHEMA sch3;\n");
+
+# Verify that the sch4.tab_conc table which will be removed from the
+# publication is waiting to acquire a ShareRowExclusiveLock because of the open
+# transaction.
+$node_publisher->poll_query_until('postgres',
+	"SELECT COUNT(1) = 1 FROM pg_locks WHERE relation = (SELECT oid FROM pg_class WHERE relname = 'tab_conc' AND relnamespace = 'sch4'::regnamespace) AND mode = 'ShareRowExclusiveLock' AND waitstart IS NOT NULL;"
+  )
+  or die
+  "Timed out while waiting for alter publication tries to wait on ShareRowExclusiveLock";
+
+# Complete the old transaction.
+$background_psql1->query_safe(qq[COMMIT]);
+$background_psql1->quit;
+
+# Wait till the table is removed from the publication.
+$node_publisher->poll_query_until('postgres',
+	"SELECT COUNT(1) = 0 FROM pg_publication_namespace WHERE pnnspid IN ('sch4'::regnamespace);"
+  )
+  or die
+  "Timed out while waiting for alter publication to add the table to the publication";
+
+$node_publisher->wait_for_catchup('regress_sub1');
+
+# Verify that the insert before SET TABLES IN SCHEMA is replicated to the subscriber.
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch4.tab_conc");
+is($result, qq(1),
+	'Verify that the incremental data for table sch4.tab_conc before removing table from publication is replicated to the subscriber'
+);
+
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO sch4.tab_conc VALUES (2);
+));
+
+$node_publisher->wait_for_catchup('regress_sub1');
+
+# Verify that the insert after SET TABLES IN SCHEMA is not replicated to the subscriber.
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch4.tab_conc");
+is($result, qq(1),
+	'Verify that the incremental data for table sch4.tab_conc after removing table from publication is not replicated to the subscriber'
+);
+
+# =============================================================================
+# This bug is present with DROP TABLES IN SCHEMA too.
+# =============================================================================
+
+# Maintain an active transaction with a schema table that will be dropped from
+# the publication.
+$background_psql1->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (4);
+]);
+
+# Drop this schema from the publication; this operation will wait because
+# there is an open transaction holding a lock.
+$background_psql2->query_until(qr//,
+	"ALTER PUBLICATION regress_pub1 DROP TABLES IN SCHEMA sch3;\n");
+
+# Verify that the sch3.tab_conc table, which will be dropped from the
+# publication, is waiting to acquire a ShareRowExclusiveLock.
+$node_publisher->poll_query_until('postgres',
+	"SELECT COUNT(1) = 1 FROM pg_locks WHERE relation = (SELECT oid FROM pg_class WHERE relname = 'tab_conc' AND relnamespace = 'sch3'::regnamespace) AND mode = 'ShareRowExclusiveLock' AND waitstart IS NOT NULL;"
+  )
+  or die
+  "Timed out while waiting for alter publication tries to wait on ShareRowExclusiveLock";
+
+$background_psql1->query_safe(qq[COMMIT]);
+$background_psql1->quit;
+
+$node_publisher->wait_for_catchup('regress_sub1');
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2
+3
+4),
+	'Ensure that the data from the sch3.tab_conc table is replicated to the subscriber before drop tables in schema from publication'
+);
+
+$node_publisher->safe_psql('postgres',
+	qq(INSERT INTO sch3.tab_conc VALUES (5);));
+
+$node_publisher->wait_for_catchup('regress_sub1');
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2
+3
+4),
+	'Ensure that the data from the sch3.tab_conc table is not replicated after drop tables in schema from the publication'
+);
+
 $node_publisher->stop('fast');
 $node_subscriber->stop('fast');
 
-- 
2.34.1

#26Nitin Motiani
nitinmotiani@google.com
In reply to: vignesh C (#25)
Re: long-standing data loss bug in initial sync of logical replication

On Mon, Jul 15, 2024 at 11:42 PM vignesh C <vignesh21@gmail.com> wrote:

On Mon, 15 Jul 2024 at 15:31, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jul 11, 2024 at 6:19 PM Nitin Motiani <nitinmotiani@google.com> wrote:

I looked further into the scenario of adding the tables in schema to
the publication. Since in that case, the entry is added to
pg_publication_namespace instead of pg_publication_rel, the codepaths
for 'add table' and 'add tables in schema' are different. And in the
'add tables in schema' scenario, the OpenTableList function is not
called to get the relation ids. Therefore even with the proposed
patch, the data loss issue still persists in that case.

To validate this idea, I tried locking all the affected tables in the
schema just before the invalidation for those relations (in
ShareRowExclusiveLock mode).

This sounds like a reasonable approach to fix the issue. However, we
should check SET publication_object as well, especially the drop part
in it. It should not happen that we miss sending the data for ADD but
for DROP, we send data when we shouldn't have sent it.

There were few other scenarios, similar to the one you mentioned,
where the issue occurred. For example: a) When specifying a subset of
existing tables in the ALTER PUBLICATION ... SET TABLE command, the
tables that were supposed to be removed from the publication were not
locked in ShareRowExclusiveLock mode. b) The ALTER PUBLICATION ...
DROP TABLES IN SCHEMA command did not lock the relations that will be
removed from the publication in ShareRowExclusiveLock mode. Both of
these scenarios resulted in data inconsistency due to inadequate
locking. The attached patch addresses these issues.

Hi,

A couple of questions on the latest patch :

1. I see there is this logic in PublicationDropSchemas to first check
if there is a valid entry for the schema in pg_publication_namespace

psid = GetSysCacheOid2(PUBLICATIONNAMESPACEMAP,
                       Anum_pg_publication_namespace_oid,
                       ObjectIdGetDatum(schemaid),
                       ObjectIdGetDatum(pubid));
if (!OidIsValid(psid))
{
    if (missing_ok)
        continue;

    ereport(ERROR,
            (errcode(ERRCODE_UNDEFINED_OBJECT),
             errmsg("tables from schema \"%s\" are not part of the publication",
                    get_namespace_name(schemaid))));
}

Your proposed change locks the schemaRels before this code block.
Would it be better to lock the schemaRels after the error check? That
way, if the schema is no longer part of the publication, the lock is
not held unnecessarily on all its tables.

2. The function publication_add_schema explicitly invalidates cache by
calling InvalidatePublicationRels(schemaRels). That is not present in
the current PublicationDropSchemas code. Is that something which
should be added in the drop scenario also? Please let me know if there
is some context that I'm missing regarding why this was not added
originally for the drop scenario.

Thanks & Regards,
Nitin Motiani
Google

#27Amit Kapila
amit.kapila16@gmail.com
In reply to: Nitin Motiani (#26)
Re: long-standing data loss bug in initial sync of logical replication

On Tue, Jul 16, 2024 at 12:48 AM Nitin Motiani <nitinmotiani@google.com> wrote:

A couple of questions on the latest patch :

1. I see there is this logic in PublicationDropSchemas to first check
if there is a valid entry for the schema in pg_publication_namespace

psid = GetSysCacheOid2(PUBLICATIONNAMESPACEMAP,
                       Anum_pg_publication_namespace_oid,
                       ObjectIdGetDatum(schemaid),
                       ObjectIdGetDatum(pubid));
if (!OidIsValid(psid))
{
    if (missing_ok)
        continue;

    ereport(ERROR,
            (errcode(ERRCODE_UNDEFINED_OBJECT),
             errmsg("tables from schema \"%s\" are not part of the publication",
                    get_namespace_name(schemaid))));
}

Your proposed change locks the schemaRels before this code block.
Would it be better to lock the schemaRels after the error check? That
way, if the schema is no longer part of the publication, the lock is
not held unnecessarily on all its tables.

Good point. It is better to lock the relations in
RemovePublicationSchemaById() where we are invalidating relcache as
well. See the response to your next point as well.

2. The function publication_add_schema explicitly invalidates cache by
calling InvalidatePublicationRels(schemaRels). That is not present in
the current PublicationDropSchemas code. Is that something which
should be added in the drop scenario also? Please let me know if there
is some context that I'm missing regarding why this was not added
originally for the drop scenario.

The required invalidation happens in the function
RemovePublicationSchemaById(). So, we should lock in
RemovePublicationSchemaById() as that would avoid calling
GetSchemaPublicationRelations() multiple times.

One related comment:
@@ -1219,8 +1219,14 @@ AlterPublicationTables(AlterPublicationStmt
*stmt, HeapTuple tup,
  oldrel = palloc(sizeof(PublicationRelInfo));
  oldrel->whereClause = NULL;
  oldrel->columns = NIL;
+
+ /*
+ * Data loss due to concurrency issues are avoided by locking
+ * the relation in ShareRowExclusiveLock as described atop
+ * OpenTableList.
+ */
  oldrel->relation = table_open(oldrelid,
-   ShareUpdateExclusiveLock);
+   ShareRowExclusiveLock);

Isn't it better to lock the required relations in RemovePublicationRelById()?

--
With Regards,
Amit Kapila.

#28Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#27)
Re: long-standing data loss bug in initial sync of logical replication

On Tue, Jul 16, 2024 at 9:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

One related comment:
@@ -1219,8 +1219,14 @@ AlterPublicationTables(AlterPublicationStmt
*stmt, HeapTuple tup,
oldrel = palloc(sizeof(PublicationRelInfo));
oldrel->whereClause = NULL;
oldrel->columns = NIL;
+
+ /*
+ * Data loss due to concurrency issues are avoided by locking
+ * the relation in ShareRowExclusiveLock as described atop
+ * OpenTableList.
+ */
oldrel->relation = table_open(oldrelid,
-   ShareUpdateExclusiveLock);
+   ShareRowExclusiveLock);

Isn't it better to lock the required relations in RemovePublicationRelById()?

On my CentOS VM, the test file '100_bugs.pl' takes ~11s without a
patch and ~13.3s with a patch. So, 2 to 2.3s additional time for newly
added tests. It isn't worth adding this much extra time for one bug
fix. Can we combine table and schema tests into one single test and
avoid inheritance table tests as the code for those will mostly follow
the same path as a regular table?

--
With Regards,
Amit Kapila.

#29vignesh C
vignesh21@gmail.com
In reply to: Amit Kapila (#28)
1 attachment(s)
Re: long-standing data loss bug in initial sync of logical replication

On Tue, 16 Jul 2024 at 11:59, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jul 16, 2024 at 9:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

One related comment:
@@ -1219,8 +1219,14 @@ AlterPublicationTables(AlterPublicationStmt
*stmt, HeapTuple tup,
oldrel = palloc(sizeof(PublicationRelInfo));
oldrel->whereClause = NULL;
oldrel->columns = NIL;
+
+ /*
+ * Data loss due to concurrency issues are avoided by locking
+ * the relation in ShareRowExclusiveLock as described atop
+ * OpenTableList.
+ */
oldrel->relation = table_open(oldrelid,
-   ShareUpdateExclusiveLock);
+   ShareRowExclusiveLock);

Isn't it better to lock the required relations in RemovePublicationRelById()?

On my CentOS VM, the test file '100_bugs.pl' takes ~11s without a
patch and ~13.3s with a patch. So, 2 to 2.3s additional time for newly
added tests. It isn't worth adding this much extra time for one bug
fix. Can we combine table and schema tests into one single test and
avoid inheritance table tests as the code for those will mostly follow
the same path as a regular table?

Yes, that is better. The attached v6 version patch has the changes for the same.
The patch also addresses the comments from [1].

[1]: /messages/by-id/CAA4eK1LZDW2AVDYFZdZcvmsKVGajH2-gZmjXr9BsYiy8ct_fEw@mail.gmail.com

Regards,
Vignesh

Attachments:

v6-0001-Fix-data-loss-during-initial-sync-in-logical-repl.patch (text/x-patch)
From f09ac0daf8914a264a710fb27983560086a97742 Mon Sep 17 00:00:00 2001
From: Vignesh C <vignesh21@gmail.com>
Date: Tue, 9 Jul 2024 19:23:10 +0530
Subject: [PATCH v6] Fix data loss during initial sync in logical replication.

Previously, when adding tables to a publication in PostgreSQL, they were
locked using ShareUpdateExclusiveLock mode. This mode allowed the lock to
succeed even if there were ongoing DML transactions on that table. As a
consequence, the ALTER PUBLICATION command could complete while such
transactions were still in progress, and those transactions were then decoded
using a catalog snapshot that predated the publication change, so their
changes were not replicated.

To fix this issue, tables are now locked using ShareRowExclusiveLock mode
during the addition to a publication. This change ensures that the
ALTER PUBLICATION command waits for any ongoing transactions on the tables
(to be added to the publication) to be completed before proceeding. As a
result, transactions initiated before the publication alteration are
correctly included in the replication process.

The same issue arose with operations such as a) ALTER PUBLICATION ... DROP
TABLE, b) ALTER PUBLICATION ... SET TABLE, c) ALTER PUBLICATION ... ADD
TABLES IN SCHEMA, d) ALTER PUBLICATION ... SET TABLES IN SCHEMA and
e) ALTER PUBLICATION ... DROP TABLES IN SCHEMA. This occurred due to
tables not being locked during the ALTER PUBLICATION process.

To address this, the tables of the publication are now locked using
ShareRowExclusiveLock mode during the ALTER PUBLICATION command. This
modification ensures that the ALTER PUBLICATION command waits until
ongoing transactions are completed before proceeding.

Reported-by: Tomas Vondra
Diagnosed-by: Andres Freund
Author: Vignesh C, Tomas Vondra
Reviewed-by: Amit Kapila
Backpatch-through: 12
Discussion: https://postgr.es/m/de52b282-1166-1180-45a2-8d8917ca74c6@enterprisedb.com
---
 src/backend/catalog/pg_publication.c   |   9 ++
 src/backend/commands/publicationcmds.c |  31 ++++-
 src/test/subscription/t/100_bugs.pl    | 156 +++++++++++++++++++++++++
 3 files changed, 191 insertions(+), 5 deletions(-)

diff --git a/src/backend/catalog/pg_publication.c b/src/backend/catalog/pg_publication.c
index 0602398a54..a7c257a994 100644
--- a/src/backend/catalog/pg_publication.c
+++ b/src/backend/catalog/pg_publication.c
@@ -32,6 +32,7 @@
 #include "catalog/pg_type.h"
 #include "commands/publicationcmds.h"
 #include "funcapi.h"
+#include "storage/lmgr.h"
 #include "utils/array.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
@@ -677,6 +678,14 @@ publication_add_schema(Oid pubid, Oid schemaid, bool if_not_exists)
 	 */
 	schemaRels = GetSchemaPublicationRelations(schemaid,
 											   PUBLICATION_PART_ALL);
+
+	/*
+	 * Data loss due to concurrency issues are avoided by locking the relation
+	 * in ShareRowExclusiveLock as described atop OpenTableList.
+	 */
+	foreach_oid(schrelid, schemaRels)
+		LockRelationOid(schrelid, ShareRowExclusiveLock);
+
 	InvalidatePublicationRels(schemaRels);
 
 	return myself;
diff --git a/src/backend/commands/publicationcmds.c b/src/backend/commands/publicationcmds.c
index 6ea709988e..9d9b5f6af9 100644
--- a/src/backend/commands/publicationcmds.c
+++ b/src/backend/commands/publicationcmds.c
@@ -1466,6 +1466,13 @@ RemovePublicationRelById(Oid proid)
 	relids = GetPubPartitionOptionRelations(relids, PUBLICATION_PART_ALL,
 											pubrel->prrelid);
 
+	/*
+	 * Data loss due to concurrency issues are avoided by locking the relation
+	 * in ShareRowExclusiveLock as described atop OpenTableList.
+	 */
+	foreach_oid(relid, relids)
+		LockRelationOid(relid, ShareRowExclusiveLock);
+
 	InvalidatePublicationRels(relids);
 
 	CatalogTupleDelete(rel, &tup->t_self);
@@ -1531,6 +1538,14 @@ RemovePublicationSchemaById(Oid psoid)
 	 */
 	schemaRels = GetSchemaPublicationRelations(pubsch->pnnspid,
 											   PUBLICATION_PART_ALL);
+
+	/*
+	 * Data loss due to concurrency issues are avoided by locking the relation
+	 * in ShareRowExclusiveLock as described atop OpenTableList.
+	 */
+	foreach_oid(schrelid, schemaRels)
+		LockRelationOid(schrelid, ShareRowExclusiveLock);
+
 	InvalidatePublicationRels(schemaRels);
 
 	CatalogTupleDelete(rel, &tup->t_self);
@@ -1542,8 +1557,14 @@ RemovePublicationSchemaById(Oid psoid)
 
 /*
  * Open relations specified by a PublicationTable list.
- * The returned tables are locked in ShareUpdateExclusiveLock mode in order to
- * add them to a publication.
+ *
+ * The returned tables are locked in ShareRowExclusiveLock mode to add them
+ * to a publication. The table needs to be locked in ShareRowExclusiveLock
+ * mode to ensure that any ongoing transactions involving that table are
+ * completed before adding it to the publication. Otherwise, the transaction
+ * initiated before the alteration of the publication will continue to use a
+ * catalog snapshot predating the publication change, leading to
+ * non-replication of these transaction changes.
  */
 static List *
 OpenTableList(List *tables)
@@ -1568,7 +1589,7 @@ OpenTableList(List *tables)
 		/* Allow query cancel in case this takes a long time */
 		CHECK_FOR_INTERRUPTS();
 
-		rel = table_openrv(t->relation, ShareUpdateExclusiveLock);
+		rel = table_openrv(t->relation, ShareRowExclusiveLock);
 		myrelid = RelationGetRelid(rel);
 
 		/*
@@ -1594,7 +1615,7 @@ OpenTableList(List *tables)
 						 errmsg("conflicting or redundant column lists for table \"%s\"",
 								RelationGetRelationName(rel))));
 
-			table_close(rel, ShareUpdateExclusiveLock);
+			table_close(rel, ShareRowExclusiveLock);
 			continue;
 		}
 
@@ -1622,7 +1643,7 @@ OpenTableList(List *tables)
 			List	   *children;
 			ListCell   *child;
 
-			children = find_all_inheritors(myrelid, ShareUpdateExclusiveLock,
+			children = find_all_inheritors(myrelid, ShareRowExclusiveLock,
 										   NULL);
 
 			foreach(child, children)
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index cb36ca7b16..04c75b7806 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -487,6 +487,162 @@ $result =
 is( $result, qq(2|f
 3|t), 'check replicated update on subscriber');
 
+# Clean up
+$node_publisher->safe_psql('postgres', qq(DROP  PUBLICATION pub1;));
+$node_subscriber->safe_psql('postgres', qq(DROP  SUBSCRIPTION sub1;));
+
+# The bug was that incremental data synchronization was skipped when a new
+# table was added to the publication in the presence of a concurrent active
+# transaction performing DML on the same table.
+
+# Initial setup.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE SCHEMA sch3;
+	CREATE TABLE sch3.tab_conc(a int);
+	CREATE SCHEMA sch4;
+	CREATE TABLE sch4.tab_conc(a int);
+	CREATE PUBLICATION regress_pub1;
+));
+
+$node_subscriber->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE SCHEMA sch3;
+	CREATE TABLE sch3.tab_conc(a int);
+	CREATE SCHEMA sch4;
+	CREATE TABLE sch4.tab_conc(a int);
+	CREATE SUBSCRIPTION regress_sub1 CONNECTION '$publisher_connstr' PUBLICATION regress_pub1;
+));
+
+# Bump the query timeout to avoid false negatives on slow test systems.
+my $psql_timeout_secs = 4 * $PostgreSQL::Test::Utils::timeout_default;
+
+# Initiate 3 background sessions.
+my $background_psql1 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+$background_psql1->set_query_timer_restart();
+
+my $background_psql2 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+
+$background_psql2->set_query_timer_restart();
+
+my $background_psql3 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+$background_psql3->set_query_timer_restart();
+
+# Maintain an active transaction with the table that will be added to the
+# publication.
+$background_psql1->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO tab_conc VALUES (1);
+]);
+
+# Maintain an active transaction with a schema table that will be added to the
+# publication.
+$background_psql2->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (1);
+]);
+
+# Add the table and schemas to the publication using background_psql, as
+# ALTER PUBLICATION will block until the open transactions are committed.
+$background_psql3->query_until(qr//,
+	"ALTER PUBLICATION regress_pub1 ADD TABLE tab_conc, TABLES IN SCHEMA sch4, sch3;\n"
+);
+
+# Verify that the table addition is waiting to acquire a ShareRowExclusiveLock.
+$node_publisher->poll_query_until('postgres',
+	"SELECT COUNT(1) = 1 FROM pg_locks WHERE relation = 'tab_conc'::regclass AND mode = 'ShareRowExclusiveLock' AND waitstart IS NOT NULL;"
+  )
+  or die
+  "Timed out while waiting for alter publication tries to wait on ShareRowExclusiveLock";
+
+# Complete the transaction on tab_conc, so that ALTER PUBLICATION can proceed
+# further to acquire locks on the schema tables.
+$background_psql1->query_safe(qq[COMMIT]);
+
+# Verify that ShareRowExclusiveLock is acquired on sch4.tab_conc for which
+# there is no on-going transaction.
+$node_publisher->poll_query_until('postgres',
+	"SELECT COUNT(1) = 1 FROM pg_locks WHERE relation = (SELECT oid FROM pg_class WHERE relname = 'tab_conc' AND relnamespace = 'sch4'::regnamespace) AND mode = 'ShareRowExclusiveLock' AND waitstart IS NULL;"
+  )
+  or die
+  "Timed out while waiting for alter publication tries to wait on ShareRowExclusiveLock";
+
+# Verify that the schema addition is waiting to acquire a ShareRowExclusiveLock
+# on sch3.tab_conc, which has an ongoing transaction.
+$node_publisher->poll_query_until('postgres',
+	"SELECT COUNT(1) = 1 FROM pg_locks WHERE relation = (SELECT oid FROM pg_class WHERE relname = 'tab_conc' AND relnamespace = 'sch3'::regnamespace) AND mode = 'ShareRowExclusiveLock' AND waitstart IS NOT NULL;"
+  )
+  or die
+  "Timed out while waiting for alter publication tries to wait on ShareRowExclusiveLock";
+
+$background_psql2->query_safe(qq[COMMIT]);
+
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (2);
+	INSERT INTO sch3.tab_conc VALUES (2);
+));
+
+# Refresh the publication.
+$node_subscriber->safe_psql('postgres',
+	'ALTER SUBSCRIPTION regress_sub1 REFRESH PUBLICATION');
+
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'regress_sub1');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2),
+	'Ensure that the data from the tab_conc table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2),
+	'Ensure that the data from the sch3.tab_conc table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+# Perform an insert.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (3);
+	INSERT INTO sch3.tab_conc VALUES (3);
+));
+$node_publisher->wait_for_catchup('regress_sub1');
+
+# Verify that the insert is replicated to the subscriber.
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table sch3.tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+$background_psql1->quit;
+$background_psql2->quit;
+$background_psql3->quit;
+
 $node_publisher->stop('fast');
 $node_subscriber->stop('fast');
 
-- 
2.34.1

#30Amit Kapila
amit.kapila16@gmail.com
In reply to: vignesh C (#29)
1 attachment(s)
Re: long-standing data loss bug in initial sync of logical replication

On Tue, Jul 16, 2024 at 6:54 PM vignesh C <vignesh21@gmail.com> wrote:

On Tue, 16 Jul 2024 at 11:59, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jul 16, 2024 at 9:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

One related comment:
@@ -1219,8 +1219,14 @@ AlterPublicationTables(AlterPublicationStmt
*stmt, HeapTuple tup,
oldrel = palloc(sizeof(PublicationRelInfo));
oldrel->whereClause = NULL;
oldrel->columns = NIL;
+
+ /*
+ * Data loss due to concurrency issues are avoided by locking
+ * the relation in ShareRowExclusiveLock as described atop
+ * OpenTableList.
+ */
oldrel->relation = table_open(oldrelid,
-   ShareUpdateExclusiveLock);
+   ShareRowExclusiveLock);

Isn't it better to lock the required relations in RemovePublicationRelById()?

On my CentOS VM, the test file '100_bugs.pl' takes ~11s without a
patch and ~13.3s with a patch. So, 2 to 2.3s additional time for newly
added tests. It isn't worth adding this much extra time for one bug
fix. Can we combine table and schema tests into one single test and
avoid inheritance table tests as the code for those will mostly follow
the same path as a regular table?

Yes, that is better. The attached v6 version patch has the changes for the same.
The patch also addresses the comments from [1].

Thanks, I don't see any noticeable difference in test timing with new
tests. I have slightly modified the comments in the attached diff
patch (please rename it to .patch).

BTW, I noticed that we don't take any table-level locks for Create
Publication .. For ALL TABLES (and Drop Publication). Can that create
a similar problem? I haven't tested so not sure but even if there is a
problem for the Create case, it should lead to some ERROR like missing
publication.

--
With Regards,
Amit Kapila.

Attachments:

v6-topup-amit.patch.txt (text/plain)
diff --git a/src/backend/catalog/pg_publication.c b/src/backend/catalog/pg_publication.c
index a7c257a994..a274ec0f7e 100644
--- a/src/backend/catalog/pg_publication.c
+++ b/src/backend/catalog/pg_publication.c
@@ -680,8 +680,8 @@ publication_add_schema(Oid pubid, Oid schemaid, bool if_not_exists)
 											   PUBLICATION_PART_ALL);
 
 	/*
-	 * Data loss due to concurrency issues are avoided by locking the relation
-	 * in ShareRowExclusiveLock as described atop OpenTableList.
+	 * Lock the tables so that concurrent transactions don't miss replicating
+	 * the changes. See comments atop OpenTableList for further details.
 	 */
 	foreach_oid(schrelid, schemaRels)
 		LockRelationOid(schrelid, ShareRowExclusiveLock);
diff --git a/src/backend/commands/publicationcmds.c b/src/backend/commands/publicationcmds.c
index 9d9b5f6af9..95f83d5563 100644
--- a/src/backend/commands/publicationcmds.c
+++ b/src/backend/commands/publicationcmds.c
@@ -1467,8 +1467,8 @@ RemovePublicationRelById(Oid proid)
 											pubrel->prrelid);
 
 	/*
-	 * Data loss due to concurrency issues are avoided by locking the relation
-	 * in ShareRowExclusiveLock as described atop OpenTableList.
+	 * Lock the tables to avoid concurrent transactions from replicating the
+	 * changes. See comments atop OpenTableList for further details.
 	 */
 	foreach_oid(relid, relids)
 		LockRelationOid(relid, ShareRowExclusiveLock);
@@ -1540,8 +1540,8 @@ RemovePublicationSchemaById(Oid psoid)
 											   PUBLICATION_PART_ALL);
 
 	/*
-	 * Data loss due to concurrency issues are avoided by locking the relation
-	 * in ShareRowExclusiveLock as described atop OpenTableList.
+	 * Lock the tables to avoid concurrent transactions from replicating the
+	 * changes. See comments atop OpenTableList for further details.
 	 */
 	foreach_oid(schrelid, schemaRels)
 		LockRelationOid(schrelid, ShareRowExclusiveLock);
#31vignesh C
vignesh21@gmail.com
In reply to: Amit Kapila (#30)
Re: long-standing data loss bug in initial sync of logical replication

On Wed, 17 Jul 2024 at 11:54, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jul 16, 2024 at 6:54 PM vignesh C <vignesh21@gmail.com> wrote:

On Tue, 16 Jul 2024 at 11:59, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jul 16, 2024 at 9:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

One related comment:
@@ -1219,8 +1219,14 @@ AlterPublicationTables(AlterPublicationStmt
*stmt, HeapTuple tup,
oldrel = palloc(sizeof(PublicationRelInfo));
oldrel->whereClause = NULL;
oldrel->columns = NIL;
+
+ /*
+ * Data loss due to concurrency issues are avoided by locking
+ * the relation in ShareRowExclusiveLock as described atop
+ * OpenTableList.
+ */
oldrel->relation = table_open(oldrelid,
-   ShareUpdateExclusiveLock);
+   ShareRowExclusiveLock);

Isn't it better to lock the required relations in RemovePublicationRelById()?

On my CentOS VM, the test file '100_bugs.pl' takes ~11s without a
patch and ~13.3s with a patch. So, 2 to 2.3s additional time for newly
added tests. It isn't worth adding this much extra time for one bug
fix. Can we combine table and schema tests into one single test and
avoid inheritance table tests as the code for those will mostly follow
the same path as a regular table?

Yes, that is better. The attached v6 version patch has the changes for the same.
The patch also addresses the comments from [1].

Thanks, I don't see any noticeable difference in test timing with new
tests. I have slightly modified the comments in the attached diff
patch (please rename it to .patch).

BTW, I noticed that we don't take any table-level locks for Create
Publication .. For ALL TABLES (and Drop Publication). Can that create
a similar problem? I haven't tested so not sure but even if there is a
problem for the Create case, it should lead to some ERROR like missing
publication.

I tested these scenarios, and as you expected, it throws an error for
the create publication case:
2024-07-17 14:50:01.145 IST [481526] 481526 ERROR: could not receive
data from WAL stream: ERROR: publication "pub1" does not exist
CONTEXT: slot "sub1", output plugin "pgoutput", in the change
callback, associated LSN 0/1510CD8
2024-07-17 14:50:01.147 IST [481450] 481450 LOG: background worker
"logical replication apply worker" (PID 481526) exited with exit code
1

The steps for this process are as follows:
1) Create tables in both the publisher and subscriber.
2) On the publisher: Create a replication slot.
3) On the subscriber: Create a subscription using the slot created by
the publisher.
4) On the publisher:
4.a) Session 1: BEGIN; INSERT INTO T1;
4.b) Session 2: CREATE PUBLICATION FOR ALL TABLES
4.c) Session 1: COMMIT;

Since we are throwing out a "publication does not exist" error, there
is no inconsistency issue here.
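
For reference, here is a minimal SQL sketch of steps 4.a to 4.c above (the
table name t1 and publication name pub1 are only illustrative; the slot and
subscription are assumed to be set up as in steps 1 to 3):

-- Session 1 (publisher): start a transaction that modifies t1
BEGIN;
INSERT INTO t1 VALUES (1);

-- Session 2 (publisher): create the publication while session 1 is still open
CREATE PUBLICATION pub1 FOR ALL TABLES;

-- Session 1 (publisher): commit the concurrent transaction
COMMIT;

-- Decoding session 1's transaction uses a catalog snapshot from before pub1
-- existed, so the walsender reports the "publication does not exist" error
-- shown above.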

However, an issue persists with DROP ALL TABLES publication, where
data continues to replicate even after the publication is dropped.
This happens because the open transaction consumes the invalidation,
causing the publications to be revalidated using old snapshot. As a
result, both the open transactions and the subsequent transactions are
getting replicated.

We can reproduce this issue by following these steps in a logical
replication setup with an "ALL TABLES" publication:
On the publisher:
Session 1: BEGIN; INSERT INTO T1 VALUES (val1);
In another session on the publisher:
Session 2: DROP PUBLICATION
Back in Session 1 on the publisher:
COMMIT;
Finally, in Session 1 on the publisher:
INSERT INTO T1 VALUES (val2);

Even after dropping the publication, both val1 and val2 are still
being replicated to the subscriber. This means that both the
in-progress concurrent transaction and the subsequent transactions are
being replicated.
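
The same drop scenario as a SQL sketch (the publication name pub_all and
table name t1 are illustrative; an existing ALL TABLES publication and a
working subscription to it are assumed):

-- Session 1 (publisher): open a transaction and insert val1
BEGIN;
INSERT INTO t1 VALUES (1);

-- Session 2 (publisher): drop the publication while session 1 is still open
DROP PUBLICATION pub_all;

-- Session 1 (publisher): commit; this change is still replicated
COMMIT;

-- Session 1 (publisher): this later insert (val2) is also still replicated,
-- because the in-progress transaction consumed the invalidation and the
-- publication list was rebuilt from the old snapshot, as described above
INSERT INTO t1 VALUES (2);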

I don't think locking all tables is a viable solution in this case, as
it would require asking the user to refrain from performing any
operations on any of the tables in the database while creating a
publication.

Thoughts?

Regards,
Vignesh

#32Nitin Motiani
nitinmotiani@google.com
In reply to: vignesh C (#31)
Re: long-standing data loss bug in initial sync of logical replication

On Wed, Jul 17, 2024 at 5:25 PM vignesh C <vignesh21@gmail.com> wrote:

On Wed, 17 Jul 2024 at 11:54, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jul 16, 2024 at 6:54 PM vignesh C <vignesh21@gmail.com> wrote:

On Tue, 16 Jul 2024 at 11:59, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jul 16, 2024 at 9:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

One related comment:
@@ -1219,8 +1219,14 @@ AlterPublicationTables(AlterPublicationStmt
*stmt, HeapTuple tup,
oldrel = palloc(sizeof(PublicationRelInfo));
oldrel->whereClause = NULL;
oldrel->columns = NIL;
+
+ /*
+ * Data loss due to concurrency issues are avoided by locking
+ * the relation in ShareRowExclusiveLock as described atop
+ * OpenTableList.
+ */
oldrel->relation = table_open(oldrelid,
-   ShareUpdateExclusiveLock);
+   ShareRowExclusiveLock);

Isn't it better to lock the required relations in RemovePublicationRelById()?

On my CentOS VM, the test file '100_bugs.pl' takes ~11s without a
patch and ~13.3s with a patch. So, 2 to 2.3s additional time for newly
added tests. It isn't worth adding this much extra time for one bug
fix. Can we combine table and schema tests into one single test and
avoid inheritance table tests as the code for those will mostly follow
the same path as a regular table?

Yes, that is better. The attached v6 version patch has the changes for the same.
The patch also addresses the comments from [1].

Thanks, I don't see any noticeable difference in test timing with new
tests. I have slightly modified the comments in the attached diff
patch (please rename it to .patch).

BTW, I noticed that we don't take any table-level locks for Create
Publication .. For ALL TABLES (and Drop Publication). Can that create
a similar problem? I haven't tested so not sure but even if there is a
problem for the Create case, it should lead to some ERROR like missing
publication.

I tested these scenarios, and as you expected, it throws an error for
the create publication case:
2024-07-17 14:50:01.145 IST [481526] 481526 ERROR: could not receive
data from WAL stream: ERROR: publication "pub1" does not exist
CONTEXT: slot "sub1", output plugin "pgoutput", in the change
callback, associated LSN 0/1510CD8
2024-07-17 14:50:01.147 IST [481450] 481450 LOG: background worker
"logical replication apply worker" (PID 481526) exited with exit code
1

The steps for this process are as follows:
1) Create tables in both the publisher and subscriber.
2) On the publisher: Create a replication slot.
3) On the subscriber: Create a subscription using the slot created by
the publisher.
4) On the publisher:
4.a) Session 1: BEGIN; INSERT INTO T1;
4.b) Session 2: CREATE PUBLICATION FOR ALL TABLES
4.c) Session 1: COMMIT;

Since we are throwing out a "publication does not exist" error, there
is no inconsistency issue here.

However, an issue persists with DROP ALL TABLES publication, where
data continues to replicate even after the publication is dropped.
This happens because the open transaction consumes the invalidation,
causing the publications to be revalidated using old snapshot. As a
result, both the open transactions and the subsequent transactions are
getting replicated.

We can reproduce this issue by following these steps in a logical
replication setup with an "ALL TABLES" publication:
On the publisher:
Session 1: BEGIN; INSERT INTO T1 VALUES (val1);
In another session on the publisher:
Session 2: DROP PUBLICATION
Back in Session 1 on the publisher:
COMMIT;
Finally, in Session 1 on the publisher:
INSERT INTO T1 VALUES (val2);

Even after dropping the publication, both val1 and val2 are still
being replicated to the subscriber. This means that both the
in-progress concurrent transaction and the subsequent transactions are
being replicated.

Hi,

I tried the 'DROP PUBLICATION' command even for a publication with a
single table. And there also the data continues to get replicated.

To test this, I did a similar experiment as the above but instead of
creating publication on all tables, I did it for one specific table.

Here are the steps :
1. Create table test_1 and test_2 on both the publisher and subscriber
instances.
2. Create publication p for table test_1 on the publisher.
3. Create a subscription s which subscribes to p.
4. On the publisher
4a) Session 1 : BEGIN; INSERT INTO test_1 VALUES(val1);
4b) Session 2 : DROP PUBLICATION p;
4c) Session 1 : Commit;
5. On the publisher : INSERT INTO test_1 VALUES(val2);

After these, when I check the subscriber, both val1 and val2 have been
replicated. I tried a few more inserts on publisher after this and
they all got replicated to the subscriber. Only after explicitly
creating a new publication p2 for test_1 on the publisher, the
replication stopped. Most likely because the create publication
command invalidated the cache.
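
A SQL sketch of the single-table variant, using the names from the steps
above (publication p on table test_1):

-- Session 1 (publisher)
BEGIN;
INSERT INTO test_1 VALUES (1);

-- Session 2 (publisher)
DROP PUBLICATION p;

-- Session 1 (publisher)
COMMIT;
INSERT INTO test_1 VALUES (2);

-- Both rows (and further inserts) keep showing up on the subscriber until
-- something else, such as the later CREATE PUBLICATION p2, happens to
-- invalidate the cached publication information.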

My guess is that this issue probably comes from the fact that
RemoveObjects in dropcmds.c doesn't do any special handling or
invalidation for the object drop command.

Please let me know if I'm missing something in my setup or if my
understanding of the drop commands is wrong.

Thanks

#33Nitin Motiani
nitinmotiani@google.com
In reply to: Nitin Motiani (#32)
Re: long-standing data loss bug in initial sync of logical replication

On Thu, Jul 18, 2024 at 3:05 PM Nitin Motiani <nitinmotiani@google.com> wrote:

On Wed, Jul 17, 2024 at 5:25 PM vignesh C <vignesh21@gmail.com> wrote:

I tested these scenarios, and as you expected, it throws an error for
the create publication case:
2024-07-17 14:50:01.145 IST [481526] 481526 ERROR: could not receive
data from WAL stream: ERROR: publication "pub1" does not exist
CONTEXT: slot "sub1", output plugin "pgoutput", in the change
callback, associated LSN 0/1510CD8
2024-07-17 14:50:01.147 IST [481450] 481450 LOG: background worker
"logical replication apply worker" (PID 481526) exited with exit code
1

The steps for this process are as follows:
1) Create tables in both the publisher and subscriber.
2) On the publisher: Create a replication slot.
3) On the subscriber: Create a subscription using the slot created by
the publisher.
4) On the publisher:
4.a) Session 1: BEGIN; INSERT INTO T1;
4.b) Session 2: CREATE PUBLICATION FOR ALL TABLES
4.c) Session 1: COMMIT;

Since we are throwing out a "publication does not exist" error, there
is no inconsistency issue here.

However, an issue persists with DROP ALL TABLES publication, where
data continues to replicate even after the publication is dropped.
This happens because the open transaction consumes the invalidation,
causing the publications to be revalidated using old snapshot. As a
result, both the open transactions and the subsequent transactions are
getting replicated.

We can reproduce this issue by following these steps in a logical
replication setup with an "ALL TABLES" publication:
On the publisher:
Session 1: BEGIN; INSERT INTO T1 VALUES (val1);
In another session on the publisher:
Session 2: DROP PUBLICATION
Back in Session 1 on the publisher:
COMMIT;
Finally, in Session 1 on the publisher:
INSERT INTO T1 VALUES (val2);

Even after dropping the publication, both val1 and val2 are still
being replicated to the subscriber. This means that both the
in-progress concurrent transaction and the subsequent transactions are
being replicated.

Hi,

I tried the 'DROP PUBLICATION' command even for a publication with a
single table. And there also the data continues to get replicated.

To test this, I did a similar experiment as the above but instead of
creating publication on all tables, I did it for one specific table.

Here are the steps :
1. Create table test_1 and test_2 on both the publisher and subscriber
instances.
2. Create publication p for table test_1 on the publisher.
3. Create a subscription s which subscribes to p.
4. On the publisher
4a) Session 1 : BEGIN; INSERT INTO test_1 VALUES(val1);
4b) Session 2 : DROP PUBLICATION p;
4c) Session 1 : Commit;
5. On the publisher : INSERT INTO test_1 VALUES(val2);

After these, when I check the subscriber, both val1 and val2 have been
replicated. I tried a few more inserts on publisher after this and
they all got replicated to the subscriber. Only after explicitly
creating a new publication p2 for test_1 on the publisher, the
replication stopped. Most likely because the create publication
command invalidated the cache.

My guess is that this issue probably comes from the fact that
RemoveObjects in dropcmds.c doesn't do any special handling or
invalidation for the object drop command.

I checked further and I see that RemovePublicationById does do cache
invalidation but it is only done in the scenario when the publication
is on all tables. This is done without taking any locks. But for the
other cases (eg. publication on one table), I don't see any cache
invalidation in RemovePublicationById. That would explain why the
replication kept happening for multiple transactions after the drop
publication command in my example..

Thanks & Regards
Nitin Motiani
Google

#34Nitin Motiani
nitinmotiani@google.com
In reply to: Nitin Motiani (#33)
Re: long-standing data loss bug in initial sync of logical replication

On Thu, Jul 18, 2024 at 3:25 PM Nitin Motiani <nitinmotiani@google.com> wrote:

On Thu, Jul 18, 2024 at 3:05 PM Nitin Motiani <nitinmotiani@google.com> wrote:

On Wed, Jul 17, 2024 at 5:25 PM vignesh C <vignesh21@gmail.com> wrote:

I tested these scenarios, and as you expected, it throws an error for
the create publication case:
2024-07-17 14:50:01.145 IST [481526] 481526 ERROR: could not receive
data from WAL stream: ERROR: publication "pub1" does not exist
CONTEXT: slot "sub1", output plugin "pgoutput", in the change
callback, associated LSN 0/1510CD8
2024-07-17 14:50:01.147 IST [481450] 481450 LOG: background worker
"logical replication apply worker" (PID 481526) exited with exit code
1

The steps for this process are as follows:
1) Create tables in both the publisher and subscriber.
2) On the publisher: Create a replication slot.
3) On the subscriber: Create a subscription using the slot created by
the publisher.
4) On the publisher:
4.a) Session 1: BEGIN; INSERT INTO T1;
4.b) Session 2: CREATE PUBLICATION FOR ALL TABLES
4.c) Session 1: COMMIT;

Since we are throwing out a "publication does not exist" error, there
is no inconsistency issue here.

However, an issue persists with DROP ALL TABLES publication, where
data continues to replicate even after the publication is dropped.
This happens because the open transaction consumes the invalidation,
causing the publications to be revalidated using old snapshot. As a
result, both the open transactions and the subsequent transactions are
getting replicated.

We can reproduce this issue by following these steps in a logical
replication setup with an "ALL TABLES" publication:
On the publisher:
Session 1: BEGIN; INSERT INTO T1 VALUES (val1);
In another session on the publisher:
Session 2: DROP PUBLICATION
Back in Session 1 on the publisher:
COMMIT;
Finally, in Session 1 on the publisher:
INSERT INTO T1 VALUES (val2);

Even after dropping the publication, both val1 and val2 are still
being replicated to the subscriber. This means that both the
in-progress concurrent transaction and the subsequent transactions are
being replicated.

Hi,

I tried the 'DROP PUBLICATION' command even for a publication with a
single table. And there also the data continues to get replicated.

To test this, I did a similar experiment as the above but instead of
creating publication on all tables, I did it for one specific table.

Here are the steps :
1. Create table test_1 and test_2 on both the publisher and subscriber
instances.
2. Create publication p for table test_1 on the publisher.
3. Create a subscription s which subscribes to p.
4. On the publisher
4a) Session 1 : BEGIN; INSERT INTO test_1 VALUES(val1);
4b) Session 2 : DROP PUBLICATION p;
4c) Session 1 : Commit;
5. On the publisher : INSERT INTO test_1 VALUES(val2);

After these, when I check the subscriber, both val1 and val2 have been
replicated. I tried a few more inserts on publisher after this and
they all got replicated to the subscriber. Only after explicitly
creating a new publication p2 for test_1 on the publisher, the
replication stopped. Most likely because the create publication
command invalidated the cache.

My guess is that this issue probably comes from the fact that
RemoveObjects in dropcmds.c doesn't do any special handling or
invalidation for the object drop command.

I checked further and I see that RemovePublicationById does do cache
invalidation but it is only done in the scenario when the publication
is on all tables. This is done without taking any locks. But for the
other cases (eg. publication on one table), I don't see any cache
invalidation in RemovePublicationById. That would explain why the
replication kept happening for multiple transactions after the drop
publication command in my example..

Sorry, I missed that for the individual table scenario, the
invalidation would happen in RemovePublicationRelById. That is
invalidating the cache for all relids. But this is also not taking any
locks. So that would explain why dropping the publication on a single
table doesn't invalidate the cache in an ongoing transaction. I'm not
sure why the replication kept happening even in subsequent
transactions.

Either way I think the SRE lock should be taken for all relids in that
function also before the invalidations.

Thanks & Regards
Nitin Motiani
Google

#35Nitin Motiani
nitinmotiani@google.com
In reply to: Nitin Motiani (#34)
Re: long-standing data loss bug in initial sync of logical replication

On Thu, Jul 18, 2024 at 3:30 PM Nitin Motiani <nitinmotiani@google.com> wrote:

On Thu, Jul 18, 2024 at 3:25 PM Nitin Motiani <nitinmotiani@google.com> wrote:

On Thu, Jul 18, 2024 at 3:05 PM Nitin Motiani <nitinmotiani@google.com> wrote:

On Wed, Jul 17, 2024 at 5:25 PM vignesh C <vignesh21@gmail.com> wrote:

I tested these scenarios, and as you expected, it throws an error for
the create publication case:
2024-07-17 14:50:01.145 IST [481526] 481526 ERROR: could not receive
data from WAL stream: ERROR: publication "pub1" does not exist
CONTEXT: slot "sub1", output plugin "pgoutput", in the change
callback, associated LSN 0/1510CD8
2024-07-17 14:50:01.147 IST [481450] 481450 LOG: background worker
"logical replication apply worker" (PID 481526) exited with exit code
1

The steps for this process are as follows:
1) Create tables in both the publisher and subscriber.
2) On the publisher: Create a replication slot.
3) On the subscriber: Create a subscription using the slot created by
the publisher.
4) On the publisher:
4.a) Session 1: BEGIN; INSERT INTO T1;
4.b) Session 2: CREATE PUBLICATION FOR ALL TABLES
4.c) Session 1: COMMIT;

Since we are throwing out a "publication does not exist" error, there
is no inconsistency issue here.

However, an issue persists with DROP ALL TABLES publication, where
data continues to replicate even after the publication is dropped.
This happens because the open transaction consumes the invalidation,
causing the publications to be revalidated using old snapshot. As a
result, both the open transactions and the subsequent transactions are
getting replicated.

We can reproduce this issue by following these steps in a logical
replication setup with an "ALL TABLES" publication:
On the publisher:
Session 1: BEGIN; INSERT INTO T1 VALUES (val1);
In another session on the publisher:
Session 2: DROP PUBLICATION
Back in Session 1 on the publisher:
COMMIT;
Finally, in Session 1 on the publisher:
INSERT INTO T1 VALUES (val2);

Even after dropping the publication, both val1 and val2 are still
being replicated to the subscriber. This means that both the
in-progress concurrent transaction and the subsequent transactions are
being replicated.

Hi,

I tried the 'DROP PUBLICATION' command even for a publication with a
single table, and there too the data continues to get replicated.

To test this, I ran a similar experiment to the one above, but instead
of creating the publication on all tables, I created it for one
specific table.

Here are the steps:
1. Create table test_1 and test_2 on both the publisher and subscriber
instances.
2. Create publication p for table test_1 on the publisher.
3. Create a subscription s which subscribes to p.
4. On the publisher
4a) Session 1 : BEGIN; INSERT INTO test_1 VALUES(val1);
4b) Session 2 : DROP PUBLICATION p;
4c) Session 1 : Commit;
5. On the publisher : INSERT INTO test_1 VALUES(val2);

After these steps, when I check the subscriber, both val1 and val2
have been replicated. I tried a few more inserts on the publisher
after this and they all got replicated to the subscriber. Only after
explicitly creating a new publication p2 for test_1 on the publisher
did the replication stop, most likely because the CREATE PUBLICATION
command invalidated the cache.

My guess is that this issue comes from the fact that RemoveObjects in
dropcmds.c doesn't do any special handling or invalidation for the
object drop command.

I checked further and I see that RemovePublicationById does do cache
invalidation, but only in the scenario where the publication is on all
tables, and it is done without taking any locks. For the other cases
(e.g. a publication on one table), I don't see any cache invalidation
in RemovePublicationById. That would explain why the replication kept
happening for multiple transactions after the DROP PUBLICATION command
in my example.

Sorry, I missed that for the individual-table scenario, the
invalidation happens in RemovePublicationRelById. That invalidates the
cache for all relids, but it is also done without taking any locks. So
that would explain why dropping the publication on a single table
doesn't invalidate the cache in an ongoing transaction. I'm not sure
why the replication kept happening even in subsequent transactions.

Either way, I think the SRE lock should also be taken for all relids
in that function before the invalidations.

My apologies. I wasn't testing with the latest patch. I see this has
already been done in the v6 patch file.

Thanks & Regards
Nitin Motiani
Google

#36Amit Kapila
amit.kapila16@gmail.com
In reply to: vignesh C (#31)
Re: long-standing data loss bug in initial sync of logical replication

On Wed, Jul 17, 2024 at 5:25 PM vignesh C <vignesh21@gmail.com> wrote:

On Wed, 17 Jul 2024 at 11:54, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jul 16, 2024 at 6:54 PM vignesh C <vignesh21@gmail.com> wrote:

BTW, I noticed that we don't take any table-level locks for Create
Publication .. For ALL TABLES (and Drop Publication). Can that create
a similar problem? I haven't tested so not sure but even if there is a
problem for the Create case, it should lead to some ERROR like missing
publication.

I tested these scenarios, and as you expected, it throws an error for
the create publication case:
2024-07-17 14:50:01.145 IST [481526] 481526 ERROR: could not receive
data from WAL stream: ERROR: publication "pub1" does not exist
CONTEXT: slot "sub1", output plugin "pgoutput", in the change
callback, associated LSN 0/1510CD8
2024-07-17 14:50:01.147 IST [481450] 481450 LOG: background worker
"logical replication apply worker" (PID 481526) exited with exit code
1

The steps for this process are as follows:
1) Create tables in both the publisher and subscriber.
2) On the publisher: Create a replication slot.
3) On the subscriber: Create a subscription using the slot created by
the publisher.
4) On the publisher:
4.a) Session 1: BEGIN; INSERT INTO T1;
4.b) Session 2: CREATE PUBLICATION FOR ALL TABLES
4.c) Session 1: COMMIT;

Since we are throwing out a "publication does not exist" error, there
is no inconsistency issue here.

However, an issue persists with DROP ALL TABLES publication, where
data continues to replicate even after the publication is dropped.
This happens because the open transaction consumes the invalidation,
causing the publications to be revalidated using old snapshot. As a
result, both the open transactions and the subsequent transactions are
getting replicated.

We can reproduce this issue by following these steps in a logical
replication setup with an "ALL TABLES" publication:
On the publisher:
Session 1: BEGIN; INSERT INTO T1 VALUES (val1);
In another session on the publisher:
Session 2: DROP PUBLICATION
Back in Session 1 on the publisher:
COMMIT;
Finally, in Session 1 on the publisher:
INSERT INTO T1 VALUES (val2);

Even after dropping the publication, both val1 and val2 are still
being replicated to the subscriber. This means that both the
in-progress concurrent transaction and the subsequent transactions are
being replicated.

I don't think locking all tables is a viable solution in this case, as
it would require asking the user to refrain from performing any
operations on any of the tables in the database while creating a
publication.

Indeed, locking all tables in the database to prevent concurrent DMLs
for this scenario also looks odd to me. The other alternative
previously suggested by Andres is to distribute catalog modifying
transactions to all concurrent in-progress transactions [1] but as
mentioned this could add an overhead. One possibility to reduce
overhead is that we selectively distribute invalidations for
catalogs-related publications but I haven't analyzed the feasibility.

We need more opinions to decide here, so let me summarize the problem
and solutions discussed. As explained with an example in an email [1],
the problem related to logical decoding is that it doesn't process
invalidations corresponding to DDLs for the already in-progress
transactions. We discussed preventing DMLs in the first place when
concurrent DDLs like ALTER PUBLICATION ... ADD TABLE ... are in
progress. The solution discussed was to acquire
ShareUpdateExclusiveLock for all the tables being added via such
commands. Further analysis revealed that the same handling is required
for ALTER PUBLICATION ... ADD TABLES IN SCHEMA which means locking all
the tables in the specified schemas. Then DROP PUBLICATION also seems
to have similar symptoms which means in the worst case (where
publication is for ALL TABLES) we have to lock all the tables in the
database. We are not sure if that is good so the other alternative we
can pursue is to distribute invalidations in logical decoding
infrastructure [1] which has its downsides.
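As a rough illustration of that worst case (not from any posted
patch): with the lock level mentioned above, dropping a FOR ALL TABLES
publication would effectively imply the per-table locks enumerated by
a query like this:

SELECT format('LOCK TABLE %s IN SHARE UPDATE EXCLUSIVE MODE',
              c.oid::regclass)
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN ('r', 'p')
  AND n.nspname NOT IN ('pg_catalog', 'information_schema');

Just generating (not executing) those statements on a busy database
shows how many relations the command would have to lock.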

Thoughts?

[1]: /messages/by-id/20231118025445.crhaeeuvoe2g5dv6@awork3.anarazel.de

--
With Regards,
Amit Kapila.

#37Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#36)
Re: long-standing data loss bug in initial sync of logical replication

On Wed, Jul 24, 2024 at 9:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jul 17, 2024 at 5:25 PM vignesh C <vignesh21@gmail.com> wrote:

On Wed, 17 Jul 2024 at 11:54, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jul 16, 2024 at 6:54 PM vignesh C <vignesh21@gmail.com> wrote:

BTW, I noticed that we don't take any table-level locks for Create
Publication .. For ALL TABLES (and Drop Publication). Can that create
a similar problem? I haven't tested so not sure but even if there is a
problem for the Create case, it should lead to some ERROR like missing
publication.

I tested these scenarios, and as you expected, it throws an error for
the create publication case:
2024-07-17 14:50:01.145 IST [481526] 481526 ERROR: could not receive
data from WAL stream: ERROR: publication "pub1" does not exist
CONTEXT: slot "sub1", output plugin "pgoutput", in the change
callback, associated LSN 0/1510CD8
2024-07-17 14:50:01.147 IST [481450] 481450 LOG: background worker
"logical replication apply worker" (PID 481526) exited with exit code
1

The steps for this process are as follows:
1) Create tables in both the publisher and subscriber.
2) On the publisher: Create a replication slot.
3) On the subscriber: Create a subscription using the slot created by
the publisher.
4) On the publisher:
4.a) Session 1: BEGIN; INSERT INTO T1;
4.b) Session 2: CREATE PUBLICATION FOR ALL TABLES
4.c) Session 1: COMMIT;

Since we are throwing out a "publication does not exist" error, there
is no inconsistency issue here.

However, an issue persists with DROP ALL TABLES publication, where
data continues to replicate even after the publication is dropped.
This happens because the open transaction consumes the invalidation,
causing the publications to be revalidated using old snapshot. As a
result, both the open transactions and the subsequent transactions are
getting replicated.

We can reproduce this issue by following these steps in a logical
replication setup with an "ALL TABLES" publication:
On the publisher:
Session 1: BEGIN; INSERT INTO T1 VALUES (val1);
In another session on the publisher:
Session 2: DROP PUBLICATION
Back in Session 1 on the publisher:
COMMIT;
Finally, in Session 1 on the publisher:
INSERT INTO T1 VALUES (val2);

Even after dropping the publication, both val1 and val2 are still
being replicated to the subscriber. This means that both the
in-progress concurrent transaction and the subsequent transactions are
being replicated.

I don't think locking all tables is a viable solution in this case, as
it would require asking the user to refrain from performing any
operations on any of the tables in the database while creating a
publication.

Indeed, locking all tables in the database to prevent concurrent DMLs
for this scenario also looks odd to me. The other alternative
previously suggested by Andres is to distribute catalog modifying
transactions to all concurrent in-progress transactions [1] but as
mentioned this could add an overhead. One possibility to reduce
overhead is that we selectively distribute invalidations for
catalogs-related publications but I haven't analyzed the feasibility.

We need more opinions to decide here, so let me summarize the problem
and solutions discussed. As explained with an example in an email [1],
the problem related to logical decoding is that it doesn't process
invalidations corresponding to DDLs for the already in-progress
transactions. We discussed preventing DMLs in the first place when
concurrent DDLs like ALTER PUBLICATION ... ADD TABLE ... are in
progress. The solution discussed was to acquire
ShareUpdateExclusiveLock for all the tables being added via such
commands. Further analysis revealed that the same handling is required
for ALTER PUBLICATION ... ADD TABLES IN SCHEMA which means locking all
the tables in the specified schemas. Then DROP PUBLICATION also seems
to have similar symptoms which means in the worst case (where
publication is for ALL TABLES) we have to lock all the tables in the
database. We are not sure if that is good so the other alternative we
can pursue is to distribute invalidations in logical decoding
infrastructure [1] which has its downsides.

Thoughts?

Thank you for summarizing the problem and solutions!

I think it's worth trying the idea of distributing invalidation
messages, and we will see if there could be overheads or any further
obstacles. IIUC this approach would resolve another issue we discussed
before too[1].

Regards,

[1]: /messages/by-id/CAD21AoAenVqiMjpN-PvGHL1N9DWnHSq673bfgr6phmBUzx=kLQ@mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#38Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#37)
Re: long-standing data loss bug in initial sync of logical replication

On Wed, Jul 31, 2024 at 3:27 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Jul 24, 2024 at 9:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jul 17, 2024 at 5:25 PM vignesh C <vignesh21@gmail.com> wrote:

On Wed, 17 Jul 2024 at 11:54, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jul 16, 2024 at 6:54 PM vignesh C <vignesh21@gmail.com> wrote:

BTW, I noticed that we don't take any table-level locks for Create
Publication .. For ALL TABLES (and Drop Publication). Can that create
a similar problem? I haven't tested so not sure but even if there is a
problem for the Create case, it should lead to some ERROR like missing
publication.

I tested these scenarios, and as you expected, it throws an error for
the create publication case:
2024-07-17 14:50:01.145 IST [481526] 481526 ERROR: could not receive
data from WAL stream: ERROR: publication "pub1" does not exist
CONTEXT: slot "sub1", output plugin "pgoutput", in the change
callback, associated LSN 0/1510CD8
2024-07-17 14:50:01.147 IST [481450] 481450 LOG: background worker
"logical replication apply worker" (PID 481526) exited with exit code
1

The steps for this process are as follows:
1) Create tables in both the publisher and subscriber.
2) On the publisher: Create a replication slot.
3) On the subscriber: Create a subscription using the slot created by
the publisher.
4) On the publisher:
4.a) Session 1: BEGIN; INSERT INTO T1;
4.b) Session 2: CREATE PUBLICATION FOR ALL TABLES
4.c) Session 1: COMMIT;

Since we are throwing out a "publication does not exist" error, there
is no inconsistency issue here.

However, an issue persists with DROP ALL TABLES publication, where
data continues to replicate even after the publication is dropped.
This happens because the open transaction consumes the invalidation,
causing the publications to be revalidated using old snapshot. As a
result, both the open transactions and the subsequent transactions are
getting replicated.

We can reproduce this issue by following these steps in a logical
replication setup with an "ALL TABLES" publication:
On the publisher:
Session 1: BEGIN; INSERT INTO T1 VALUES (val1);
In another session on the publisher:
Session 2: DROP PUBLICATION
Back in Session 1 on the publisher:
COMMIT;
Finally, in Session 1 on the publisher:
INSERT INTO T1 VALUES (val2);

Even after dropping the publication, both val1 and val2 are still
being replicated to the subscriber. This means that both the
in-progress concurrent transaction and the subsequent transactions are
being replicated.

I don't think locking all tables is a viable solution in this case, as
it would require asking the user to refrain from performing any
operations on any of the tables in the database while creating a
publication.

Indeed, locking all tables in the database to prevent concurrent DMLs
for this scenario also looks odd to me. The other alternative
previously suggested by Andres is to distribute catalog modifying
transactions to all concurrent in-progress transactions [1] but as
mentioned this could add an overhead. One possibility to reduce
overhead is that we selectively distribute invalidations for
catalogs-related publications but I haven't analyzed the feasibility.

We need more opinions to decide here, so let me summarize the problem
and solutions discussed. As explained with an example in an email [1],
the problem related to logical decoding is that it doesn't process
invalidations corresponding to DDLs for the already in-progress
transactions. We discussed preventing DMLs in the first place when
concurrent DDLs like ALTER PUBLICATION ... ADD TABLE ... are in
progress. The solution discussed was to acquire
ShareUpdateExclusiveLock for all the tables being added via such
commands. Further analysis revealed that the same handling is required
for ALTER PUBLICATION ... ADD TABLES IN SCHEMA which means locking all
the tables in the specified schemas. Then DROP PUBLICATION also seems
to have similar symptoms which means in the worst case (where
publication is for ALL TABLES) we have to lock all the tables in the
database. We are not sure if that is good so the other alternative we
can pursue is to distribute invalidations in logical decoding
infrastructure [1] which has its downsides.

Thoughts?

Thank you for summarizing the problem and solutions!

I think it's worth trying the idea of distributing invalidation
messages, and we will see if there could be overheads or any further
obstacles. IIUC this approach would resolve another issue we discussed
before too[1].

Yes, and we also discussed having a similar solution at the time when
that problem was reported. So, it is clear that even though locking
tables can work for commands like ALTER PUBLICATION ... ADD TABLE
..., we need a solution for distributing invalidations to the
in-progress transactions during logical decoding for other cases as
reported by you previously.

Thanks for looking into this.

--
With Regards,
Amit Kapila.

#39Shlok Kyal
shlok.kyal.oss@gmail.com
In reply to: Amit Kapila (#38)
Re: long-standing data loss bug in initial sync of logical replication

On Wed, 31 Jul 2024 at 09:36, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jul 31, 2024 at 3:27 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Jul 24, 2024 at 9:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jul 17, 2024 at 5:25 PM vignesh C <vignesh21@gmail.com> wrote:

On Wed, 17 Jul 2024 at 11:54, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jul 16, 2024 at 6:54 PM vignesh C <vignesh21@gmail.com> wrote:

BTW, I noticed that we don't take any table-level locks for Create
Publication .. For ALL TABLES (and Drop Publication). Can that create
a similar problem? I haven't tested so not sure but even if there is a
problem for the Create case, it should lead to some ERROR like missing
publication.

I tested these scenarios, and as you expected, it throws an error for
the create publication case:
2024-07-17 14:50:01.145 IST [481526] 481526 ERROR: could not receive
data from WAL stream: ERROR: publication "pub1" does not exist
CONTEXT: slot "sub1", output plugin "pgoutput", in the change
callback, associated LSN 0/1510CD8
2024-07-17 14:50:01.147 IST [481450] 481450 LOG: background worker
"logical replication apply worker" (PID 481526) exited with exit code
1

The steps for this process are as follows:
1) Create tables in both the publisher and subscriber.
2) On the publisher: Create a replication slot.
3) On the subscriber: Create a subscription using the slot created by
the publisher.
4) On the publisher:
4.a) Session 1: BEGIN; INSERT INTO T1;
4.b) Session 2: CREATE PUBLICATION FOR ALL TABLES
4.c) Session 1: COMMIT;

Since we are throwing out a "publication does not exist" error, there
is no inconsistency issue here.

However, an issue persists with DROP ALL TABLES publication, where
data continues to replicate even after the publication is dropped.
This happens because the open transaction consumes the invalidation,
causing the publications to be revalidated using old snapshot. As a
result, both the open transactions and the subsequent transactions are
getting replicated.

We can reproduce this issue by following these steps in a logical
replication setup with an "ALL TABLES" publication:
On the publisher:
Session 1: BEGIN; INSERT INTO T1 VALUES (val1);
In another session on the publisher:
Session 2: DROP PUBLICATION
Back in Session 1 on the publisher:
COMMIT;
Finally, in Session 1 on the publisher:
INSERT INTO T1 VALUES (val2);

Even after dropping the publication, both val1 and val2 are still
being replicated to the subscriber. This means that both the
in-progress concurrent transaction and the subsequent transactions are
being replicated.

I don't think locking all tables is a viable solution in this case, as
it would require asking the user to refrain from performing any
operations on any of the tables in the database while creating a
publication.

Indeed, locking all tables in the database to prevent concurrent DMLs
for this scenario also looks odd to me. The other alternative
previously suggested by Andres is to distribute catalog modifying
transactions to all concurrent in-progress transactions [1] but as
mentioned this could add an overhead. One possibility to reduce
overhead is that we selectively distribute invalidations for
catalogs-related publications but I haven't analyzed the feasibility.

We need more opinions to decide here, so let me summarize the problem
and solutions discussed. As explained with an example in an email [1],
the problem related to logical decoding is that it doesn't process
invalidations corresponding to DDLs for the already in-progress
transactions. We discussed preventing DMLs in the first place when
concurrent DDLs like ALTER PUBLICATION ... ADD TABLE ... are in
progress. The solution discussed was to acquire
ShareUpdateExclusiveLock for all the tables being added via such
commands. Further analysis revealed that the same handling is required
for ALTER PUBLICATION ... ADD TABLES IN SCHEMA which means locking all
the tables in the specified schemas. Then DROP PUBLICATION also seems
to have similar symptoms which means in the worst case (where
publication is for ALL TABLES) we have to lock all the tables in the
database. We are not sure if that is good so the other alternative we
can pursue is to distribute invalidations in logical decoding
infrastructure [1] which has its downsides.

Thoughts?

Thank you for summarizing the problem and solutions!

I think it's worth trying the idea of distributing invalidation
messages, and we will see if there could be overheads or any further
obstacles. IIUC this approach would resolve another issue we discussed
before too[1].

Yes, and we also discussed having a similar solution at the time when
that problem was reported. So, it is clear that even though locking
tables can work for commands like ALTER PUBLICATION ... ADD TABLE
..., we need a solution for distributing invalidations to the
in-progress transactions during logical decoding for other cases as
reported by you previously.

Thanks for looking into this.

Thanks, I am working on implementing a solution for distributing
invalidations. Will share a patch for the same.

Thanks and Regards,
Shlok Kyal

#40Shlok Kyal
shlok.kyal.oss@gmail.com
In reply to: Shlok Kyal (#39)
1 attachment(s)
Re: long-standing data loss bug in initial sync of logical replication

On Wed, 31 Jul 2024 at 11:17, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:

On Wed, 31 Jul 2024 at 09:36, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jul 31, 2024 at 3:27 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Jul 24, 2024 at 9:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jul 17, 2024 at 5:25 PM vignesh C <vignesh21@gmail.com> wrote:

On Wed, 17 Jul 2024 at 11:54, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jul 16, 2024 at 6:54 PM vignesh C <vignesh21@gmail.com> wrote:

BTW, I noticed that we don't take any table-level locks for Create
Publication .. For ALL TABLES (and Drop Publication). Can that create
a similar problem? I haven't tested so not sure but even if there is a
problem for the Create case, it should lead to some ERROR like missing
publication.

I tested these scenarios, and as you expected, it throws an error for
the create publication case:
2024-07-17 14:50:01.145 IST [481526] 481526 ERROR: could not receive
data from WAL stream: ERROR: publication "pub1" does not exist
CONTEXT: slot "sub1", output plugin "pgoutput", in the change
callback, associated LSN 0/1510CD8
2024-07-17 14:50:01.147 IST [481450] 481450 LOG: background worker
"logical replication apply worker" (PID 481526) exited with exit code
1

The steps for this process are as follows:
1) Create tables in both the publisher and subscriber.
2) On the publisher: Create a replication slot.
3) On the subscriber: Create a subscription using the slot created by
the publisher.
4) On the publisher:
4.a) Session 1: BEGIN; INSERT INTO T1;
4.b) Session 2: CREATE PUBLICATION FOR ALL TABLES
4.c) Session 1: COMMIT;

Since we are throwing out a "publication does not exist" error, there
is no inconsistency issue here.

However, an issue persists with DROP ALL TABLES publication, where
data continues to replicate even after the publication is dropped.
This happens because the open transaction consumes the invalidation,
causing the publications to be revalidated using old snapshot. As a
result, both the open transactions and the subsequent transactions are
getting replicated.

We can reproduce this issue by following these steps in a logical
replication setup with an "ALL TABLES" publication:
On the publisher:
Session 1: BEGIN; INSERT INTO T1 VALUES (val1);
In another session on the publisher:
Session 2: DROP PUBLICATION
Back in Session 1 on the publisher:
COMMIT;
Finally, in Session 1 on the publisher:
INSERT INTO T1 VALUES (val2);

Even after dropping the publication, both val1 and val2 are still
being replicated to the subscriber. This means that both the
in-progress concurrent transaction and the subsequent transactions are
being replicated.

I don't think locking all tables is a viable solution in this case, as
it would require asking the user to refrain from performing any
operations on any of the tables in the database while creating a
publication.

Indeed, locking all tables in the database to prevent concurrent DMLs
for this scenario also looks odd to me. The other alternative
previously suggested by Andres is to distribute catalog modifying
transactions to all concurrent in-progress transactions [1] but as
mentioned this could add an overhead. One possibility to reduce
overhead is that we selectively distribute invalidations for
catalogs-related publications but I haven't analyzed the feasibility.

We need more opinions to decide here, so let me summarize the problem
and solutions discussed. As explained with an example in an email [1],
the problem related to logical decoding is that it doesn't process
invalidations corresponding to DDLs for the already in-progress
transactions. We discussed preventing DMLs in the first place when
concurrent DDLs like ALTER PUBLICATION ... ADD TABLE ... are in
progress. The solution discussed was to acquire
ShareUpdateExclusiveLock for all the tables being added via such
commands. Further analysis revealed that the same handling is required
for ALTER PUBLICATION ... ADD TABLES IN SCHEMA which means locking all
the tables in the specified schemas. Then DROP PUBLICATION also seems
to have similar symptoms which means in the worst case (where
publication is for ALL TABLES) we have to lock all the tables in the
database. We are not sure if that is good so the other alternative we
can pursue is to distribute invalidations in logical decoding
infrastructure [1] which has its downsides.

Thoughts?

Thank you for summarizing the problem and solutions!

I think it's worth trying the idea of distributing invalidation
messages, and we will see if there could be overheads or any further
obstacles. IIUC this approach would resolve another issue we discussed
before too[1].

Yes, and we also discussed having a similar solution at the time when
that problem was reported. So, it is clear that even though locking
tables can work for commands like ALTER PUBLICATION ... ADD TABLE
..., we need a solution for distributing invalidations to the
in-progress transactions during logical decoding for other cases as
reported by you previously.

Thanks for looking into this.

Thanks, I am working on implementing a solution for distributing
invalidations. Will share a patch for the same.

Created a patch for distributing invalidations.
Here we collect the invalidation messages for the current transaction
and distribute them to all the in-progress transactions whenever we
are distributing the snapshots. Thoughts?
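For anyone who wants to reproduce the underlying problem by hand
before reading the TAP test in the attachment, the scenario boils down
to roughly this psql sequence (a sketch; the table and publication
names follow the attached test, and the connection string is a
placeholder):

-- On both publisher and subscriber:
CREATE TABLE tab_conc(a int);

-- Publisher:
CREATE PUBLICATION regress_pub1;

-- Subscriber:
CREATE SUBSCRIPTION regress_sub1 CONNECTION '...'
    PUBLICATION regress_pub1;

-- Publisher, session 1: keep a transaction open on the table
BEGIN;
INSERT INTO tab_conc VALUES (1);

-- Publisher, session 2: add the table while session 1 is still open
ALTER PUBLICATION regress_pub1 ADD TABLE tab_conc;

-- Publisher, session 1:
COMMIT;
INSERT INTO tab_conc VALUES (2);

-- Subscriber:
ALTER SUBSCRIPTION regress_sub1 REFRESH PUBLICATION;

After the sync finishes, the test expects SELECT * FROM tab_conc on
the subscriber to return both rows; without the fix, changes made
around the concurrent ALTER PUBLICATION can be missing.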

Thanks and Regards,
Shlok Kyal

Attachments:

v7-0001-Distribute-invalidatons-if-change-in-catalog-tabl.patch (application/octet-stream)
From 1f154f4d84a413a6fa490b28c6f2a67a8a697647 Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Thu, 1 Aug 2024 12:16:24 +0530
Subject: [PATCH v7] Distribute invalidatons if change in catalog tables

Distribute invalidations to inprogress transactions if the current
committed transaction change any catalog table.
---
 .../replication/logical/reorderbuffer.c       |  36 +++++
 src/backend/replication/logical/snapbuild.c   |  26 +++-
 src/include/replication/reorderbuffer.h       |   7 +
 src/test/subscription/t/100_bugs.pl           | 131 ++++++++++++++++++
 4 files changed, 197 insertions(+), 3 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 00a8327e77..28694370ea 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -5308,3 +5308,39 @@ restart:
 		*cmax = ent->cmax;
 	return true;
 }
+
+/*
+ * Get a list of invalidation messages in current committed transaction
+ */
+List *
+GetInvalidationMsg(ReorderBuffer *rb, XLogRecPtr lsn, TransactionId xid)
+{
+	List	   *invalmsgs = NIL;
+	ReorderBufferTXN *txn;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	ReorderBufferChange *change;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+	ReorderBufferIterTXNInit(rb, txn, &iterstate);
+
+	while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
+	{
+		if (change->action == REORDER_BUFFER_CHANGE_INVALIDATION)
+		{
+			InvalidationMsg *invalmsg = (InvalidationMsg *) palloc(sizeof(InvalidationMsg));
+
+			invalmsg->nmsgs = change->data.inval.ninvalidations;
+			invalmsg->msgs = (SharedInvalidationMessage *) palloc(sizeof(SharedInvalidationMessage) * invalmsg->nmsgs);
+			memcpy(invalmsg->msgs, change->data.inval.invalidations, sizeof(SharedInvalidationMessage) * invalmsg->nmsgs);
+
+			invalmsgs = lappend(invalmsgs, invalmsg);
+		}
+	}
+
+	/* clean up the iterator */
+
+	ReorderBufferIterTXNFinish(rb, iterstate);
+	iterstate = NULL;
+
+	return invalmsgs;
+}
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index ae676145e6..d79b380699 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -300,7 +300,7 @@ static void SnapBuildFreeSnapshot(Snapshot snap);
 
 static void SnapBuildSnapIncRefcount(Snapshot snap);
 
-static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
+static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid, List *invalmsgs);
 
 static inline bool SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid,
 												 uint32 xinfo);
@@ -867,7 +867,7 @@ SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
  * contents).
  */
 static void
-SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
+SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid, List *invalidmsgs)
 {
 	dlist_iter	txn_i;
 	ReorderBufferTXN *txn;
@@ -913,6 +913,19 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		SnapBuildSnapIncRefcount(builder->snapshot);
 		ReorderBufferAddSnapshot(builder->reorder, txn->xid, lsn,
 								 builder->snapshot);
+
+		/*
+		 * Add invalidation messages to the reorder buffer of inprogress
+		 * transactions except the current committed transaction
+		 */
+		if (txn->xid != xid)
+		{
+			foreach_ptr(InvalidationMsg, invalmsg, invalidmsgs)
+			{
+				ReorderBufferAddInvalidations(builder->reorder, txn->xid, lsn,
+											  invalmsg->nmsgs, invalmsg->msgs);
+			}
+		}
 	}
 }
 
@@ -1156,6 +1169,8 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	/* if there's any reason to build a historic snapshot, do so now */
 	if (needs_snapshot)
 	{
+		List	   *invalmsgs;
+
 		/*
 		 * If we haven't built a complete snapshot yet there's no need to hand
 		 * it out, it wouldn't (and couldn't) be used anyway.
@@ -1184,8 +1199,13 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/* refcount of the snapshot builder for the new snapshot */
 		SnapBuildSnapIncRefcount(builder->snapshot);
 
+		/* get invalidation messages from reorder buffer */
+		invalmsgs = GetInvalidationMsg(builder->reorder, lsn, xid);
+
 		/* add a new catalog snapshot to all currently running transactions */
-		SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
+		SnapBuildDistributeNewCatalogSnapshot(builder, lsn, xid, invalmsgs);
+
+		list_free_deep(invalmsgs);
 	}
 }
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 851a001c8b..7e2d5d9661 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -664,6 +664,11 @@ struct ReorderBuffer
 	int64		totalBytes;		/* total amount of data decoded */
 };
 
+typedef struct InvalidationMsg
+{
+	uint32		nmsgs;
+	SharedInvalidationMessage *msgs;
+}			InvalidationMsg;
 
 extern ReorderBuffer *ReorderBufferAllocate(void);
 extern void ReorderBufferFree(ReorderBuffer *rb);
@@ -740,4 +745,6 @@ extern void ReorderBufferSetRestartPoint(ReorderBuffer *rb, XLogRecPtr ptr);
 
 extern void StartupReorderBuffer(void);
 
+extern List *GetInvalidationMsg(ReorderBuffer *rb, XLogRecPtr lsn, TransactionId xid);
+
 #endif
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index cb36ca7b16..82497c9d11 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -487,6 +487,137 @@ $result =
 is( $result, qq(2|f
 3|t), 'check replicated update on subscriber');
 
+# Clean up
+$node_publisher->safe_psql('postgres', qq(DROP  PUBLICATION pub1;));
+$node_subscriber->safe_psql('postgres', qq(DROP  SUBSCRIPTION sub1;));
+
+# The bug was that the incremental data synchronization was being skipped when
+# a new table is added to the publication in presence of a concurrent active
+# transaction performing the DML on the same table.
+
+# Initial setup.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE SCHEMA sch3;
+	CREATE TABLE sch3.tab_conc(a int);
+	CREATE SCHEMA sch4;
+	CREATE TABLE sch4.tab_conc(a int);
+	CREATE PUBLICATION regress_pub1;
+));
+
+$node_subscriber->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE SCHEMA sch3;
+	CREATE TABLE sch3.tab_conc(a int);
+	CREATE SCHEMA sch4;
+	CREATE TABLE sch4.tab_conc(a int);
+	CREATE SUBSCRIPTION regress_sub1 CONNECTION '$publisher_connstr' PUBLICATION regress_pub1;
+));
+
+# Bump the query timeout to avoid false negatives on slow test systems.
+my $psql_timeout_secs = 4 * $PostgreSQL::Test::Utils::timeout_default;
+
+# Initiate 3 background sessions.
+my $background_psql1 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+$background_psql1->set_query_timer_restart();
+
+my $background_psql2 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+
+$background_psql2->set_query_timer_restart();
+
+my $background_psql3 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+$background_psql3->set_query_timer_restart();
+
+# Maintain an active transaction with the table that will be added to the
+# publication.
+$background_psql1->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO tab_conc VALUES (1);
+]);
+
+# Maintain an active transaction with a schema table that will be added to the
+# publication.
+$background_psql2->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (1);
+]);
+
+# Add the table to the publication using background_psql, as the alter
+# publication operation will wait for the lock and can only be completed after
+$background_psql3->query_until(qr//,
+	"ALTER PUBLICATION regress_pub1 ADD TABLE tab_conc, TABLES IN SCHEMA sch4, sch3;\n"
+);
+
+# Complete the transaction on the tables, so that ALTER PUBLICATION can proceed
+$background_psql1->query_safe(qq[COMMIT]);
+$background_psql2->query_safe(qq[COMMIT]);
+
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (2);
+	INSERT INTO sch3.tab_conc VALUES (2);
+));
+
+# Refresh the publication.
+$node_subscriber->safe_psql('postgres',
+	'ALTER SUBSCRIPTION regress_sub1 REFRESH PUBLICATION');
+
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'regress_sub1');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2),
+	'Ensure that the data from the tab_conc table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2),
+	'Ensure that the data from the sch3.tab_conc table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+# Perform an insert.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (3);
+	INSERT INTO sch3.tab_conc VALUES (3);
+));
+$node_publisher->wait_for_catchup('regress_sub1');
+
+# Verify that the insert is replicated to the subscriber.
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table sch3.tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+$background_psql1->quit;
+$background_psql2->quit;
+$background_psql3->quit;
+
 $node_publisher->stop('fast');
 $node_subscriber->stop('fast');
 
-- 
2.34.1

#41Shlok Kyal
shlok.kyal.oss@gmail.com
In reply to: Shlok Kyal (#40)
1 attachment(s)
Re: long-standing data loss bug in initial sync of logical replication

On Thu, 8 Aug 2024 at 16:24, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:

On Wed, 31 Jul 2024 at 11:17, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:

On Wed, 31 Jul 2024 at 09:36, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jul 31, 2024 at 3:27 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Jul 24, 2024 at 9:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jul 17, 2024 at 5:25 PM vignesh C <vignesh21@gmail.com> wrote:

On Wed, 17 Jul 2024 at 11:54, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jul 16, 2024 at 6:54 PM vignesh C <vignesh21@gmail.com> wrote:

BTW, I noticed that we don't take any table-level locks for Create
Publication .. For ALL TABLES (and Drop Publication). Can that create
a similar problem? I haven't tested so not sure but even if there is a
problem for the Create case, it should lead to some ERROR like missing
publication.

I tested these scenarios, and as you expected, it throws an error for
the create publication case:
2024-07-17 14:50:01.145 IST [481526] 481526 ERROR: could not receive
data from WAL stream: ERROR: publication "pub1" does not exist
CONTEXT: slot "sub1", output plugin "pgoutput", in the change
callback, associated LSN 0/1510CD8
2024-07-17 14:50:01.147 IST [481450] 481450 LOG: background worker
"logical replication apply worker" (PID 481526) exited with exit code
1

The steps for this process are as follows:
1) Create tables in both the publisher and subscriber.
2) On the publisher: Create a replication slot.
3) On the subscriber: Create a subscription using the slot created by
the publisher.
4) On the publisher:
4.a) Session 1: BEGIN; INSERT INTO T1;
4.b) Session 2: CREATE PUBLICATION FOR ALL TABLES
4.c) Session 1: COMMIT;

Since we are throwing out a "publication does not exist" error, there
is no inconsistency issue here.

However, an issue persists with DROP ALL TABLES publication, where
data continues to replicate even after the publication is dropped.
This happens because the open transaction consumes the invalidation,
causing the publications to be revalidated using old snapshot. As a
result, both the open transactions and the subsequent transactions are
getting replicated.

We can reproduce this issue by following these steps in a logical
replication setup with an "ALL TABLES" publication:
On the publisher:
Session 1: BEGIN; INSERT INTO T1 VALUES (val1);
In another session on the publisher:
Session 2: DROP PUBLICATION
Back in Session 1 on the publisher:
COMMIT;
Finally, in Session 1 on the publisher:
INSERT INTO T1 VALUES (val2);

Even after dropping the publication, both val1 and val2 are still
being replicated to the subscriber. This means that both the
in-progress concurrent transaction and the subsequent transactions are
being replicated.

I don't think locking all tables is a viable solution in this case, as
it would require asking the user to refrain from performing any
operations on any of the tables in the database while creating a
publication.

Indeed, locking all tables in the database to prevent concurrent DMLs
for this scenario also looks odd to me. The other alternative
previously suggested by Andres is to distribute catalog modifying
transactions to all concurrent in-progress transactions [1] but as
mentioned this could add an overhead. One possibility to reduce
overhead is that we selectively distribute invalidations for
catalogs-related publications but I haven't analyzed the feasibility.

We need more opinions to decide here, so let me summarize the problem
and solutions discussed. As explained with an example in an email [1],
the problem related to logical decoding is that it doesn't process
invalidations corresponding to DDLs for the already in-progress
transactions. We discussed preventing DMLs in the first place when
concurrent DDLs like ALTER PUBLICATION ... ADD TABLE ... are in
progress. The solution discussed was to acquire
ShareUpdateExclusiveLock for all the tables being added via such
commands. Further analysis revealed that the same handling is required
for ALTER PUBLICATION ... ADD TABLES IN SCHEMA which means locking all
the tables in the specified schemas. Then DROP PUBLICATION also seems
to have similar symptoms which means in the worst case (where
publication is for ALL TABLES) we have to lock all the tables in the
database. We are not sure if that is good so the other alternative we
can pursue is to distribute invalidations in logical decoding
infrastructure [1] which has its downsides.

Thoughts?

Thank you for summarizing the problem and solutions!

I think it's worth trying the idea of distributing invalidation
messages, and we will see if there could be overheads or any further
obstacles. IIUC this approach would resolve another issue we discussed
before too[1].

Yes, and we also discussed having a similar solution at the time when
that problem was reported. So, it is clear that even though locking
tables can work for commands like ALTER PUBLICATION ... ADD TABLE
..., we need a solution for distributing invalidations to the
in-progress transactions during logical decoding for other cases as
reported by you previously.

Thanks for looking into this.

Thanks, I am working on implementing a solution for distributing
invalidations. Will share a patch for the same.

Created a patch for distributing invalidations.
Here we collect the invalidation messages for the current transaction
and distribute them to all the in-progress transactions whenever we
are distributing the snapshots. Thoughts?

In the v7 patch, I was looping through the reorder buffer of the
current committed transaction, storing all invalidation messages in a
list, and then distributing those invalidations.
But I found that we already store all the invalidation messages for a
transaction (see [1]). So we don't need to loop through the reorder
buffer and store the invalidations.

I have modified the patch accordingly and attached the same.

[1]: https://github.com/postgres/postgres/blob/7da1bdc2c2f17038f2ae1900be90a0d7b5e361e0/src/include/replication/reorderbuffer.h#L384

Thanks and Regards,
Shlok Kyal

Attachments:

v8-0001-Distribute-invalidatons-if-change-in-catalog-tabl.patch (application/octet-stream)
From 1e08c53164bc37736b5cdb87f367215cdbfbae84 Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Thu, 1 Aug 2024 12:16:24 +0530
Subject: [PATCH v8] Distribute invalidatons if change in catalog tables

Distribute invalidations to inprogress transactions if the current
committed transaction change any catalog table.
---
 .../replication/logical/reorderbuffer.c       |   5 +-
 src/backend/replication/logical/snapbuild.c   |  34 +++--
 src/include/replication/reorderbuffer.h       |   4 +
 src/test/subscription/t/100_bugs.pl           | 131 ++++++++++++++++++
 4 files changed, 160 insertions(+), 14 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 00a8327e77..2028e081e3 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -221,9 +221,6 @@ int			debug_logical_replication_streaming = DEBUG_LOGICAL_REP_STREAMING_BUFFERED
  */
 static ReorderBufferTXN *ReorderBufferGetTXN(ReorderBuffer *rb);
 static void ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static ReorderBufferTXN *ReorderBufferTXNByXid(ReorderBuffer *rb,
-											   TransactionId xid, bool create, bool *is_new,
-											   XLogRecPtr lsn, bool create_as_top);
 static void ReorderBufferTransferSnapToParent(ReorderBufferTXN *txn,
 											  ReorderBufferTXN *subtxn);
 
@@ -619,7 +616,7 @@ ReorderBufferReturnRelids(ReorderBuffer *rb, Oid *relids)
  * (with the given LSN, and as top transaction if that's specified);
  * when this happens, is_new is set to true.
  */
-static ReorderBufferTXN *
+ReorderBufferTXN *
 ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 					  bool *is_new, XLogRecPtr lsn, bool create_as_top)
 {
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index ae676145e6..fa8871d3d0 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -300,7 +300,7 @@ static void SnapBuildFreeSnapshot(Snapshot snap);
 
 static void SnapBuildSnapIncRefcount(Snapshot snap);
 
-static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
+static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid);
 
 static inline bool SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid,
 												 uint32 xinfo);
@@ -859,18 +859,21 @@ SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
 }
 
 /*
- * Add a new Snapshot to all transactions we're decoding that currently are
- * in-progress so they can see new catalog contents made by the transaction
- * that just committed. This is necessary because those in-progress
- * transactions will use the new catalog's contents from here on (at the very
- * least everything they do needs to be compatible with newer catalog
- * contents).
+ * Add a new Snapshot and invalidation messages to all transactions we're
+ * decoding that currently are in-progress so they can see new catalog contents
+ * made by the transaction that just committed. This is necessary because those
+ * in-progress transactions will use the new catalog's contents from here on
+ * (at the very least everything they do needs to be compatible with newer
+ * catalog contents).
  */
 static void
-SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
+SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid)
 {
 	dlist_iter	txn_i;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *curr_txn;
+
+	curr_txn = ReorderBufferTXNByXid(builder->reorder, xid, false, NULL, InvalidXLogRecPtr, false);
 
 	/*
 	 * Iterate through all toplevel transactions. This can include
@@ -913,6 +916,14 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		SnapBuildSnapIncRefcount(builder->snapshot);
 		ReorderBufferAddSnapshot(builder->reorder, txn->xid, lsn,
 								 builder->snapshot);
+
+		/*
+		 * Add invalidation messages to the reorder buffer of inprogress
+		 * transactions except the current committed transaction
+		 */
+		if (txn->xid != xid && curr_txn->ninvalidations > 0)
+			ReorderBufferAddInvalidations(builder->reorder, txn->xid, lsn,
+										  curr_txn->ninvalidations, curr_txn->invalidations);
 	}
 }
 
@@ -1184,8 +1195,11 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/* refcount of the snapshot builder for the new snapshot */
 		SnapBuildSnapIncRefcount(builder->snapshot);
 
-		/* add a new catalog snapshot to all currently running transactions */
-		SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
+		/*
+		 * add a new catalog snapshot and invalidations messages to all
+		 * currently running transactions
+		 */
+		SnapBuildDistributeNewCatalogSnapshot(builder, lsn, xid);
 	}
 }
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 851a001c8b..95eda20128 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -738,6 +738,10 @@ extern TransactionId *ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb);
 
 extern void ReorderBufferSetRestartPoint(ReorderBuffer *rb, XLogRecPtr ptr);
 
+extern ReorderBufferTXN *ReorderBufferTXNByXid(ReorderBuffer *rb,
+											   TransactionId xid, bool create, bool *is_new,
+											   XLogRecPtr lsn, bool create_as_top);
+
 extern void StartupReorderBuffer(void);
 
 #endif
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index cb36ca7b16..82497c9d11 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -487,6 +487,137 @@ $result =
 is( $result, qq(2|f
 3|t), 'check replicated update on subscriber');
 
+# Clean up
+$node_publisher->safe_psql('postgres', qq(DROP  PUBLICATION pub1;));
+$node_subscriber->safe_psql('postgres', qq(DROP  SUBSCRIPTION sub1;));
+
+# The bug was that the incremental data synchronization was being skipped when
+# a new table is added to the publication in presence of a concurrent active
+# transaction performing the DML on the same table.
+
+# Initial setup.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE SCHEMA sch3;
+	CREATE TABLE sch3.tab_conc(a int);
+	CREATE SCHEMA sch4;
+	CREATE TABLE sch4.tab_conc(a int);
+	CREATE PUBLICATION regress_pub1;
+));
+
+$node_subscriber->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE SCHEMA sch3;
+	CREATE TABLE sch3.tab_conc(a int);
+	CREATE SCHEMA sch4;
+	CREATE TABLE sch4.tab_conc(a int);
+	CREATE SUBSCRIPTION regress_sub1 CONNECTION '$publisher_connstr' PUBLICATION regress_pub1;
+));
+
+# Bump the query timeout to avoid false negatives on slow test systems.
+my $psql_timeout_secs = 4 * $PostgreSQL::Test::Utils::timeout_default;
+
+# Initiate 3 background sessions.
+my $background_psql1 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+$background_psql1->set_query_timer_restart();
+
+my $background_psql2 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+
+$background_psql2->set_query_timer_restart();
+
+my $background_psql3 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+$background_psql3->set_query_timer_restart();
+
+# Maintain an active transaction with the table that will be added to the
+# publication.
+$background_psql1->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO tab_conc VALUES (1);
+]);
+
+# Maintain an active transaction with a schema table that will be added to the
+# publication.
+$background_psql2->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (1);
+]);
+
+# Add the table to the publication using background_psql, as the alter
+# publication operation will wait for the lock and can only be completed after
+$background_psql3->query_until(qr//,
+	"ALTER PUBLICATION regress_pub1 ADD TABLE tab_conc, TABLES IN SCHEMA sch4, sch3;\n"
+);
+
+# Complete the transaction on the tables, so that ALTER PUBLICATION can proceed
+$background_psql1->query_safe(qq[COMMIT]);
+$background_psql2->query_safe(qq[COMMIT]);
+
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (2);
+	INSERT INTO sch3.tab_conc VALUES (2);
+));
+
+# Refresh the publication.
+$node_subscriber->safe_psql('postgres',
+	'ALTER SUBSCRIPTION regress_sub1 REFRESH PUBLICATION');
+
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'regress_sub1');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2),
+	'Ensure that the data from the tab_conc table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2),
+	'Ensure that the data from the sch3.tab_conc table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+# Perform an insert.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (3);
+	INSERT INTO sch3.tab_conc VALUES (3);
+));
+$node_publisher->wait_for_catchup('regress_sub1');
+
+# Verify that the insert is replicated to the subscriber.
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table sch3.tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+$background_psql1->quit;
+$background_psql2->quit;
+$background_psql3->quit;
+
 $node_publisher->stop('fast');
 $node_subscriber->stop('fast');
 
-- 
2.34.1

#42vignesh C
vignesh21@gmail.com
In reply to: Shlok Kyal (#40)
1 attachment(s)
Re: long-standing data loss bug in initial sync of logical replication

On Thu, 8 Aug 2024 at 16:24, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:

On Wed, 31 Jul 2024 at 11:17, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:

On Wed, 31 Jul 2024 at 09:36, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jul 31, 2024 at 3:27 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Jul 24, 2024 at 9:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jul 17, 2024 at 5:25 PM vignesh C <vignesh21@gmail.com> wrote:

On Wed, 17 Jul 2024 at 11:54, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jul 16, 2024 at 6:54 PM vignesh C <vignesh21@gmail.com> wrote:

BTW, I noticed that we don't take any table-level locks for Create
Publication .. For ALL TABLES (and Drop Publication). Can that create
a similar problem? I haven't tested so not sure but even if there is a
problem for the Create case, it should lead to some ERROR like missing
publication.

I tested these scenarios, and as you expected, it throws an error for
the create publication case:
2024-07-17 14:50:01.145 IST [481526] 481526 ERROR: could not receive
data from WAL stream: ERROR: publication "pub1" does not exist
CONTEXT: slot "sub1", output plugin "pgoutput", in the change
callback, associated LSN 0/1510CD8
2024-07-17 14:50:01.147 IST [481450] 481450 LOG: background worker
"logical replication apply worker" (PID 481526) exited with exit code
1

The steps for this process are as follows:
1) Create tables in both the publisher and subscriber.
2) On the publisher: Create a replication slot.
3) On the subscriber: Create a subscription using the slot created by
the publisher.
4) On the publisher:
4.a) Session 1: BEGIN; INSERT INTO T1;
4.b) Session 2: CREATE PUBLICATION FOR ALL TABLES
4.c) Session 1: COMMIT;

Since we are throwing out a "publication does not exist" error, there
is no inconsistency issue here.

However, an issue persists with dropping an ALL TABLES publication, where
data continues to replicate even after the publication is dropped.
This happens because the open transaction consumes the invalidation,
causing the publication to be revalidated using an old snapshot. As a
result, both the open transaction and the subsequent transactions keep
getting replicated.

We can reproduce this issue by following these steps in a logical
replication setup with an "ALL TABLES" publication:
On the publisher:
Session 1: BEGIN; INSERT INTO T1 VALUES (val1);
In another session on the publisher:
Session 2: DROP PUBLICATION
Back in Session 1 on the publisher:
COMMIT;
Finally, in Session 1 on the publisher:
INSERT INTO T1 VALUES (val2);

Even after dropping the publication, both val1 and val2 are still
being replicated to the subscriber. This means that both the
in-progress concurrent transaction and the subsequent transactions are
being replicated.

I don't think locking all tables is a viable solution in this case, as
it would require asking the user to refrain from performing any
operations on any of the tables in the database while creating a
publication.

Indeed, locking all tables in the database to prevent concurrent DMLs
for this scenario also looks odd to me. The other alternative
previously suggested by Andres is to distribute catalog modifying
transactions to all concurrent in-progress transactions [1] but as
mentioned this could add an overhead. One possibility to reduce
overhead is that we selectively distribute invalidations for
catalogs-related publications but I haven't analyzed the feasibility.

We need more opinions to decide here, so let me summarize the problem
and solutions discussed. As explained with an example in an email [1],
the problem related to logical decoding is that it doesn't process
invalidations corresponding to DDLs for the already in-progress
transactions. We discussed preventing DMLs in the first place when
concurrent DDLs like ALTER PUBLICATION ... ADD TABLE ... are in
progress. The solution discussed was to acquire
ShareUpdateExclusiveLock for all the tables being added via such
commands. Further analysis revealed that the same handling is required
for ALTER PUBLICATION ... ADD TABLES IN SCHEMA which means locking all
the tables in the specified schemas. Then DROP PUBLICATION also seems
to have similar symptoms which means in the worst case (where
publication is for ALL TABLES) we have to lock all the tables in the
database. We are not sure if that is good so the other alternative we
can pursue is to distribute invalidations in logical decoding
infrastructure [1] which has its downsides.

Thoughts?

Thank you for summarizing the problem and solutions!

I think it's worth trying the idea of distributing invalidation
messages, and we will see if there could be overheads or any further
obstacles. IIUC this approach would resolve another issue we discussed
before too[1].

Yes, and we also discussed having a similar solution at the time when
that problem was reported. So, it is clear that even though locking
tables can work for commands like ALTER PUBLICATION ... ADD TABLE
..., we need a solution for distributing invalidations to the
in-progress transactions during logical decoding for other cases as
reported by you previously.

Thanks for looking into this.

Thanks, I am working on to implement a solution for distributing
invalidations. Will share a patch for the same.

Created a patch for distributing invalidations.
Here we collect the invalidation messages for the current transaction
and distribute them to all the in-progress transactions whenever we are
distributing the snapshots. Thoughts?

Since we are applying invalidations to all in-progress transactions,
the publisher will only replicate half of the transaction data up to
the point of invalidation, while the remaining half will not be
replicated.
Ex:
Session1:
BEGIN;
INSERT INTO tab_conc VALUES (1);

Session2:
ALTER PUBLICATION regress_pub1 DROP TABLE tab_conc;

Session1:
INSERT INTO tab_conc VALUES (2);
INSERT INTO tab_conc VALUES (3);
COMMIT;

After the above the subscriber data looks like:
postgres=# select * from tab_conc ;
a
---
1
(1 row)

You can reproduce the issue using the attached test.
I'm not sure if this behavior is ok. At present, we’ve replicated the
first record within the same transaction, but the second and third
records are being skipped. Would it be better to apply invalidations
after the transaction is underway?
Thoughts?

Regards,
Vignesh

Attachments:

test_issue_reproduce.patchtext/x-patch; charset=US-ASCII; name=test_issue_reproduce.patchDownload
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index cb36ca7b16..b5d9749bdf 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -487,6 +487,145 @@ $result =
 is( $result, qq(2|f
 3|t), 'check replicated update on subscriber');
 
+# cleanup
+$node_publisher->safe_psql('postgres', qq(DROP PUBLICATION pub1;));
+$node_subscriber->safe_psql('postgres', qq(DROP SUBSCRIPTION sub1;));
+
+# =============================================================================
+# The bug was that the incremental data synchronization was being skipped when
+# a new table is added to the publication in presence of a concurrent active
+# transaction performing the DML on the same table.
+# =============================================================================
+
+# Initial setup.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE PUBLICATION regress_pub1;
+));
+
+$node_subscriber->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE SUBSCRIPTION regress_sub1 CONNECTION '$publisher_connstr' PUBLICATION regress_pub1;
+
+));
+
+# Bump the query timeout to avoid false negatives on slow test systems.
+my $psql_timeout_secs = 4 * $PostgreSQL::Test::Utils::timeout_default;
+
+# Initiate a background session that keeps a transaction active.
+my $background_psql1 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+
+# Maintain an active transaction with the table.
+$background_psql1->set_query_timer_restart();
+$background_psql1->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO tab_conc VALUES (1);
+]);
+
+# Add the table to the publication using background_psql, as the alter
+# publication operation will wait for the lock and can only be completed after
+# the previous open transaction is committed.
+my $background_psql2 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+
+$background_psql2->set_query_timer_restart();
+
+# This operation will wait because there is an open transaction holding a lock.
+$background_psql2->query_until(qr//,
+	"ALTER PUBLICATION regress_pub1 ADD TABLE tab_conc;\n");
+
+# Complete the old transaction.
+$background_psql1->query_safe(qq[COMMIT]);
+
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (2);
+));
+
+# Refresh the publication.
+$node_subscriber->safe_psql('postgres',
+	'ALTER SUBSCRIPTION regress_sub1 REFRESH PUBLICATION');
+
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'regress_sub1');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2),
+	'Ensure that the data from the tab_conc table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+# Perform an insert.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (3);
+));
+$node_publisher->wait_for_catchup('regress_sub1');
+
+# Verify that the insert is replicated to the subscriber.
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+# =============================================================================
+# This bug is present with ALTER PUBLICATION ... DROP TABLE.
+# =============================================================================
+$background_psql1->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO tab_conc VALUES (4);
+]);
+
+# wait for WAL to be generated
+sleep(1);
+
+# This operation will wait because there is an open transaction holding a lock.
+$background_psql2->query_until(qr//,
+	"ALTER PUBLICATION regress_pub1 DROP TABLE tab_conc;\n");
+
+# wait for WAL to be generated
+sleep(1);
+
+# Complete the old transaction.
+$background_psql1->query_safe(
+	qq[
+	INSERT INTO tab_conc VALUES (5);
+	INSERT INTO tab_conc VALUES (6);
+	COMMIT;
+]);
+#$background_psql1->query_safe(qq[COMMIT]);
+$background_psql1->quit;
+
+# Wait till the tables are dropped from the publication.
+$node_publisher->poll_query_until('postgres',
+	"SELECT COUNT(1) = 0 FROM pg_publication_rel WHERE prrelid IN ('tab_conc'::regclass);"
+  )
+  or die
+  "Timed out while waiting for alter publication to add the table to the publication";
+
+$node_publisher->wait_for_catchup('regress_sub1');
+
+# Verify that the insert before drop table is replicated to the subscriber.
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3
+4
+5
+6),
+	'Verify that the incremental data for table tab_conc before removing table from publication is replicated to the subscriber'
+);
+
 $node_publisher->stop('fast');
 $node_subscriber->stop('fast');
 
#43Amit Kapila
amit.kapila16@gmail.com
In reply to: vignesh C (#42)
Re: long-standing data loss bug in initial sync of logical replication

On Thu, Aug 15, 2024 at 9:31 PM vignesh C <vignesh21@gmail.com> wrote:

On Thu, 8 Aug 2024 at 16:24, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:

On Wed, 31 Jul 2024 at 11:17, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:

Created a patch for distributing invalidations.
Here we collect the invalidation messages for the current transaction
and distribute them to all the in-progress transactions whenever we are
distributing the snapshots. Thoughts?

Since we are applying invalidations to all in-progress transactions,
the publisher will only replicate half of the transaction data up to
the point of invalidation, while the remaining half will not be
replicated.
Ex:
Session1:
BEGIN;
INSERT INTO tab_conc VALUES (1);

Session2:
ALTER PUBLICATION regress_pub1 DROP TABLE tab_conc;

Session1:
INSERT INTO tab_conc VALUES (2);
INSERT INTO tab_conc VALUES (3);
COMMIT;

After the above the subscriber data looks like:
postgres=# select * from tab_conc ;
a
---
1
(1 row)

You can reproduce the issue using the attached test.
I'm not sure if this behavior is ok. At present, we’ve replicated the
first record within the same transaction, but the second and third
records are being skipped.

This can happen even without a concurrent DDL if some of the tables in
the database are part of the publication and others are not. In such a
case inserts for published tables will be replicated but other
inserts won't. Sending the partial data of the transaction isn't a
problem to me. Do you have any other concerns that I am missing?
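For example (a sketch; assume tab_pub is in the publication and tab_nopub
is not, both names made up):

BEGIN;
INSERT INTO tab_pub VALUES (1);     -- decoded and sent to the subscriber
INSERT INTO tab_nopub VALUES (1);   -- filtered out, never sent
COMMIT;

Only the tab_pub row shows up on the subscriber, so the subscriber already
sees a partial image of this transaction.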

Would it be better to apply invalidations
after the transaction is underway?

But that won't fix the problem reported by Sawada-san in an email [1]/messages/by-id/CAD21AoAenVqiMjpN-PvGHL1N9DWnHSq673bfgr6phmBUzx=kLQ@mail.gmail.com.

BTW, we should do some performance testing by having a mix of DML and
DDLs to see the performance impact of this patch.

[1]: /messages/by-id/CAD21AoAenVqiMjpN-PvGHL1N9DWnHSq673bfgr6phmBUzx=kLQ@mail.gmail.com

--
With Regards,
Amit Kapila.

#44vignesh C
vignesh21@gmail.com
In reply to: Amit Kapila (#43)
Re: long-standing data loss bug in initial sync of logical replication

On Tue, 20 Aug 2024 at 16:10, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Aug 15, 2024 at 9:31 PM vignesh C <vignesh21@gmail.com> wrote:

On Thu, 8 Aug 2024 at 16:24, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:

On Wed, 31 Jul 2024 at 11:17, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:

Created a patch for distributing invalidations.
Here we collect the invalidation messages for the current transaction
and distribute them to all the in-progress transactions whenever we are
distributing the snapshots. Thoughts?

Since we are applying invalidations to all in-progress transactions,
the publisher will only replicate half of the transaction data up to
the point of invalidation, while the remaining half will not be
replicated.
Ex:
Session1:
BEGIN;
INSERT INTO tab_conc VALUES (1);

Session2:
ALTER PUBLICATION regress_pub1 DROP TABLE tab_conc;

Session1:
INSERT INTO tab_conc VALUES (2);
INSERT INTO tab_conc VALUES (3);
COMMIT;

After the above the subscriber data looks like:
postgres=# select * from tab_conc ;
a
---
1
(1 row)

You can reproduce the issue using the attached test.
I'm not sure if this behavior is ok. At present, we’ve replicated the
first record within the same transaction, but the second and third
records are being skipped.

This can happen even without a concurrent DDL if some of the tables in
the database are part of the publication and others are not. In such a
case inserts for published tables will be replicated but other
inserts won't. Sending the partial data of the transaction isn't a
problem to me. Do you have any other concerns that I am missing?

My main concern was about sending only part of a table's data from a
transaction and leaving out the rest. However, since this is
happening elsewhere as well, I'm okay with it.

Regards,
Vignesh

#45Shlok Kyal
shlok.kyal.oss@gmail.com
In reply to: Amit Kapila (#43)
4 attachment(s)
Re: long-standing data loss bug in initial sync of logical replication

BTW, we should do some performance testing by having a mix of DML and
DDLs to see the performance impact of this patch.

[1] - /messages/by-id/CAD21AoAenVqiMjpN-PvGHL1N9DWnHSq673bfgr6phmBUzx=kLQ@mail.gmail.com

I did some performance testing and I found some performance impact for
the following case:

1. Created a publisher/subscriber setup on a single table, say 'tab_conc1'.
2. Created a second publisher/subscriber setup on a single table, say 'tp'.
3. Created 'tcount' no. of tables. These tables are not part of any publication.
4. There are two sessions running in parallel, let's say S1 and S2.
5. Begin a transaction in S1.
6. Now in a loop (this loop runs 100 times):
S1: Insert a row in table 'tab_conc1'
S1: Insert a row in all 'tcount' tables.
S2: BEGIN; Alter publication for 2nd publication; COMMIT;
The current logic in the patch will call the function
'rel_sync_cache_publication_cb' during invalidation. This will
invalidate the cache for all the tables, i.e. the cache entries for
'tab_conc1' and all the 'tcount' tables will be invalidated.
7. COMMIT the transaction in S1.
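In SQL terms, one iteration of the loop in step 6 looks roughly like this
(a sketch; 'pub2' and 'tcount_tab_1' are made-up names, and the attached
test.pl drives the sessions via psql):

-- Session S1 (inside the transaction opened in step 5)
INSERT INTO tab_conc1 VALUES (1);
INSERT INTO tcount_tab_1 VALUES (1);   -- repeated for each of the 'tcount' tables

-- Session S2
BEGIN;
ALTER PUBLICATION pub2 ADD TABLE tp;   -- or DROP TABLE tp on the next iteration
COMMIT;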

The performance in this case is:
No. of tables | With patch (in ms) | With head (in ms)
-----------------------------------------------------------------------------
tcount = 100 | 101376.4 | 101357.8
tcount = 1000 | 994085.4 | 993471.4

For 100 tables the performance is slower by 0.018% and for 1000 tables
it is slower by 0.06%.
These results are the average of 5 runs.

Other than this, I tested the following cases and did not find any
performance impact:
1. with 'tcount = 10'.
2. with 'tcount = 0' and running the loop 1000 times.

I have also attached the test script and the machine configurations on
which performance testing was done.
Next I am planning to test solely on the logical decoding side and
will share the results.

Thanks and Regards,
Shlok Kyal

Attachments:

os_info.txttext/plain; charset=US-ASCII; name=os_info.txtDownload
cpu_info.txttext/plain; charset=US-ASCII; name=cpu_info.txtDownload
memory_info.txttext/plain; charset=US-ASCII; name=memory_info.txtDownload
test.plapplication/octet-stream; name=test.plDownload
#46Amit Kapila
amit.kapila16@gmail.com
In reply to: Shlok Kyal (#45)
Re: long-standing data loss bug in initial sync of logical replication

On Fri, Aug 30, 2024 at 3:06 PM Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:

Next I am planning to test solely on the logical decoding side and
will share the results.

Thanks, the next set of proposed tests makes sense to me. It will also
be useful to generate some worst-case scenarios where the number of
invalidations is more to see the distribution cost in such cases. For
example, Truncate/Drop a table with 100 or 1000 partitions.

--
With Regards,
Amit Kapila.

#47Nitin Motiani
nitinmotiani@google.com
In reply to: Amit Kapila (#43)
Re: long-standing data loss bug in initial sync of logical replication

On Tue, Aug 20, 2024 at 4:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Aug 15, 2024 at 9:31 PM vignesh C <vignesh21@gmail.com> wrote:

Since we are applying invalidations to all in-progress transactions,
the publisher will only replicate half of the transaction data up to
the point of invalidation, while the remaining half will not be
replicated.
Ex:
Session1:
BEGIN;
INSERT INTO tab_conc VALUES (1);

Session2:
ALTER PUBLICATION regress_pub1 DROP TABLE tab_conc;

Session1:
INSERT INTO tab_conc VALUES (2);
INSERT INTO tab_conc VALUES (3);
COMMIT;

After the above the subscriber data looks like:
postgres=# select * from tab_conc ;
a
---
1
(1 row)

You can reproduce the issue using the attached test.
I'm not sure if this behavior is ok. At present, we’ve replicated the
first record within the same transaction, but the second and third
records are being skipped.

This can happen even without a concurrent DDL if some of the tables in
the database are part of the publication and others are not. In such a
case inserts for published tables will be replicated but other
inserts won't. Sending the partial data of the transaction isn't a
problem to me. Do you have any other concerns that I am missing?

Hi,

I think that the partial data replication for one table is a bigger
issue than the case of data being sent for a subset of the tables in
the transaction. This can lead to inconsistent data if the same row is
updated multiple times or deleted in the same transaction. In such a
case if only the partial updates from the transaction are sent to the
subscriber, it might end up with the data which was never visible on
the publisher side.

Here is an example I tried with the patch v8-001 :

I created following 2 tables on the publisher and the subscriber :

CREATE TABLE delete_test(id int primary key, name varchar(100));
CREATE TABLE update_test(id int primary key, name varchar(100));

I added both the tables to the publication p on the publisher and
created a subscription s on the subscriber.
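In SQL terms, that setup is roughly the following (the connection string
below is only a placeholder):

-- On the publisher
CREATE PUBLICATION p FOR TABLE delete_test, update_test;

-- On the subscriber
CREATE SUBSCRIPTION s
    CONNECTION 'host=publisher_host dbname=postgres'
    PUBLICATION p;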

I run 2 sessions on the publisher and do the following :

Session 1 :
BEGIN;
INSERT INTO delete_test VALUES(0, 'Nitin');

Session 2 :
ALTER PUBLICATION p DROP TABLE delete_test;

Session 1 :
DELETE FROM delete_test WHERE id=0;
COMMIT;

After the commit there should be no new row created on the publisher.
But because the partial data was replicated, this is what the select
on the subscriber shows :

SELECT * FROM delete_test;
id | name
----+-----------
0 | Nitin
(1 row)

I don't think the above is a common use case. But this is still an
issue because the subscriber has the data which never existed on the
publisher.

Similar issue can be seen with an update command.

Session 1 :
BEGIN;
INSERT INTO update_test VALUES(1, 'Chiranjiv');

Session 2 :
ALTER PUBLICATION p DROP TABLE update_test;

Session 1:
UPDATE update_test SET name='Eeshan' where id=1;
COMMIT;

After the commit, this is the state on the publisher :
SELECT * FROM update_test;
1 | Eeshan
(1 row)

While this is the state on the subscriber :
SELECT * FROM update_test;
1 | Chiranjiv
(1 row)

I think the update during a transaction scenario might be more common
than deletion right after insertion. But both of these seem like real
issues to consider. Please let me know if I'm missing something.

Thanks & Regards
Nitin Motiani
Google

#48Amit Kapila
amit.kapila16@gmail.com
In reply to: Nitin Motiani (#47)
Re: long-standing data loss bug in initial sync of logical replication

On Mon, Sep 2, 2024 at 9:19 PM Nitin Motiani <nitinmotiani@google.com> wrote:

I think that the partial data replication for one table is a bigger
issue than the case of data being sent for a subset of the tables in
the transaction. This can lead to inconsistent data if the same row is
updated multiple times or deleted in the same transaction. In such a
case if only the partial updates from the transaction are sent to the
subscriber, it might end up with the data which was never visible on
the publisher side.

Here is an example I tried with the patch v8-001 :

I created following 2 tables on the publisher and the subscriber :

CREATE TABLE delete_test(id int primary key, name varchar(100));
CREATE TABLE update_test(id int primary key, name varchar(100));

I added both the tables to the publication p on the publisher and
created a subscription s on the subscriber.

I run 2 sessions on the publisher and do the following :

Session 1 :
BEGIN;
INSERT INTO delete_test VALUES(0, 'Nitin');

Session 2 :
ALTER PUBLICATION p DROP TABLE delete_test;

Session 1 :
DELETE FROM delete_test WHERE id=0;
COMMIT;

After the commit there should be no new row created on the publisher.
But because the partial data was replicated, this is what the select
on the subscriber shows :

SELECT * FROM delete_test;
id | name
----+-----------
0 | Nitin
(1 row)

I don't think the above is a common use case. But this is still an
issue because the subscriber has the data which never existed on the
publisher.

I don't think that is the correct conclusion because the user has
intentionally avoided sending part of the transaction changes. This
can happen in various ways without the patch as well. For example, if
the user has performed the ALTER in the same transaction.

Publisher:
=========
BEGIN
postgres=*# Insert into delete_test values(0, 'Nitin');
INSERT 0 1
postgres=*# Alter Publication pub1 drop table delete_test;
ALTER PUBLICATION
postgres=*# Delete from delete_test where id=0;
DELETE 1
postgres=*# commit;
COMMIT
postgres=# select * from delete_test;
id | name
----+------
(0 rows)

Subscriber:
=========
postgres=# select * from delete_test;
id | name
----+-------
0 | Nitin
(1 row)

This can also happen when the user has published only 'inserts' but
not 'updates' or 'deletes'.
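For reference, that case is just a publication created (or altered) with a
restricted 'publish' option, e.g. (publication name is illustrative):

-- Only INSERTs are published; UPDATEs and DELETEs on delete_test are never
-- replicated to the subscriber.
CREATE PUBLICATION pub_ins FOR TABLE delete_test WITH (publish = 'insert');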

--
With Regards,
Amit Kapila.

#49Shlok Kyal
shlok.kyal.oss@gmail.com
In reply to: Amit Kapila (#46)
1 attachment(s)
Re: long-standing data loss bug in initial sync of logical replication

On Mon, 2 Sept 2024 at 10:12, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Aug 30, 2024 at 3:06 PM Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:

Next I am planning to test solely on the logical decoding side and
will share the results.

Thanks, the next set of proposed tests makes sense to me. It will also
be useful to generate some worst-case scenarios where the number of
invalidations is more to see the distribution cost in such cases. For
example, Truncate/Drop a table with 100 or 1000 partitions.

--
With Regards,
Amit Kapila.

Hi,

I did some performance testing solely on the logical decoding side and
found some degradation in performance for the following test case:
1. Created a publication on a single table, say 'tab_conc1'.
2. Created a second publication on a single table, say 'tp'.
3. Two sessions are running in parallel, let's say S1 and S2.
4. Begin a transaction in S1.
5. Now in a loop (this loop runs 'count' times):
S1: Insert a row in table 'tab_conc1'
S2: BEGIN; Alter publication DROP/ADD tp; COMMIT
6. COMMIT the transaction in S1.
7. Run 'pg_logical_slot_get_binary_changes' to get the decoding changes.

Observation:
With the fix, a new entry is added to the decoded output. During
debugging I found that this entry only appears when we do an 'INSERT' in
session S1 after an 'ALTER PUBLICATION' in the other session (i.e. due to
the invalidation). I also observed that this new entry is related to
sending the replica identity, attributes, etc., as the function
'logicalrep_write_rel' is called.
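The extra entry can be confirmed directly from the slot by filtering on the
first byte of each decoded message; LOGICAL_REP_MSG_RELATION is 'R' (byte
value 82). A rough sketch, with an illustrative slot and publication name:

-- Count RELATION ('R') messages without consuming the slot's changes.
SELECT count(*)
FROM pg_logical_slot_peek_binary_changes('test_slot', NULL, NULL,
         'proto_version', '4', 'publication_names', 'pub1')
WHERE get_byte(data, 0) = 82;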

Performance:
We see a performance degradation as we are sending new entries during
logical decoding. Results are an average of 5 runs.

count | Head (sec) | Fix (sec) | Degradation (%)
------------------------------------------------------------------------------
10000 | 1.298 | 1.574 | 21.26348228
50000 | 22.892 | 24.997 | 9.195352088
100000 | 88.602 | 93.759 | 5.820410374

I have also attached the test script here.

Thanks and Regards,
Shlok Kyal

Attachments:

test2.plapplication/octet-stream; name=test2.plDownload
#50Shlok Kyal
shlok.kyal.oss@gmail.com
In reply to: Amit Kapila (#46)
1 attachment(s)
Re: long-standing data loss bug in initial sync of logical replication

On Mon, 2 Sept 2024 at 10:12, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Aug 30, 2024 at 3:06 PM Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:

Next I am planning to test solely on the logical decoding side and
will share the results.

Thanks, the next set of proposed tests makes sense to me. It will also
be useful to generate some worst-case scenarios where the number of
invalidations is more to see the distribution cost in such cases. For
example, Truncate/Drop a table with 100 or 1000 partitions.

--
With Regards,
Amit Kapila.

Also, I did testing with a partitioned table, to cover the scenario
where the number of invalidations distributed is large.
Following is the test case:
1. Created a publication on a single table, say 'tconc_1'.
2. Created a second publication on a partitioned table, say 'tp'.
3. Created 'tcount' partitions for the table 'tp'.
4. Two sessions are running in parallel, let's say S1 and S2.
5. Begin a transaction in S1.
6. S1: Insert a row in table 'tconc_1'
S2: BEGIN; TRUNCATE TABLE tp; COMMIT;
With the patch, this will add 'tcount * 3' invalidation messages to the
transaction in session 1.
S1: Insert a row in table 'tconc_1'
7. COMMIT the transaction in S1.
8. run 'pg_logical_slot_get_binary_changes' to get the decoding changes.
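For reference, the partitioned table for this kind of worst case can be
generated as follows (a sketch; names and the partition count are
illustrative):

-- Partitioned table with 1000 partitions; TRUNCATE tp then touches all of
-- them and generates the corresponding invalidation messages.
CREATE TABLE tp (a int) PARTITION BY RANGE (a);

DO $$
BEGIN
    FOR i IN 0..999 LOOP
        EXECUTE format(
            'CREATE TABLE tp_%s PARTITION OF tp FOR VALUES FROM (%s) TO (%s)',
            i, i, i + 1);
    END LOOP;
END
$$;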

Performance:
We see a degradation in performance. Results are an average of 5 runs.

count of partitions | Head (sec) | Fix (sec) | Degradation (%)
-------------------------------------------------------------------------------------
1000 | 0.114 | 0.118 | 3.50877193
5000 | 0.502 | 0.522 | 3.984063745
10000 | 1.012 | 1.024 | 1.185770751

I have also attached the test script here and will do further testing.

Thanks and Regards,
Shlok Kyal

Attachments:

test3.plapplication/octet-stream; name=test3.plDownload
#51Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Shlok Kyal (#41)
1 attachment(s)
RE: long-standing data loss bug in initial sync of logical replication

On Friday, August 9, 2024 7:21 PM Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:

Hi,

In the v7 patch, I am looping through the reorder buffer of the current committed
transaction and storing all invalidation messages in a list. Then I am
distributing those invalidations.
But I found that for a transaction we already store all the invalidation messages
(see [1]). So we don't need to loop through the reorder buffer and store the
invalidations.

I have modified the patch accordingly and attached the same.

I have tested this patch across various scenarios and did not find issues.

I confirmed that changes are correctly replicated after adding the table or
schema to the publication, and changes will not be replicated after removing
the table or schema from the publication. This behavior is consistent in both
streaming and non-streaming modes. Additionally, I verified that invalidations
occurring within subtransactions are appropriately distributed.

Please refer to the attached ISOLATION tests which tested the above cases.
This also makes me wonder if it would be cheaper to write an ISOLATION test
for this bug instead of building a real pub/sub cluster. But I am not against
the current tests in the V8 patch as they can check the replicated data in a
visible way.

Best Regards,
Hou zj

Attachments:

0001-test-invalidation-distribution.patch.txttext/plain; name=0001-test-invalidation-distribution.patch.txtDownload
From 8f4e36c5fc65d4a88058467a73cbe423a5f0e91e Mon Sep 17 00:00:00 2001
From: Hou Zhijie <houzj.fnst@cn.fujitsu.com>
Date: Mon, 9 Sep 2024 19:56:18 +0800
Subject: [PATCH] test invalidation distribution

---
 contrib/test_decoding/Makefile                |   2 +-
 .../expected/invalidation_distrubution.out    | 173 ++++++++++++++++++
 .../specs/invalidation_distrubution.spec      |  56 ++++++
 3 files changed, 230 insertions(+), 1 deletion(-)
 create mode 100644 contrib/test_decoding/expected/invalidation_distrubution.out
 create mode 100644 contrib/test_decoding/specs/invalidation_distrubution.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index a4ba1a509a..eef7077067 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -9,7 +9,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
 	twophase_snapshot slot_creation_error catalog_change_snapshot \
-	skip_snapshot_restore
+	skip_snapshot_restore invalidation_distrubution
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/invalidation_distrubution.out b/contrib/test_decoding/expected/invalidation_distrubution.out
new file mode 100644
index 0000000000..cdc871d31d
--- /dev/null
+++ b/contrib/test_decoding/expected/invalidation_distrubution.out
@@ -0,0 +1,173 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1_init s2_create_pub s1_insert_tbl1 s1_begin s1_insert_tbl1 s2_alter_pub_add_tbl s1_commit s1_insert_tbl1 s2_get_binary_changes s2_drop_pub
+step s1_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'pgoutput');
+?column?
+--------
+init    
+(1 row)
+
+step s2_create_pub: CREATE PUBLICATION pub;
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s1_begin: BEGIN;
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s2_alter_pub_add_tbl: ALTER PUBLICATION pub ADD TABLE tbl1;
+step s1_commit: COMMIT;
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s2_get_binary_changes: SELECT count(data) FROM pg_logical_slot_get_binary_changes('isolation_slot', NULL, NULL, 'proto_version', '4', 'publication_names', 'pub') WHERE get_byte(data, 0) = 73;
+count
+-----
+    1
+(1 row)
+
+step s2_drop_pub: DROP PUBLICATION pub;
+?column?
+--------
+stop    
+(1 row)
+
+
+starting permutation: s1_init s2_create_pub s1_insert_tbl1 s1_begin s1_insert_tbl1 s2_alter_pub_add_schema s1_commit s1_insert_tbl1 s2_get_binary_changes s2_drop_pub
+step s1_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'pgoutput');
+?column?
+--------
+init    
+(1 row)
+
+step s2_create_pub: CREATE PUBLICATION pub;
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s1_begin: BEGIN;
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s2_alter_pub_add_schema: ALTER PUBLICATION pub ADD TABLES IN SCHEMA public;
+step s1_commit: COMMIT;
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s2_get_binary_changes: SELECT count(data) FROM pg_logical_slot_get_binary_changes('isolation_slot', NULL, NULL, 'proto_version', '4', 'publication_names', 'pub') WHERE get_byte(data, 0) = 73;
+count
+-----
+    1
+(1 row)
+
+step s2_drop_pub: DROP PUBLICATION pub;
+?column?
+--------
+stop    
+(1 row)
+
+
+starting permutation: s1_init s2_create_pub s2_alter_pub_add_tbl s1_begin s1_insert_tbl1 s2_alter_pub_drop_tbl s1_commit s1_insert_tbl1 s2_get_binary_changes s2_drop_pub
+step s1_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'pgoutput');
+?column?
+--------
+init    
+(1 row)
+
+step s2_create_pub: CREATE PUBLICATION pub;
+step s2_alter_pub_add_tbl: ALTER PUBLICATION pub ADD TABLE tbl1;
+step s1_begin: BEGIN;
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s2_alter_pub_drop_tbl: ALTER PUBLICATION pub DROP TABLE tbl1;
+step s1_commit: COMMIT;
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s2_get_binary_changes: SELECT count(data) FROM pg_logical_slot_get_binary_changes('isolation_slot', NULL, NULL, 'proto_version', '4', 'publication_names', 'pub') WHERE get_byte(data, 0) = 73;
+count
+-----
+    1
+(1 row)
+
+step s2_drop_pub: DROP PUBLICATION pub;
+?column?
+--------
+stop    
+(1 row)
+
+
+starting permutation: s1_init s2_create_pub s2_alter_pub_add_schema s1_begin s1_insert_tbl1 s2_alter_pub_drop_schema s1_commit s1_insert_tbl1 s2_get_binary_changes s2_drop_pub
+step s1_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'pgoutput');
+?column?
+--------
+init    
+(1 row)
+
+step s2_create_pub: CREATE PUBLICATION pub;
+step s2_alter_pub_add_schema: ALTER PUBLICATION pub ADD TABLES IN SCHEMA public;
+step s1_begin: BEGIN;
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s2_alter_pub_drop_schema: ALTER PUBLICATION pub DROP TABLES IN SCHEMA public;
+step s1_commit: COMMIT;
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s2_get_binary_changes: SELECT count(data) FROM pg_logical_slot_get_binary_changes('isolation_slot', NULL, NULL, 'proto_version', '4', 'publication_names', 'pub') WHERE get_byte(data, 0) = 73;
+count
+-----
+    1
+(1 row)
+
+step s2_drop_pub: DROP PUBLICATION pub;
+?column?
+--------
+stop    
+(1 row)
+
+
+starting permutation: s1_init s2_create_pub s1_insert_tbl1 s1_begin s1_insert_tbl1 s2_begin s2_savepoint s2_alter_pub_add_tbl s2_commit s1_commit s1_insert_tbl1 s2_get_binary_changes s2_drop_pub
+step s1_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'pgoutput');
+?column?
+--------
+init    
+(1 row)
+
+step s2_create_pub: CREATE PUBLICATION pub;
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s1_begin: BEGIN;
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s2_begin: BEGIN;
+step s2_savepoint: SAVEPOINT s1;
+step s2_alter_pub_add_tbl: ALTER PUBLICATION pub ADD TABLE tbl1;
+step s2_commit: COMMIT;
+step s1_commit: COMMIT;
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s2_get_binary_changes: SELECT count(data) FROM pg_logical_slot_get_binary_changes('isolation_slot', NULL, NULL, 'proto_version', '4', 'publication_names', 'pub') WHERE get_byte(data, 0) = 73;
+count
+-----
+    1
+(1 row)
+
+step s2_drop_pub: DROP PUBLICATION pub;
+?column?
+--------
+stop    
+(1 row)
+
+
+starting permutation: s2_create_pub s1_init s1_insert_tbl1 s1_begin s1_insert_tbl1 s2_set_streaming_mode s2_alter_pub_add_tbl s2_get_binary_stream_changes s1_insert_tbl1 s1_commit s2_get_binary_stream_changes s2_drop_pub
+step s2_create_pub: CREATE PUBLICATION pub;
+step s1_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'pgoutput');
+?column?
+--------
+init    
+(1 row)
+
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s1_begin: BEGIN;
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s2_set_streaming_mode: SET debug_logical_replication_streaming = immediate;
+step s2_alter_pub_add_tbl: ALTER PUBLICATION pub ADD TABLE tbl1;
+step s2_get_binary_stream_changes: SELECT count(data) FROM pg_logical_slot_get_binary_changes('isolation_slot', NULL, NULL, 'proto_version', '4', 'publication_names', 'pub', 'streaming', 'on') WHERE get_byte(data, 0) = 73;
+count
+-----
+    0
+(1 row)
+
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s1_commit: COMMIT;
+step s2_get_binary_stream_changes: SELECT count(data) FROM pg_logical_slot_get_binary_changes('isolation_slot', NULL, NULL, 'proto_version', '4', 'publication_names', 'pub', 'streaming', 'on') WHERE get_byte(data, 0) = 73;
+count
+-----
+    1
+(1 row)
+
+step s2_drop_pub: DROP PUBLICATION pub;
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/invalidation_distrubution.spec b/contrib/test_decoding/specs/invalidation_distrubution.spec
new file mode 100644
index 0000000000..0d3d328250
--- /dev/null
+++ b/contrib/test_decoding/specs/invalidation_distrubution.spec
@@ -0,0 +1,56 @@
+setup
+{
+    DROP TABLE IF EXISTS tbl1;
+    CREATE TABLE tbl1(val1 integer, val2 integer);
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s1"
+setup { SET synchronous_commit=on; }
+
+step "s1_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'pgoutput'); }
+step "s1_begin" { BEGIN; }
+step "s1_insert_tbl1" { INSERT INTO tbl1 (val1, val2) VALUES (1, 1); }
+step "s1_commit" { COMMIT; }
+
+session "s2"
+setup { SET synchronous_commit=on; }
+
+step "s2_begin" { BEGIN; }
+step "s2_savepoint" { SAVEPOINT s1; }
+step "s2_set_streaming_mode" { SET debug_logical_replication_streaming = immediate; }
+step "s2_create_pub" { CREATE PUBLICATION pub; }
+step "s2_alter_pub_add_tbl" { ALTER PUBLICATION pub ADD TABLE tbl1; }
+step "s2_alter_pub_drop_tbl" { ALTER PUBLICATION pub DROP TABLE tbl1; }
+step "s2_alter_pub_add_schema" { ALTER PUBLICATION pub ADD TABLES IN SCHEMA public; }
+step "s2_alter_pub_drop_schema" { ALTER PUBLICATION pub DROP TABLES IN SCHEMA public; }
+step "s2_drop_pub" { DROP PUBLICATION pub; }
+
+
+step "s2_get_binary_stream_changes" { SELECT count(data) FROM pg_logical_slot_get_binary_changes('isolation_slot', NULL, NULL, 'proto_version', '4', 'publication_names', 'pub', 'streaming', 'on') WHERE get_byte(data, 0) = 73; }
+step "s2_commit" { COMMIT; }
+
+step "s2_get_binary_changes" { SELECT count(data) FROM pg_logical_slot_get_binary_changes('isolation_slot', NULL, NULL, 'proto_version', '4', 'publication_names', 'pub') WHERE get_byte(data, 0) = 73; }
+
+# Expect to get one insert change. LOGICAL_REP_MSG_INSERT = 'I'
+permutation "s1_init" "s2_create_pub" "s1_insert_tbl1" "s1_begin" "s1_insert_tbl1" "s2_alter_pub_add_tbl" "s1_commit" "s1_insert_tbl1" "s2_get_binary_changes" "s2_drop_pub"
+
+# Expect to get one insert change. LOGICAL_REP_MSG_INSERT = 'I'
+permutation "s1_init" "s2_create_pub" "s1_insert_tbl1" "s1_begin" "s1_insert_tbl1" "s2_alter_pub_add_schema" "s1_commit" "s1_insert_tbl1" "s2_get_binary_changes" "s2_drop_pub"
+
+# Expect to get one insert change. LOGICAL_REP_MSG_INSERT = 'I'
+permutation "s1_init" "s2_create_pub" "s2_alter_pub_add_tbl" "s1_begin" "s1_insert_tbl1" "s2_alter_pub_drop_tbl" "s1_commit" "s1_insert_tbl1" "s2_get_binary_changes" "s2_drop_pub"
+
+# Expect to get one insert change. LOGICAL_REP_MSG_INSERT = 'I'
+permutation "s1_init" "s2_create_pub" "s2_alter_pub_add_schema" "s1_begin" "s1_insert_tbl1" "s2_alter_pub_drop_schema" "s1_commit" "s1_insert_tbl1" "s2_get_binary_changes" "s2_drop_pub"
+
+# Expect to get one insert change. LOGICAL_REP_MSG_INSERT = 'I'
+permutation "s1_init" "s2_create_pub" "s1_insert_tbl1" "s1_begin" "s1_insert_tbl1" "s2_begin" "s2_savepoint" "s2_alter_pub_add_tbl" "s2_commit" "s1_commit" "s1_insert_tbl1" "s2_get_binary_changes" "s2_drop_pub"
+
+# Expect to get one insert change. LOGICAL_REP_MSG_INSERT = 'I'
+permutation "s2_create_pub" "s1_init" "s1_insert_tbl1" "s1_begin" "s1_insert_tbl1" "s2_set_streaming_mode" "s2_alter_pub_add_tbl" "s2_get_binary_stream_changes" "s1_insert_tbl1" "s1_commit" "s2_get_binary_stream_changes" "s2_drop_pub"
-- 
2.30.0.windows.2

#52Nitin Motiani
nitinmotiani@google.com
In reply to: Amit Kapila (#48)
Re: long-standing data loss bug in initial sync of logical replication

On Thu, Sep 5, 2024 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Sep 2, 2024 at 9:19 PM Nitin Motiani <nitinmotiani@google.com> wrote:

I think that the partial data replication for one table is a bigger
issue than the case of data being sent for a subset of the tables in
the transaction. This can lead to inconsistent data if the same row is
updated multiple times or deleted in the same transaction. In such a
case if only the partial updates from the transaction are sent to the
subscriber, it might end up with the data which was never visible on
the publisher side.

Here is an example I tried with the patch v8-001 :

I created following 2 tables on the publisher and the subscriber :

CREATE TABLE delete_test(id int primary key, name varchar(100));
CREATE TABLE update_test(id int primary key, name varchar(100));

I added both the tables to the publication p on the publisher and
created a subscription s on the subscriber.

I run 2 sessions on the publisher and do the following :

Session 1 :
BEGIN;
INSERT INTO delete_test VALUES(0, 'Nitin');

Session 2 :
ALTER PUBLICATION p DROP TABLE delete_test;

Session 1 :
DELETE FROM delete_test WHERE id=0;
COMMIT;

After the commit there should be no new row created on the publisher.
But because the partial data was replicated, this is what the select
on the subscriber shows :

SELECT * FROM delete_test;
id | name
----+-----------
0 | Nitin
(1 row)

I don't think the above is a common use case. But this is still an
issue because the subscriber has the data which never existed on the
publisher.

I don't think that is the correct conclusion because the user has
intentionally avoided sending part of the transaction changes. This
can happen in various ways without the patch as well. For example, if
the user has performed the ALTER in the same transaction.

Publisher:
=========
BEGIN
postgres=*# Insert into delete_test values(0, 'Nitin');
INSERT 0 1
postgres=*# Alter Publication pub1 drop table delete_test;
ALTER PUBLICATION
postgres=*# Delete from delete_test where id=0;
DELETE 1
postgres=*# commit;
COMMIT
postgres=# select * from delete_test;
id | name
----+------
(0 rows)

Subscriber:
=========
postgres=# select * from delete_test;
id | name
----+-------
0 | Nitin
(1 row)

This can also happen when the user has published only 'inserts' but
not 'updates' or 'deletes'.

Thanks for the clarification. I didn't think of this case. The change
seems fine if this can already happen.

Thanks & Regards
Nitin Motiani
Google

#53Shlok Kyal
shlok.kyal.oss@gmail.com
In reply to: Shlok Kyal (#49)
Re: long-standing data loss bug in initial sync of logical replication

On Mon, 9 Sept 2024 at 10:41, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:

On Mon, 2 Sept 2024 at 10:12, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Aug 30, 2024 at 3:06 PM Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:

Next I am planning to test solely on the logical decoding side and
will share the results.

Thanks, the next set of proposed tests makes sense to me. It will also
be useful to generate some worst-case scenarios where the number of
invalidations is more to see the distribution cost in such cases. For
example, Truncate/Drop a table with 100 or 1000 partitions.

--
With Regards,
Amit Kapila.

Hi,

I did some performance testing solely on the logical decoding side and
found some degradation in performance for the following test case:
1. Created a publication on a single table, say 'tab_conc1'.
2. Created a second publication on a single table, say 'tp'.
3. Two sessions are running in parallel, let's say S1 and S2.
4. Begin a transaction in S1.
5. Now in a loop (this loop runs 'count' times):
S1: Insert a row in table 'tab_conc1'
S2: BEGIN; Alter publication DROP/ADD tp; COMMIT
6. COMMIT the transaction in S1.
7. Run 'pg_logical_slot_get_binary_changes' to get the decoding changes.

Observation:
With the fix, a new entry is added to the decoded output. During
debugging I found that this entry only appears when we do an 'INSERT' in
session S1 after an 'ALTER PUBLICATION' in the other session (i.e. due to
the invalidation). I also observed that this new entry is related to
sending the replica identity, attributes, etc., as the function
'logicalrep_write_rel' is called.

Performance:
We see a performance degradation as we are sending new entries during
logical decoding. Results are an average of 5 runs.

count | Head (sec) | Fix (sec) | Degradation (%)
------------------------------------------------------------------------------
10000 | 1.298 | 1.574 | 21.26348228
50000 | 22.892 | 24.997 | 9.195352088
100000 | 88.602 | 93.759 | 5.820410374

I have also attached the test script here.

For the above case I tried to investigate the inconsistent degradation
and found out that serialization was happening for large values of
'count'. So, I tried adjusting 'logical_decoding_work_mem' to a large
value to avoid the serialization here. I ran the above performance test
again and got the following results:

count | Head (sec) | Fix (sec) | Degradation (%)
-----------------------------------------------------------------------------------
10000 | 0.415446 | 0.53596167 | 29.00874482
50000 | 7.950266 | 10.37375567 | 30.48312685
75000 | 17.192372 | 22.246715 | 29.39875312
100000 | 30.555903 | 39.431542 | 29.04721552

These results are an average of 3 runs. Here the degradation is
consistent around ~30%.
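For reference, the adjustment itself is just the usual GUC change before
running the decoding, e.g. (the value here is only an example):

ALTER SYSTEM SET logical_decoding_work_mem = '4GB';
SELECT pg_reload_conf();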

Thanks and Regards,
Shlok Kyal

#54Shlok Kyal
shlok.kyal.oss@gmail.com
In reply to: Shlok Kyal (#41)
2 attachment(s)
Re: long-standing data loss bug in initial sync of logical replication

In the v7 patch, I am looping through the reorder buffer of the
current committed transaction and storing all invalidation messages in
a list. Then I am distributing those invalidations.
But I found that for a transaction we already store all the
invalidation messages (see [1]). So we don't need to loop through the
reorder buffer and store the invalidations.

I have modified the patch accordingly and attached the same.

[1]: https://github.com/postgres/postgres/blob/7da1bdc2c2f17038f2ae1900be90a0d7b5e361e0/src/include/replication/reorderbuffer.h#L384

Hi,

I tried to add changes to selectively invalidate the cache to reduce
the performance degradation during the distribution of invalidations.

Here is the analysis for selective invalidation.
Observation:
Currently when there is a change in a publication, cache related to
all the tables is invalidated including the ones that are not part of
any publication and even tables of different publications. For
example, suppose pub1 includes tables t1 to t1000, while pub2 contains
just table t1001. If pub2 is altered, even though it only has t1001,
this change will also invalidate all the tables t1 through t1000 in
pub1.
Similarly for a namespace, whenever we alter a schema or add/drop a
schema to/from the publication, the cache for all the tables is
invalidated, including the ones that are in a different schema. For
example, suppose pub1 includes tables t1 to t1000 in schema sc1, while
pub2 contains just table t1001 in schema sc2. If schema ‘sc2’ is
changed or if it is dropped from publication ‘pub2’ even though it
only has t1001, this change will invalidate all the tables t1 through
t1000 in schema sc1.
In both of the above cases, the ‘rel_sync_cache_publication_cb’ function
is called while executing the invalidation, and it invalidates all the
tables in the cache.
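A scaled-down sketch of the scenario described above (all names are
illustrative; in the real case pub1 would contain t1 .. t1000):

CREATE TABLE t1 (a int);
CREATE TABLE t2 (a int);
CREATE TABLE t1001 (a int);

CREATE PUBLICATION pub1 FOR TABLE t1, t2;   -- stands in for t1 .. t1000
CREATE PUBLICATION pub2 FOR TABLE t1001;

-- As described above, this currently invalidates the cached entries for all
-- the tables (t1, t2, ...), even though pub2 contains only t1001.
ALTER PUBLICATION pub2 DROP TABLE t1001;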

Solution:
1. When we alter a publication using commands like ‘ALTER PUBLICATION
pub_name DROP TABLE table_name’, first all tables in the publications
are invalidated using the function ‘rel_sync_cache_relation_cb’. Then
again ‘rel_sync_cache_publication_cb’ function is called which
invalidates all the tables. This happens because of the following
callback registered:
CacheRegisterSyscacheCallback(PUBLICATIONRELMAP,
rel_sync_cache_publication_cb, (Datum) 0);

So, I feel this second function call can be avoided. And I have
included changes for the same in the patch. Now the behavior will be
as:
suppose pub1 includes tables t1 to t1000, while pub2 contains just
table t1001. If pub2 is altered, it will only invalidate t1001.

2. When we add/drop a schema to/from a publication using command like
‘ALTER PUBLICATION pub_name ADD TABLES in SCHEMA schema_name’, first
all tables in that schema are invalidated using
‘rel_sync_cache_relation_cb’ and then again
‘rel_sync_cache_publication_cb’ function is called which invalidates
all the tables. This happens because of the following callback
registered:
CacheRegisterSyscacheCallback(PUBLICATIONNAMESPACEMAP,
rel_sync_cache_publication_cb, (Datum) 0);

So, I feel this second function call can be avoided. And I have
included changes for the same in the patch. Now the behavior will be
as:
suppose pub1 includes tables t1 to t1000 in schema sc1, while pub2
contains just table t1001 in schema sc2. If schema ‘sc2’ dropped from
publication ‘pub2’, it will only invalidate table t1001.

3. When we alter a namespace using command like ‘ALTER SCHEMA
schema_name RENAME to new_schema_name’ all the table in cache are
invalidated as ‘rel_sync_cache_publication_cb’ is called due to the
following registered callback:
CacheRegisterSyscacheCallback(NAMESPACEOID,
rel_sync_cache_publication_cb, (Datum) 0);

So, we added a new callback function ‘rel_sync_cache_namespacerel_cb’,
which will be called instead of ‘rel_sync_cache_publication_cb’ and
invalidates only the cache entries of the tables that are part of that
particular namespace. For the new function, the ‘namespace id’ is added
to the invalidation message.

For example, if namespace ‘sc1’ has tables t1 and t2 and namespace ‘sc2’
has table t3, then renaming namespace ‘sc1’ to ‘sc_new’ invalidates only
the tables in sc1, i.e. t1 and t2.
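In SQL terms, that scenario is:

CREATE SCHEMA sc1;
CREATE SCHEMA sc2;
CREATE TABLE sc1.t1 (a int);
CREATE TABLE sc1.t2 (a int);
CREATE TABLE sc2.t3 (a int);

-- With the selective invalidation change, only the cached entries for sc1.t1
-- and sc1.t2 are invalidated here; sc2.t3 is left untouched.
ALTER SCHEMA sc1 RENAME TO sc_new;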

Performance Comparison:
I have run the same tests as shared in [1]/messages/by-id/CANhcyEW4pq6+PO_eFn2q=23sgV1budN3y4SxpYBaKMJNADSDuA@mail.gmail.com and observed a significant
decrease in the degradation with the new changes. With selective
invalidation the degradation is around ~5%. These results are an average
of 3 runs.

count | Head (sec) | Fix (sec) | Degradation (%)
-----------------------------------------------------------------------------------------
10000 | 0.38842567 | 0.405057 | 4.281727827
50000 | 7.22018834 | 7.605011334 | 5.329819333
75000 | 15.627181 | 16.38659034 | 4.859541462
100000 | 27.37910867 | 28.8636873 | 5.422304458

I have attached the patches for the same:
v9-0001: distribute invalidations to in-progress transactions
v9-0002: selective invalidation

[1]: /messages/by-id/CANhcyEW4pq6+PO_eFn2q=23sgV1budN3y4SxpYBaKMJNADSDuA@mail.gmail.com

Thanks and Regards,
Shlok Kyal

Attachments:

v9-0001-Distribute-invalidatons-if-change-in-catalog-tabl.patchapplication/x-patch; name=v9-0001-Distribute-invalidatons-if-change-in-catalog-tabl.patchDownload
From 4222dca86e4892fbae6698ed7a6135f61d499d8f Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Fri, 23 Aug 2024 14:02:20 +0530
Subject: [PATCH v9 1/2] Distribute invalidatons if change in catalog tables

Distribute invalidations to inprogress transactions if the current
committed transaction change any catalog table.
---
 .../replication/logical/reorderbuffer.c       |   5 +-
 src/backend/replication/logical/snapbuild.c   |  34 +++--
 src/include/replication/reorderbuffer.h       |   4 +
 src/test/subscription/t/100_bugs.pl           | 128 ++++++++++++++++++
 4 files changed, 157 insertions(+), 14 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 22bcf171ff..c5dfc1ab06 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -221,9 +221,6 @@ int			debug_logical_replication_streaming = DEBUG_LOGICAL_REP_STREAMING_BUFFERED
  */
 static ReorderBufferTXN *ReorderBufferGetTXN(ReorderBuffer *rb);
 static void ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static ReorderBufferTXN *ReorderBufferTXNByXid(ReorderBuffer *rb,
-											   TransactionId xid, bool create, bool *is_new,
-											   XLogRecPtr lsn, bool create_as_top);
 static void ReorderBufferTransferSnapToParent(ReorderBufferTXN *txn,
 											  ReorderBufferTXN *subtxn);
 
@@ -622,7 +619,7 @@ ReorderBufferReturnRelids(ReorderBuffer *rb, Oid *relids)
  * (with the given LSN, and as top transaction if that's specified);
  * when this happens, is_new is set to true.
  */
-static ReorderBufferTXN *
+ReorderBufferTXN *
 ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 					  bool *is_new, XLogRecPtr lsn, bool create_as_top)
 {
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 0450f94ba8..42c947651b 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -300,7 +300,7 @@ static void SnapBuildFreeSnapshot(Snapshot snap);
 
 static void SnapBuildSnapIncRefcount(Snapshot snap);
 
-static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
+static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid);
 
 static inline bool SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid,
 												 uint32 xinfo);
@@ -859,18 +859,21 @@ SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
 }
 
 /*
- * Add a new Snapshot to all transactions we're decoding that currently are
- * in-progress so they can see new catalog contents made by the transaction
- * that just committed. This is necessary because those in-progress
- * transactions will use the new catalog's contents from here on (at the very
- * least everything they do needs to be compatible with newer catalog
- * contents).
+ * Add a new Snapshot and invalidation messages to all transactions we're
+ * decoding that currently are in-progress so they can see new catalog contents
+ * made by the transaction that just committed. This is necessary because those
+ * in-progress transactions will use the new catalog's contents from here on
+ * (at the very least everything they do needs to be compatible with newer
+ * catalog contents).
  */
 static void
-SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
+SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid)
 {
 	dlist_iter	txn_i;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *curr_txn;
+
+	curr_txn = ReorderBufferTXNByXid(builder->reorder, xid, false, NULL, InvalidXLogRecPtr, false);
 
 	/*
 	 * Iterate through all toplevel transactions. This can include
@@ -913,6 +916,14 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		SnapBuildSnapIncRefcount(builder->snapshot);
 		ReorderBufferAddSnapshot(builder->reorder, txn->xid, lsn,
 								 builder->snapshot);
+
+		/*
+		 * Add invalidation messages to the reorder buffer of inprogress
+		 * transactions except the current committed transaction
+		 */
+		if (txn->xid != xid && curr_txn->ninvalidations > 0)
+			ReorderBufferAddInvalidations(builder->reorder, txn->xid, lsn,
+										  curr_txn->ninvalidations, curr_txn->invalidations);
 	}
 }
 
@@ -1184,8 +1195,11 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/* refcount of the snapshot builder for the new snapshot */
 		SnapBuildSnapIncRefcount(builder->snapshot);
 
-		/* add a new catalog snapshot to all currently running transactions */
-		SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
+		/*
+		 * add a new catalog snapshot and invalidations messages to all
+		 * currently running transactions
+		 */
+		SnapBuildDistributeNewCatalogSnapshot(builder, lsn, xid);
 	}
 }
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index e332635f70..093d21213a 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -743,6 +743,10 @@ extern TransactionId *ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb);
 
 extern void ReorderBufferSetRestartPoint(ReorderBuffer *rb, XLogRecPtr ptr);
 
+extern ReorderBufferTXN *ReorderBufferTXNByXid(ReorderBuffer *rb,
+											   TransactionId xid, bool create, bool *is_new,
+											   XLogRecPtr lsn, bool create_as_top);
+
 extern void StartupReorderBuffer(void);
 
 #endif
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index cb36ca7b16..85d5c0d016 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -487,6 +487,134 @@ $result =
 is( $result, qq(2|f
 3|t), 'check replicated update on subscriber');
 
+# Clean up
+$node_publisher->safe_psql('postgres', qq(DROP  PUBLICATION pub1;));
+$node_subscriber->safe_psql('postgres', qq(DROP  SUBSCRIPTION sub1;));
+
+# The bug was that the incremental data synchronization was being skipped when
+# a new table is added to the publication in presence of a concurrent active
+# transaction performing the DML on the same table.
+
+# Initial setup.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE SCHEMA sch3;
+	CREATE TABLE sch3.tab_conc(a int);
+	CREATE PUBLICATION regress_pub1;
+));
+
+$node_subscriber->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE SCHEMA sch3;
+	CREATE TABLE sch3.tab_conc(a int);
+	CREATE SUBSCRIPTION regress_sub1 CONNECTION '$publisher_connstr' PUBLICATION regress_pub1;
+));
+
+# Bump the query timeout to avoid false negatives on slow test systems.
+my $psql_timeout_secs = 4 * $PostgreSQL::Test::Utils::timeout_default;
+
+# Initiate 3 background sessions.
+my $background_psql1 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+$background_psql1->set_query_timer_restart();
+
+my $background_psql2 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+
+$background_psql2->set_query_timer_restart();
+
+my $background_psql3 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+$background_psql3->set_query_timer_restart();
+
+# Maintain an active transaction with the table that will be added to the
+# publication.
+$background_psql1->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO tab_conc VALUES (1);
+]);
+
+# Maintain an active transaction with a schema table that will be added to the
+# publication.
+$background_psql2->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (1);
+]);
+
+# Add the table to the publication using background_psql, as the alter
+# publication operation will wait for the lock and can only be completed after the open transactions are committed.
+$background_psql3->query_safe(
+	qq[
+	ALTER PUBLICATION regress_pub1 ADD TABLE tab_conc, TABLES IN SCHEMA sch3;
+]);
+
+# Complete the transaction on the tables, so that ALTER PUBLICATION can proceed
+$background_psql1->query_safe(qq[COMMIT]);
+$background_psql2->query_safe(qq[COMMIT]);
+
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (2);
+	INSERT INTO sch3.tab_conc VALUES (2);
+));
+
+# Refresh the publication.
+$node_subscriber->safe_psql('postgres',
+	'ALTER SUBSCRIPTION regress_sub1 REFRESH PUBLICATION');
+
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'regress_sub1');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2),
+	'Ensure that the data from the tab_conc table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2),
+	'Ensure that the data from the sch3.tab_conc table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+# Perform an insert.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (3);
+	INSERT INTO sch3.tab_conc VALUES (3);
+));
+$node_publisher->wait_for_catchup('regress_sub1');
+
+# Verify that the insert is replicated to the subscriber.
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table sch3.tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+$background_psql1->quit;
+$background_psql2->quit;
+$background_psql3->quit;
+
 $node_publisher->stop('fast');
 $node_subscriber->stop('fast');
 
-- 
2.34.1

v9-0002-Add-Selective-Invalidation-of-Cache.patch (application/x-patch)
From 7ce19b7dafdf659a9689856211e81c501dfe498f Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Wed, 25 Sep 2024 11:41:42 +0530
Subject: [PATCH v9 2/2] Add Selective Invalidation of Cache

When we alter a publication, add/drop namespace to/from publication,
alter a namespace all the cache for all the tables are invalidated. With
this patch for the above operationns we will invalidate the cache of
only the desired tables.
---
 src/backend/replication/pgoutput/pgoutput.c |  52 ++++----
 src/backend/utils/cache/inval.c             | 127 +++++++++++++++++++-
 src/include/storage/sinval.h                |   9 ++
 src/include/utils/inval.h                   |   4 +
 src/test/subscription/t/100_bugs.pl         |  91 ++++++++++++++
 5 files changed, 253 insertions(+), 30 deletions(-)

diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 00e7024563..ba480e7e48 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -126,6 +126,8 @@ typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
 
+	Oid			schemaid;		/* schema oid */
+
 	bool		replicate_valid;	/* overall validity flag for entry */
 
 	bool		schema_sent;
@@ -216,6 +218,7 @@ static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data,
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
+static void rel_sync_cache_namespacerel_cb(Datum arg, int nspid);
 static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
 											TransactionId xid);
 static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
@@ -1739,12 +1742,6 @@ static void
 publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 {
 	publications_valid = false;
-
-	/*
-	 * Also invalidate per-relation cache so that next time the filtering info
-	 * is checked it will be updated with the new publication settings.
-	 */
-	rel_sync_cache_publication_cb(arg, cacheid, hashvalue);
 }
 
 /*
@@ -1911,26 +1908,7 @@ init_rel_sync_cache(MemoryContext cachectx)
 
 	/* We must update the cache entry for a relation after a relcache flush */
 	CacheRegisterRelcacheCallback(rel_sync_cache_relation_cb, (Datum) 0);
-
-	/*
-	 * Flush all cache entries after a pg_namespace change, in case it was a
-	 * schema rename affecting a relation being replicated.
-	 */
-	CacheRegisterSyscacheCallback(NAMESPACEOID,
-								  rel_sync_cache_publication_cb,
-								  (Datum) 0);
-
-	/*
-	 * Flush all cache entries after any publication changes.  (We need no
-	 * callback entry for pg_publication, because publication_invalidation_cb
-	 * will take care of it.)
-	 */
-	CacheRegisterSyscacheCallback(PUBLICATIONRELMAP,
-								  rel_sync_cache_publication_cb,
-								  (Datum) 0);
-	CacheRegisterSyscacheCallback(PUBLICATIONNAMESPACEMAP,
-								  rel_sync_cache_publication_cb,
-								  (Datum) 0);
+	CacheRegisterNspcacheCallback(rel_sync_cache_namespacerel_cb, (Datum) 0);
 
 	relation_callbacks_registered = true;
 }
@@ -2076,6 +2054,8 @@ get_rel_sync_entry(PGOutputData *data, Relation relation)
 		entry->estate = NULL;
 		memset(entry->exprstate, 0, sizeof(entry->exprstate));
 
+		entry->schemaid = schemaId;
+
 		/*
 		 * Build publication cache. We can't use one provided by relcache as
 		 * relcache considers all publications that the given relation is in,
@@ -2349,6 +2329,26 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 	}
 }
 
+/*
+ * Namespace invalidation callback
+ */
+static void
+rel_sync_cache_namespacerel_cb(Datum arg, int nspid)
+{
+	HASH_SEQ_STATUS status;
+	RelationSyncEntry *entry;
+
+	if (RelationSyncCache == NULL)
+		return;
+
+	hash_seq_init(&status, RelationSyncCache);
+	while ((entry = (RelationSyncEntry *) hash_seq_search(&status)) != NULL)
+	{
+		if (entry->replicate_valid && entry->schemaid == nspid)
+			entry->replicate_valid = false;
+	}
+}
+
 /* Send Replication origin */
 static void
 send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 603aa4157b..fc0d91aec9 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -114,6 +114,7 @@
 #include "access/xact.h"
 #include "access/xloginsert.h"
 #include "catalog/catalog.h"
+#include "catalog/pg_namespace.h"
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
@@ -160,6 +161,9 @@
  */
 #define CatCacheMsgs 0
 #define RelCacheMsgs 1
+#define NspCacheMsgs 2
+
+#define NumberofCache 3
 
 /* Pointers to main arrays in TopTransactionContext */
 typedef struct InvalMessageArray
@@ -168,13 +172,13 @@ typedef struct InvalMessageArray
 	int			maxmsgs;		/* current allocated size of array */
 } InvalMessageArray;
 
-static InvalMessageArray InvalMessageArrays[2];
+static InvalMessageArray InvalMessageArrays[NumberofCache];
 
 /* Control information for one logical group of messages */
 typedef struct InvalidationMsgsGroup
 {
-	int			firstmsg[2];	/* first index in relevant array */
-	int			nextmsg[2];		/* last+1 index */
+	int			firstmsg[NumberofCache];	/* first index in relevant array */
+	int			nextmsg[NumberofCache];		/* last+1 index */
 } InvalidationMsgsGroup;
 
 /* Macros to help preserve InvalidationMsgsGroup abstraction */
@@ -189,6 +193,7 @@ typedef struct InvalidationMsgsGroup
 	do { \
 		SetSubGroupToFollow(targetgroup, priorgroup, CatCacheMsgs); \
 		SetSubGroupToFollow(targetgroup, priorgroup, RelCacheMsgs); \
+		SetSubGroupToFollow(targetgroup, priorgroup, NspCacheMsgs);	\
 	} while (0)
 
 #define NumMessagesInSubGroup(group, subgroup) \
@@ -196,7 +201,8 @@ typedef struct InvalidationMsgsGroup
 
 #define NumMessagesInGroup(group) \
 	(NumMessagesInSubGroup(group, CatCacheMsgs) + \
-	 NumMessagesInSubGroup(group, RelCacheMsgs))
+	 NumMessagesInSubGroup(group, RelCacheMsgs) + \
+	 NumMessagesInSubGroup(group, NspCacheMsgs))
 
 
 /*----------------
@@ -251,6 +257,7 @@ int			debug_discard_caches = 0;
 
 #define MAX_SYSCACHE_CALLBACKS 64
 #define MAX_RELCACHE_CALLBACKS 10
+#define MAX_NSPCACHE_CALLBACKS 10
 
 static struct SYSCACHECALLBACK
 {
@@ -270,7 +277,14 @@ static struct RELCACHECALLBACK
 	Datum		arg;
 }			relcache_callback_list[MAX_RELCACHE_CALLBACKS];
 
+static struct NSPCACHECALLBACK
+{
+	NspcacheCallbackFunction function;
+	Datum		arg;
+}			nspcache_callback_list[MAX_NSPCACHE_CALLBACKS];
+
 static int	relcache_callback_count = 0;
+static int	nspcache_callback_count = 0;
 
 /* ----------------------------------------------------------------
  *				Invalidation subgroup support functions
@@ -464,6 +478,35 @@ AddRelcacheInvalidationMessage(InvalidationMsgsGroup *group,
 	AddInvalidationMessage(group, RelCacheMsgs, &msg);
 }
 
+static void
+AddNspcacheInvalidationMessage(InvalidationMsgsGroup *group,
+							   Oid dbId, Oid nspId)
+{
+	SharedInvalidationMessage msg;
+
+	/*
+	 * Don't add a duplicate item. We assume dbId need not be checked because
+	 * it will never change. InvalidOid for relId means all relations so we
+	 * don't need to add individual ones when it is present.
+	 */
+
+	ProcessMessageSubGroup(group, NspCacheMsgs,
+						   if (msg->nc.id == SHAREDINVALNSPCACHE_ID &&
+							   (msg->nc.nspId == nspId ||
+								msg->nc.nspId == InvalidOid))
+						   return);
+
+
+	/* OK, add the item */
+	msg.nc.id = SHAREDINVALNSPCACHE_ID;
+	msg.nc.dbId = dbId;
+	msg.nc.nspId = nspId;
+	/* check AddCatcacheInvalidationMessage() for an explanation */
+	VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg));
+
+	AddInvalidationMessage(group, NspCacheMsgs, &msg);
+}
+
 /*
  * Add a snapshot inval entry
  *
@@ -502,6 +545,7 @@ AppendInvalidationMessages(InvalidationMsgsGroup *dest,
 {
 	AppendInvalidationMessageSubGroup(dest, src, CatCacheMsgs);
 	AppendInvalidationMessageSubGroup(dest, src, RelCacheMsgs);
+	AppendInvalidationMessageSubGroup(dest, src, NspCacheMsgs);
 }
 
 /*
@@ -516,6 +560,7 @@ ProcessInvalidationMessages(InvalidationMsgsGroup *group,
 {
 	ProcessMessageSubGroup(group, CatCacheMsgs, func(msg));
 	ProcessMessageSubGroup(group, RelCacheMsgs, func(msg));
+	ProcessMessageSubGroup(group, NspCacheMsgs, func(msg));
 }
 
 /*
@@ -528,6 +573,7 @@ ProcessInvalidationMessagesMulti(InvalidationMsgsGroup *group,
 {
 	ProcessMessageSubGroupMulti(group, CatCacheMsgs, func(msgs, n));
 	ProcessMessageSubGroupMulti(group, RelCacheMsgs, func(msgs, n));
+	ProcessMessageSubGroupMulti(group, NspCacheMsgs, func(msgs, n));
 }
 
 /* ----------------------------------------------------------------
@@ -590,6 +636,18 @@ RegisterRelcacheInvalidation(Oid dbId, Oid relId)
 		transInvalInfo->RelcacheInitFileInval = true;
 }
 
+/*
+ * RegisterNspcacheInvalidation
+ *
+ * As above, but register a namespace invalidation event.
+ */
+static void
+RegisterNspcacheInvalidation(Oid dbId, Oid nspId)
+{
+	AddNspcacheInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
+								   dbId, nspId);
+}
+
 /*
  * RegisterSnapshotInvalidation
  *
@@ -660,6 +718,8 @@ PrepareInvalidationState(void)
 		InvalMessageArrays[CatCacheMsgs].maxmsgs = 0;
 		InvalMessageArrays[RelCacheMsgs].msgs = NULL;
 		InvalMessageArrays[RelCacheMsgs].maxmsgs = 0;
+		InvalMessageArrays[NspCacheMsgs].msgs = NULL;
+		InvalMessageArrays[NspCacheMsgs].maxmsgs = 0;
 	}
 
 	transInvalInfo = myInfo;
@@ -773,6 +833,20 @@ LocalExecuteInvalidationMessage(SharedInvalidationMessage *msg)
 		else if (msg->sn.dbId == MyDatabaseId)
 			InvalidateCatalogSnapshot();
 	}
+	else if (msg->id == SHAREDINVALNSPCACHE_ID)
+	{
+		if (msg->nc.dbId == MyDatabaseId || msg->nc.dbId == InvalidOid)
+		{
+			int			i;
+
+			for (i = 0; i < nspcache_callback_count; i++)
+			{
+				struct NSPCACHECALLBACK *ncitem = nspcache_callback_list + i;
+
+				ncitem->function(ncitem->arg, msg->nc.nspId);
+			}
+		}
+	}
 	else
 		elog(FATAL, "unrecognized SI message ID: %d", msg->id);
 }
@@ -944,6 +1018,18 @@ xactGetCommittedInvalidationMessages(SharedInvalidationMessage **msgs,
 										msgs,
 										n * sizeof(SharedInvalidationMessage)),
 								 nmsgs += n));
+	ProcessMessageSubGroupMulti(&transInvalInfo->PriorCmdInvalidMsgs,
+								NspCacheMsgs,
+								(memcpy(msgarray + nmsgs,
+										msgs,
+										n * sizeof(SharedInvalidationMessage)),
+								 nmsgs += n));
+	ProcessMessageSubGroupMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+								NspCacheMsgs,
+								(memcpy(msgarray + nmsgs,
+										msgs,
+										n * sizeof(SharedInvalidationMessage)),
+								 nmsgs += n));
 	Assert(nmsgs == nummsgs);
 
 	return nmsgs;
@@ -1312,6 +1398,17 @@ CacheInvalidateHeapTuple(Relation relation,
 		else
 			return;
 	}
+	else if (tupleRelId == NamespaceRelationId)
+	{
+		Form_pg_namespace nsptup = (Form_pg_namespace) GETSTRUCT(tuple);
+
+		/* get namespace id */
+		relationId = nsptup->oid;
+		databaseId = MyDatabaseId;
+
+		RegisterNspcacheInvalidation(databaseId, relationId);
+		return;
+	}
 	else
 		return;
 
@@ -1567,6 +1664,25 @@ CacheRegisterRelcacheCallback(RelcacheCallbackFunction func,
 	++relcache_callback_count;
 }
 
+/*
+ * CacheRegisterNspcacheCallback
+ *		Register the specified function to be called for all future
+ *		namespace invalidation events.  The OID of the namespace being
+ *		invalidated will be passed to the function.
+ */
+void
+CacheRegisterNspcacheCallback(NspcacheCallbackFunction func,
+							  Datum arg)
+{
+	if (nspcache_callback_count >= MAX_NSPCACHE_CALLBACKS)
+		elog(FATAL, "out of nspcache_callback_list slots");
+
+	nspcache_callback_list[nspcache_callback_count].function = func;
+	nspcache_callback_list[nspcache_callback_count].arg = arg;
+
+	++nspcache_callback_count;
+}
+
 /*
  * CallSyscacheCallbacks
  *
@@ -1629,6 +1745,9 @@ LogLogicalInvalidations(void)
 		ProcessMessageSubGroupMulti(group, RelCacheMsgs,
 									XLogRegisterData((char *) msgs,
 													 n * sizeof(SharedInvalidationMessage)));
+		ProcessMessageSubGroupMulti(group, NspCacheMsgs,
+									XLogRegisterData((char *) msgs,
+													 n * sizeof(SharedInvalidationMessage)));
 		XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
 	}
 }
diff --git a/src/include/storage/sinval.h b/src/include/storage/sinval.h
index 8f5744b21b..4c53012528 100644
--- a/src/include/storage/sinval.h
+++ b/src/include/storage/sinval.h
@@ -110,6 +110,14 @@ typedef struct
 	Oid			relId;			/* relation ID */
 } SharedInvalSnapshotMsg;
 
+#define SHAREDINVALNSPCACHE_ID	(-6)
+typedef struct
+{
+	int8		id;				/* type field --- must be first */
+	Oid			dbId;			/* database ID, or 0 if a shared relation */
+	Oid			nspId;			/* namespace ID */
+} SharedInvalNspcacheMsg;
+
 typedef union
 {
 	int8		id;				/* type field --- must be first */
@@ -119,6 +127,7 @@ typedef union
 	SharedInvalSmgrMsg sm;
 	SharedInvalRelmapMsg rm;
 	SharedInvalSnapshotMsg sn;
+	SharedInvalNspcacheMsg nc;
 } SharedInvalidationMessage;
 
 
diff --git a/src/include/utils/inval.h b/src/include/utils/inval.h
index 24695facf2..99a0c90b6d 100644
--- a/src/include/utils/inval.h
+++ b/src/include/utils/inval.h
@@ -22,6 +22,7 @@ extern PGDLLIMPORT int debug_discard_caches;
 
 typedef void (*SyscacheCallbackFunction) (Datum arg, int cacheid, uint32 hashvalue);
 typedef void (*RelcacheCallbackFunction) (Datum arg, Oid relid);
+typedef void (*NspcacheCallbackFunction) (Datum arg, Oid nspid);
 
 
 extern void AcceptInvalidationMessages(void);
@@ -59,6 +60,9 @@ extern void CacheRegisterSyscacheCallback(int cacheid,
 extern void CacheRegisterRelcacheCallback(RelcacheCallbackFunction func,
 										  Datum arg);
 
+extern void CacheRegisterNspcacheCallback(NspcacheCallbackFunction func,
+										  Datum arg);
+
 extern void CallSyscacheCallbacks(int cacheid, uint32 hashvalue);
 
 extern void InvalidateSystemCaches(void);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 85d5c0d016..e038dd8a87 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -611,6 +611,97 @@ is( $result, qq(1
 	'Verify that the incremental data for table sch3.tab_conc added after table synchronization is replicated to the subscriber'
 );
 
+$background_psql1->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO tab_conc VALUES (4);
+]);
+
+# Maintain an active transaction with a schema table that will be added to the
+# publication.
+$background_psql2->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (4);
+]);
+
+$background_psql3->query_safe(
+	qq[
+	ALTER PUBLICATION regress_pub1 DROP TABLE tab_conc, TABLES IN SCHEMA sch3;
+]);
+
+$background_psql1->query_safe(qq[COMMIT]);
+$background_psql2->query_safe(qq[COMMIT]);
+
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (5);
+	INSERT INTO sch3.tab_conc VALUES (5);
+));
+
+$node_publisher->wait_for_catchup('regress_sub1');
+
+# Verify that the insert is replicated to the subscriber.
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3
+4),
+	'Verify that the incremental data for table tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2
+3
+4),
+	'Verify that the incremental data for table sch3.tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+$background_psql3->query_safe(
+	qq[
+	ALTER PUBLICATION regress_pub1 ADD TABLE tab_conc, TABLES IN SCHEMA sch3;
+]);
+
+$background_psql1->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO tab_conc VALUES (6);
+]);
+
+$background_psql2->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (6);
+]);
+
+$background_psql3->query_safe(
+	qq[
+	DROP PUBLICATION regress_pub1;
+]);
+
+$background_psql1->query_safe(
+	qq[
+	INSERT INTO tab_conc VALUES (7);
+]);
+
+$background_psql2->query_safe(
+	qq[
+	INSERT INTO sch3.tab_conc VALUES (7);
+]);
+
+$background_psql1->query_safe(qq[COMMIT]);
+$background_psql2->query_safe(qq[COMMIT]);
+
+my $offset = -s $node_subscriber->logfile;
+$node_subscriber->wait_for_log(
+	qr/ERROR:  publication "regress_pub1" does not exist/,
+	$offset);
+
+$node_subscriber->safe_psql('postgres',
+	'DROP SUBSCRIPTION regress_sub1;');
+
 $background_psql1->quit;
 $background_psql2->quit;
 $background_psql3->quit;
-- 
2.34.1

#55Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Shlok Kyal (#54)
RE: long-standing data loss bug in initial sync of logical replication

Dear Shlok,

Hi,

I tried to add changes to selectively invalidate the cache to reduce
the performance degradation during the distribution of invalidations.

Thanks for improving the patch!

...

Solution:
1. When we alter a publication using commands like ‘ALTER PUBLICATION
pub_name DROP TABLE table_name’, first all tables in the publications
are invalidated using the function ‘rel_sync_cache_relation_cb’. Then
again ‘rel_sync_cache_publication_cb’ function is called which
invalidates all the tables.

In my environment, rel_sync_cache_publication_cb() was called first and invalidated
all the entries; then rel_sync_cache_relation_cb() was called and the specified
entry was invalidated - hence the second call is a no-op.

This happens because of the following
callback registered:
CacheRegisterSyscacheCallback(PUBLICATIONRELMAP,
rel_sync_cache_publication_cb, (Datum) 0);

But even in this case, I could understand that you want to remove the
rel_sync_cache_publication_cb() callback.

2. When we add/drop a schema to/from a publication using command like
‘ALTER PUBLICATION pub_name ADD TABLES in SCHEMA schema_name’, first
all tables in that schema are invalidated using
‘rel_sync_cache_relation_cb’ and then again
‘rel_sync_cache_publication_cb’ function is called which invalidates
all the tables.

Even in this case, rel_sync_cache_publication_cb() was called first and then
rel_sync_cache_relation_cb().

3. When we alter a namespace using command like ‘ALTER SCHEMA
schema_name RENAME TO new_schema_name’, all the tables in the cache are
invalidated as ‘rel_sync_cache_publication_cb’ is called due to the
following registered callback:
CacheRegisterSyscacheCallback(NAMESPACEOID,
rel_sync_cache_publication_cb, (Datum) 0);

So, we added a new callback function, ‘rel_sync_cache_namespacerel_cb’,
which is called instead of ‘rel_sync_cache_publication_cb’ and
invalidates only the cache of the tables that are part of that
particular namespace. For the new function, the ‘namespace id’ is added
to the invalidation message.

Hmm, I feel this fix is too much. Unlike ALTER PUBLICATION statements, I think
ALTER SCHEMA is rarely executed in production. However, this approach requires
adding a new cache callback system, which affects the entire postgres system;
the benefit does not seem worth that cost. It should be discussed on another
thread to involve more people, and then we can add the improvement after it is
accepted.

Performance Comparison:
I have run the same tests as shared in [1] and observed a significant
decrease in the degradation with the new changes. With selective
invalidation, the degradation is around ~5%. These results are an average of
3 runs.

IIUC, the executed workload did not contain any ALTER SCHEMA commands, so the
third improvement did not contribute to this gain.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

#56Shlok Kyal
shlok.kyal.oss@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#55)
2 attachment(s)
Re: long-standing data loss bug in initial sync of logical replication

Hi Kuroda-san,

Thanks for reviewing the patch.

Solution:
1. When we alter a publication using commands like ‘ALTER PUBLICATION
pub_name DROP TABLE table_name’, first all tables in the publications
are invalidated using the function ‘rel_sync_cache_relation_cb’. Then
again ‘rel_sync_cache_publication_cb’ function is called which
invalidates all the tables.

In my environment, rel_sync_cache_publication_cb() was called first and invalidated
all the entries; then rel_sync_cache_relation_cb() was called and the specified
entry was invalidated - hence the second call is a no-op.

You are correct; I made a silly mistake in the write-up.
rel_sync_cache_publication_cb() is called first and invalidates all the
entries, then rel_sync_cache_relation_cb() is called and the specified
entry is invalidated.

This happens because of the following
callback registered:
CacheRegisterSyscacheCallback(PUBLICATIONRELMAP,
rel_sync_cache_publication_cb, (Datum) 0);

But even in this case, I could understand that you want to remove the
rel_sync_cache_publication_cb() callback.

Yes, I think the rel_sync_cache_publication_cb() callback can be removed,
as it invalidates all the other tables as well (ones that are not in
this publication).

2. When we add/drop a schema to/from a publication using command like
‘ALTER PUBLICATION pub_name ADD TABLES in SCHEMA schema_name’, first
all tables in that schema are invalidated using
‘rel_sync_cache_relation_cb’ and then again
‘rel_sync_cache_publication_cb’ function is called which invalidates
all the tables.

Even in this case, rel_sync_cache_publication_cb() was called first and then
rel_sync_cache_relation_cb().

Yes, your observation is correct. rel_sync_cache_publication_cb() is
called first and then rel_sync_cache_relation_cb().

3. When we alter a namespace using command like ‘ALTER SCHEMA
schema_name RENAME TO new_schema_name’, all the tables in the cache are
invalidated as ‘rel_sync_cache_publication_cb’ is called due to the
following registered callback:
CacheRegisterSyscacheCallback(NAMESPACEOID,
rel_sync_cache_publication_cb, (Datum) 0);

So, we added a new callback function, ‘rel_sync_cache_namespacerel_cb’,
which is called instead of ‘rel_sync_cache_publication_cb’ and
invalidates only the cache of the tables that are part of that
particular namespace. For the new function, the ‘namespace id’ is added
to the invalidation message.

Hmm, I feel this fix is too much. Unlike ALTER PUBLICATION statements, I think
ALTER SCHEMA is rarely executed in production. However, this approach requires
adding a new cache callback system, which affects the entire postgres system;
the benefit does not seem worth that cost. It should be discussed on another
thread to involve more people, and then we can add the improvement after it is
accepted.

Yes, I also agree with you. I have removed the changes in the updated patch.

Performance Comparison:
I have run the same tests as shared in [1] and observed a significant
decrease in the degradation with the new changes. With selective
invalidation, the degradation is around ~5%. These results are an average of
3 runs.

IIUC, the executed workload did not contain any ALTER SCHEMA commands, so the
third improvement did not contribute to this gain.

I have removed the changes corresponding to the third improvement.

I have addressed the comment on the 0002 patch and attached the patches.
Also, I have moved the tests from the 0002 patch to the 0001 patch.

Thanks and Regards,
Shlok Kyal

Attachments:

v10-0002-Selective-Invalidation-of-Cache.patch (application/octet-stream)
From f671f13607ef87c413cecefc26f2eb5b6a81faef Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Fri, 27 Sep 2024 16:04:54 +0530
Subject: [PATCH v10 2/2] Selective Invalidation of Cache

When we alter a publication, add/drop namespace to/from publication
all the cache for all the tables are invalidated.
With this patch for the above operationns we will invalidate the
cache of only the desired tables.
---
 src/backend/replication/pgoutput/pgoutput.c | 18 ------------------
 1 file changed, 18 deletions(-)

diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 00e7024563..b8429be8cf 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -1739,12 +1739,6 @@ static void
 publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 {
 	publications_valid = false;
-
-	/*
-	 * Also invalidate per-relation cache so that next time the filtering info
-	 * is checked it will be updated with the new publication settings.
-	 */
-	rel_sync_cache_publication_cb(arg, cacheid, hashvalue);
 }
 
 /*
@@ -1920,18 +1914,6 @@ init_rel_sync_cache(MemoryContext cachectx)
 								  rel_sync_cache_publication_cb,
 								  (Datum) 0);
 
-	/*
-	 * Flush all cache entries after any publication changes.  (We need no
-	 * callback entry for pg_publication, because publication_invalidation_cb
-	 * will take care of it.)
-	 */
-	CacheRegisterSyscacheCallback(PUBLICATIONRELMAP,
-								  rel_sync_cache_publication_cb,
-								  (Datum) 0);
-	CacheRegisterSyscacheCallback(PUBLICATIONNAMESPACEMAP,
-								  rel_sync_cache_publication_cb,
-								  (Datum) 0);
-
 	relation_callbacks_registered = true;
 }
 
-- 
2.34.1

v10-0001-Distribute-invalidatons-if-change-in-catalog-tab.patch (application/octet-stream)
From 3f4ddd0be632a736b672c3ea4d2e572ba1d1b4ef Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Fri, 23 Aug 2024 14:02:20 +0530
Subject: [PATCH v10 1/2] Distribute invalidatons if change in catalog tables

Distribute invalidations to inprogress transactions if the current
committed transaction change any catalog table.
---
 .../replication/logical/reorderbuffer.c       |   5 +-
 src/backend/replication/logical/snapbuild.c   |  34 ++-
 src/include/replication/reorderbuffer.h       |   4 +
 src/test/subscription/t/100_bugs.pl           | 236 ++++++++++++++++++
 4 files changed, 265 insertions(+), 14 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 22bcf171ff..c5dfc1ab06 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -221,9 +221,6 @@ int			debug_logical_replication_streaming = DEBUG_LOGICAL_REP_STREAMING_BUFFERED
  */
 static ReorderBufferTXN *ReorderBufferGetTXN(ReorderBuffer *rb);
 static void ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static ReorderBufferTXN *ReorderBufferTXNByXid(ReorderBuffer *rb,
-											   TransactionId xid, bool create, bool *is_new,
-											   XLogRecPtr lsn, bool create_as_top);
 static void ReorderBufferTransferSnapToParent(ReorderBufferTXN *txn,
 											  ReorderBufferTXN *subtxn);
 
@@ -622,7 +619,7 @@ ReorderBufferReturnRelids(ReorderBuffer *rb, Oid *relids)
  * (with the given LSN, and as top transaction if that's specified);
  * when this happens, is_new is set to true.
  */
-static ReorderBufferTXN *
+ReorderBufferTXN *
 ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 					  bool *is_new, XLogRecPtr lsn, bool create_as_top)
 {
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 0450f94ba8..42c947651b 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -300,7 +300,7 @@ static void SnapBuildFreeSnapshot(Snapshot snap);
 
 static void SnapBuildSnapIncRefcount(Snapshot snap);
 
-static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
+static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid);
 
 static inline bool SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid,
 												 uint32 xinfo);
@@ -859,18 +859,21 @@ SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
 }
 
 /*
- * Add a new Snapshot to all transactions we're decoding that currently are
- * in-progress so they can see new catalog contents made by the transaction
- * that just committed. This is necessary because those in-progress
- * transactions will use the new catalog's contents from here on (at the very
- * least everything they do needs to be compatible with newer catalog
- * contents).
+ * Add a new Snapshot and invalidation messages to all transactions we're
+ * decoding that currently are in-progress so they can see new catalog contents
+ * made by the transaction that just committed. This is necessary because those
+ * in-progress transactions will use the new catalog's contents from here on
+ * (at the very least everything they do needs to be compatible with newer
+ * catalog contents).
  */
 static void
-SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
+SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid)
 {
 	dlist_iter	txn_i;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *curr_txn;
+
+	curr_txn = ReorderBufferTXNByXid(builder->reorder, xid, false, NULL, InvalidXLogRecPtr, false);
 
 	/*
 	 * Iterate through all toplevel transactions. This can include
@@ -913,6 +916,14 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		SnapBuildSnapIncRefcount(builder->snapshot);
 		ReorderBufferAddSnapshot(builder->reorder, txn->xid, lsn,
 								 builder->snapshot);
+
+		/*
+		 * Add invalidation messages to the reorder buffer of inprogress
+		 * transactions except the current committed transaction
+		 */
+		if (txn->xid != xid && curr_txn->ninvalidations > 0)
+			ReorderBufferAddInvalidations(builder->reorder, txn->xid, lsn,
+										  curr_txn->ninvalidations, curr_txn->invalidations);
 	}
 }
 
@@ -1184,8 +1195,11 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/* refcount of the snapshot builder for the new snapshot */
 		SnapBuildSnapIncRefcount(builder->snapshot);
 
-		/* add a new catalog snapshot to all currently running transactions */
-		SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
+		/*
+		 * add a new catalog snapshot and invalidations messages to all
+		 * currently running transactions
+		 */
+		SnapBuildDistributeNewCatalogSnapshot(builder, lsn, xid);
 	}
 }
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index e332635f70..093d21213a 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -743,6 +743,10 @@ extern TransactionId *ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb);
 
 extern void ReorderBufferSetRestartPoint(ReorderBuffer *rb, XLogRecPtr ptr);
 
+extern ReorderBufferTXN *ReorderBufferTXNByXid(ReorderBuffer *rb,
+											   TransactionId xid, bool create, bool *is_new,
+											   XLogRecPtr lsn, bool create_as_top);
+
 extern void StartupReorderBuffer(void);
 
 #endif
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index cb36ca7b16..77e8a24497 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -487,6 +487,242 @@ $result =
 is( $result, qq(2|f
 3|t), 'check replicated update on subscriber');
 
+# Clean up
+$node_publisher->safe_psql('postgres', qq(DROP  PUBLICATION pub1;));
+$node_subscriber->safe_psql('postgres', qq(DROP  SUBSCRIPTION sub1;));
+
+# The bug was that the incremental data synchronization was being skipped when
+# a new table is added to the publication in presence of a concurrent active
+# transaction performing the DML on the same table.
+
+# Initial setup.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE SCHEMA sch3;
+	CREATE TABLE sch3.tab_conc(a int);
+	CREATE PUBLICATION regress_pub1;
+));
+
+$node_subscriber->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE SCHEMA sch3;
+	CREATE TABLE sch3.tab_conc(a int);
+	CREATE SUBSCRIPTION regress_sub1 CONNECTION '$publisher_connstr' PUBLICATION regress_pub1;
+));
+
+# Bump the query timeout to avoid false negatives on slow test systems.
+my $psql_timeout_secs = 4 * $PostgreSQL::Test::Utils::timeout_default;
+
+# Initiate 3 background sessions.
+my $background_psql1 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+$background_psql1->set_query_timer_restart();
+
+my $background_psql2 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+
+$background_psql2->set_query_timer_restart();
+
+my $background_psql3 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+$background_psql3->set_query_timer_restart();
+
+# Maintain an active transaction with the table that will be added to the
+# publication.
+$background_psql1->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO tab_conc VALUES (1);
+]);
+
+# Maintain an active transaction with a schema table that will be added to the
+# publication.
+$background_psql2->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (1);
+]);
+
+# Add the table to the publication using background_psql, as the alter
+# publication operation will distribute the invalidations to inprogress txns.
+$background_psql3->query_safe(
+	qq[
+	ALTER PUBLICATION regress_pub1 ADD TABLE tab_conc, TABLES IN SCHEMA sch3;
+]);
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe(qq[COMMIT]);
+$background_psql2->query_safe(qq[COMMIT]);
+
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (2);
+	INSERT INTO sch3.tab_conc VALUES (2);
+));
+
+# Refresh the publication.
+$node_subscriber->safe_psql('postgres',
+	'ALTER SUBSCRIPTION regress_sub1 REFRESH PUBLICATION');
+
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'regress_sub1');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2),
+	'Ensure that the data from the tab_conc table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2),
+	'Ensure that the data from the sch3.tab_conc table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+# Perform an insert.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (3);
+	INSERT INTO sch3.tab_conc VALUES (3);
+));
+$node_publisher->wait_for_catchup('regress_sub1');
+
+# Verify that the insert is replicated to the subscriber.
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table sch3.tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+# The bug was that the incremental data synchronization was happening even when
+# tables are dropped from the publication in presence of a concurrent active
+# transaction performing the DML on the same table.
+
+# Maintain an active transaction with the table that will be dropped from the
+# publication.
+$background_psql1->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO tab_conc VALUES (4);
+]);
+
+# Maintain an active transaction with a schema table that will be dropped from the
+# publication.
+$background_psql2->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (4);
+]);
+
+# Drop the table from the publication using background_psql, as the alter
+# publication operation will distribute the invalidations to inprogress txns.
+$background_psql3->query_safe(
+	qq[
+	ALTER PUBLICATION regress_pub1 DROP TABLE tab_conc, TABLES IN SCHEMA sch3;
+]);
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe(qq[COMMIT]);
+$background_psql2->query_safe(qq[COMMIT]);
+
+# Perform an insert.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (5);
+	INSERT INTO sch3.tab_conc VALUES (5);
+));
+
+$node_publisher->wait_for_catchup('regress_sub1');
+
+# Verify that the insert is not replicated to the subscriber.
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3
+4),
+	'Verify that data for table tab_conc are not replicated to subscriber');
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2
+3
+4),
+	'Verify that the incremental data for table sch3.tab_conc are not replicated to subscriber'
+);
+
+# The bug was that the incremental data synchronization was happening even after
+# publication is dropped in a concurrent active transaction.
+
+# Add tables to the publication.
+$background_psql3->query_safe(
+	qq[
+	ALTER PUBLICATION regress_pub1 ADD TABLE tab_conc, TABLES IN SCHEMA sch3;
+]);
+
+# Maintain an active transaction with the table.
+$background_psql1->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO tab_conc VALUES (6);
+]);
+
+# Maintain an active transaction with a schema table.
+$background_psql2->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (6);
+]);
+
+# Drop publication.
+$background_psql3->query_safe(
+	qq[
+	DROP PUBLICATION regress_pub1;
+]);
+
+# Perform an insert.
+$background_psql1->query_safe(
+	qq[
+	INSERT INTO tab_conc VALUES (7);
+]);
+
+$background_psql2->query_safe(
+	qq[
+	INSERT INTO sch3.tab_conc VALUES (7);
+]);
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe(qq[COMMIT]);
+$background_psql2->query_safe(qq[COMMIT]);
+
+# ERROR should appear on subscriber.
+my $offset = -s $node_subscriber->logfile;
+$node_subscriber->wait_for_log(
+	qr/ERROR:  publication "regress_pub1" does not exist/, $offset);
+
+$node_subscriber->safe_psql('postgres', 'DROP SUBSCRIPTION regress_sub1;');
+
+$background_psql1->quit;
+$background_psql2->quit;
+$background_psql3->quit;
+
 $node_publisher->stop('fast');
 $node_subscriber->stop('fast');
 
-- 
2.34.1

#57Shlok Kyal
shlok.kyal.oss@gmail.com
In reply to: Shlok Kyal (#54)
1 attachment(s)
Re: long-standing data loss bug in initial sync of logical replication

On Thu, 26 Sept 2024 at 11:39, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:

In the v7 patch, I am looping through the reorder buffer of the
current committed transaction and storing all invalidation messages in
a list. Then I am distributing those invalidations.
But I found that for a transaction we already store all the
invalidation messages (see [1]). So we don't need to loop through the
reorder buffer and store the invalidations.

I have modified the patch accordingly and attached the same.

[1]: https://github.com/postgres/postgres/blob/7da1bdc2c2f17038f2ae1900be90a0d7b5e361e0/src/include/replication/reorderbuffer.h#L384

Hi,

I tried to add changes to selectively invalidate the cache to reduce
the performance degradation during the distribution of invalidations.

Here is the analysis for selective invalidation.
Observation:
Currently, when there is a change in a publication, the cache for
all the tables is invalidated, including the ones that are not part of
any publication and even tables of different publications. For
example, suppose pub1 includes tables t1 to t1000, while pub2 contains
just table t1001. If pub2 is altered, even though it only has t1001,
this change will also invalidate all the tables t1 through t1000 in
pub1.
Similarly for a namespace, whenever we alter a schema or add/drop a
schema to/from a publication, the cache for all the tables is
invalidated, including the ones that are in a different schema. For
example, suppose pub1 includes tables t1 to t1000 in schema sc1, while
pub2 contains just table t1001 in schema sc2. If schema ‘sc2’ is
changed or if it is dropped from publication ‘pub2’ even though it
only has t1001, this change will invalidate all the tables t1 through
t1000 in schema sc1.
In both cases above, the ‘rel_sync_cache_publication_cb’ function is
called while executing the invalidations, and it invalidates all the
tables in the cache.

Solution:
1. When we alter a publication using commands like ‘ALTER PUBLICATION
pub_name DROP TABLE table_name’, first all tables in the publications
are invalidated using the function ‘rel_sync_cache_relation_cb’. Then
again ‘rel_sync_cache_publication_cb’ function is called which
invalidates all the tables. This happens because of the following
callback registered:
CacheRegisterSyscacheCallback(PUBLICATIONRELMAP,
rel_sync_cache_publication_cb, (Datum) 0);

So, I feel this second function call can be avoided. And I have
included changes for the same in the patch. Now the behavior will be as
follows (see the SQL sketch after point 3 below):
suppose pub1 includes tables t1 to t1000, while pub2 contains just
table t1001. If pub2 is altered, it will only invalidate t1001.

2. When we add/drop a schema to/from a publication using command like
‘ALTER PUBLICATION pub_name ADD TABLES in SCHEMA schema_name’, first
all tables in that schema are invalidated using
‘rel_sync_cache_relation_cb’ and then again
‘rel_sync_cache_publication_cb’ function is called which invalidates
all the tables. This happens because of the following callback
registered:
CacheRegisterSyscacheCallback(PUBLICATIONNAMESPACEMAP,
rel_sync_cache_publication_cb, (Datum) 0);

So, I feel this second function call can be avoided. And I have
included changes for the same in the patch. Now the behavior will be as
follows:
suppose pub1 includes tables t1 to t1000 in schema sc1, while pub2
contains just table t1001 in schema sc2. If schema ‘sc2’ dropped from
publication ‘pub2’, it will only invalidate table t1001.

3. When we alter a namespace using command like ‘ALTER SCHEMA
schema_name RENAME TO new_schema_name’, all the tables in the cache are
invalidated as ‘rel_sync_cache_publication_cb’ is called due to the
following registered callback:
CacheRegisterSyscacheCallback(NAMESPACEOID,
rel_sync_cache_publication_cb, (Datum) 0);

So, we added a new callback function, ‘rel_sync_cache_namespacerel_cb’,
which is called instead of ‘rel_sync_cache_publication_cb’ and
invalidates only the cache of the tables that are part of that
particular namespace. For the new function, the ‘namespace id’ is added
to the invalidation message.

For example, if namespace ‘sc1’ has tables t1 and t2 and namespace
‘sc2’ has table t3, then renaming namespace ‘sc1’ to ‘sc_new’
invalidates only the tables in sc1, i.e. t1 and t2.
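
To make point 1 concrete, here is a minimal SQL sketch of the scenario
(the table and publication names are purely illustrative and not taken
from the patch or its tests):

```
-- pub1 publishes many tables, pub2 publishes a single unrelated table.
CREATE TABLE t1 (a int);
CREATE TABLE t2 (a int);                    -- stand-ins for t1 .. t1000
CREATE TABLE t1001 (a int);

CREATE PUBLICATION pub1 FOR TABLE t1, t2;
CREATE PUBLICATION pub2 FOR TABLE t1001;

-- On HEAD this also invalidates the RelationSyncCache entries of pub1's
-- tables; with selective invalidation only the entry for t1001 is
-- invalidated.
ALTER PUBLICATION pub2 DROP TABLE t1001;
```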

Performance Comparison:
I have run the same tests as shared in [1] and observed a significant
decrease in the degradation with the new changes. With selective
invalidation, the degradation is around ~5%. These results are an average of
3 runs.

count | Head (sec) | Fix (sec) | Degradation (%)
-----------------------------------------------------------------------------------------
10000 | 0.38842567 | 0.405057 | 4.281727827
50000 | 7.22018834 | 7.605011334 | 5.329819333
75000 | 15.627181 | 16.38659034 | 4.859541462
100000 | 27.37910867 | 28.8636873 | 5.422304458

I have attached the patch for the same
v9-0001 : distribute invalidation to inprogress transaction
v9-0002: Selective invalidation

[1]:/messages/by-id/CANhcyEW4pq6+PO_eFn2q=23sgV1budN3y4SxpYBaKMJNADSDuA@mail.gmail.com

I have also prepared a bar chart comparing the performance of HEAD, the
0001 patch, and the (0001+0002) patches, and attached it here.

Thanks and Regards,
Shlok Kyal

Attachments:

performance_comparison.PNG (image/png)
#58Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Shlok Kyal (#56)
RE: long-standing data loss bug in initial sync of logical replication

Dear Shlok,

I have addressed the comment on the 0002 patch and attached the patches.
Also, I have moved the tests from the 0002 patch to the 0001 patch.

Thanks for updating the patch. The 0002 patch seems to remove the cache invalidations
from publication_invalidation_cb(). Related to that, I found an issue and have a concern.

1.
The replication continues even after ALTER PUBLICATION RENAME is executed.
For example - assuming that a subscriber subscribes only "pub":

```
pub=# INSERT INTO tab values (1);
INSERT 0 1
pub=# ALTER PUBLICATION pub RENAME TO pub1;
ALTER PUBLICATION
pub=# INSERT INTO tab values (2);
INSERT 0 1

sub=# SELECT * FROM tab ; -- (2) should not be replicated however...
a
---
1
2
(2 rows)
```

This happens because 1) the ALTER PUBLICATION RENAME statement does not invalidate the
relation cache, and 2) publications are reloaded only when an invalid RelationSyncEntry
is found. In the given example, the first INSERT creates a valid cache entry and the
second INSERT reuses it. Therefore, the pubname check is skipped.

For now, the actual renaming is done in AlterObjectRename_internal(), a generic
function. I think we must implement a dedicated function for publications and
invalidate the relcaches there.

2.
Similarly to the above, the relcache won't be invalidated when ALTER PUBLICATION
OWNER TO is executed. This means that privilege checks may be ignored if the entry
is still valid. Not sure, but is there a possibility this causes an inconsistency?
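
For illustration, the kind of sequence I have in mind (the role and object
names are made up), analogous to the RENAME example above:

```
CREATE ROLE regress_newowner;

-- Assume the walsender has already built a valid RelationSyncEntry for tab.
ALTER PUBLICATION pub OWNER TO regress_newowner;

-- Without an invalidation, decoding of this insert can keep using the
-- entry that was built under the previous owner.
INSERT INTO tab VALUES (3);
```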

Best regards,
Hayato Kuroda
FUJITSU LIMITED

#59Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Hayato Kuroda (Fujitsu) (#58)
RE: long-standing data loss bug in initial sync of logical replication

2.
Similarly to the above, the relcache won't be invalidated when ALTER PUBLICATION
OWNER TO is executed. This means that privilege checks may be ignored if the entry
is still valid. Not sure, but is there a possibility this causes an inconsistency?

Hmm, IIUC, the attribute pubowner is not used for now. The paragraph
"There are currently no privileges on publications...." [1] may show the current
status. However, to keep the current behavior, I suggest invalidating the relcache
of the publication's relations when the owner is altered.

[1]: https://www.postgresql.org/docs/devel/logical-replication-security.html

Best regards,
Hayato Kuroda
FUJITSU LIMITED

#60Shlok Kyal
shlok.kyal.oss@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#58)
2 attachment(s)
Re: long-standing data loss bug in initial sync of logical replication

I have addressed the comment on the 0002 patch and attached the patches.
Also, I have moved the tests from the 0002 patch to the 0001 patch.

Thanks for updating the patch. The 0002 patch seems to remove the cache invalidations
from publication_invalidation_cb(). Related to that, I found an issue and have a concern.

1.
The replication continues even after ALTER PUBLICATION RENAME is executed.
For example - assuming that a subscriber subscribes only "pub":

```
pub=# INSERT INTO tab values (1);
INSERT 0 1
pub=# ALTER PUBLICATION pub RENAME TO pub1;
ALTER PUBLICATION
pub=# INSERT INTO tab values (2);
INSERT 0 1

sub=# SELECT * FROM tab ; -- (2) should not be replicated however...
a
---
1
2
(2 rows)
```

This happens because 1) the ALTER PUBLICATION RENAME statement does not invalidate the
relation cache, and 2) publications are reloaded only when an invalid RelationSyncEntry
is found. In the given example, the first INSERT creates a valid cache entry and the
second INSERT reuses it. Therefore, the pubname check is skipped.

For now, the actual renaming is done in AlterObjectRename_internal(), a generic
function. I think we must implement a dedicated function for publications and
invalidate the relcaches there.

2.
Similarly to the above, the relcache won't be invalidated when ALTER PUBLICATION
OWNER TO is executed. This means that privilege checks may be ignored if the entry
is still valid. Not sure, but is there a possibility this causes an inconsistency?

Hi Kuroda-san,

Thanks for testing the patch. I have addressed the comments and attached
the updated patches.
I have added a callback function, rel_sync_cache_publicationrel_cb(),
which invalidates the cache entries of the tables in a particular
publication. This callback is called when there is some modification in
the pg_publication catalog.

I have tested and debugged the two cases 'ALTER PUBLICATION ... RENAME TO ...'
and 'ALTER PUBLICATION ... OWNER TO ...'. The newly added callback is called
and invalidates the cache of the tables present in that particular publication.
I have also added a test for 'ALTER PUBLICATION ... RENAME TO ...' to the
0001 patch.
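
Roughly, the behaviour the new RENAME test checks looks like this (a sketch
based on the example upthread, not the exact test code):

```
-- The subscription on the subscriber side lists only publication "pub".
INSERT INTO tab VALUES (1);    -- replicated as before

ALTER PUBLICATION pub RENAME TO pub1;

-- With the new callback the rename invalidates the cached entry for tab,
-- so the stale entry is not reused; per the example upthread, this row
-- should no longer be replicated under the old publication name.
INSERT INTO tab VALUES (2);
```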

Thanks and Regards,
Shlok Kyal

Attachments:

v11-0001-Distribute-invalidatons-if-change-in-catalog-tab.patch (application/octet-stream)
From 7eacfcd36cf86b35c443d32f4e7392a0f5a274dc Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Fri, 23 Aug 2024 14:02:20 +0530
Subject: [PATCH v11 1/2] Distribute invalidatons if change in catalog tables

Distribute invalidations to inprogress transactions if the current
committed transaction change any catalog table.
---
 .../replication/logical/reorderbuffer.c       |   5 +-
 src/backend/replication/logical/snapbuild.c   |  34 +-
 src/include/replication/reorderbuffer.h       |   4 +
 src/test/subscription/t/100_bugs.pl           | 291 ++++++++++++++++++
 4 files changed, 320 insertions(+), 14 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 22bcf171ff..c5dfc1ab06 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -221,9 +221,6 @@ int			debug_logical_replication_streaming = DEBUG_LOGICAL_REP_STREAMING_BUFFERED
  */
 static ReorderBufferTXN *ReorderBufferGetTXN(ReorderBuffer *rb);
 static void ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static ReorderBufferTXN *ReorderBufferTXNByXid(ReorderBuffer *rb,
-											   TransactionId xid, bool create, bool *is_new,
-											   XLogRecPtr lsn, bool create_as_top);
 static void ReorderBufferTransferSnapToParent(ReorderBufferTXN *txn,
 											  ReorderBufferTXN *subtxn);
 
@@ -622,7 +619,7 @@ ReorderBufferReturnRelids(ReorderBuffer *rb, Oid *relids)
  * (with the given LSN, and as top transaction if that's specified);
  * when this happens, is_new is set to true.
  */
-static ReorderBufferTXN *
+ReorderBufferTXN *
 ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 					  bool *is_new, XLogRecPtr lsn, bool create_as_top)
 {
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 0450f94ba8..42c947651b 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -300,7 +300,7 @@ static void SnapBuildFreeSnapshot(Snapshot snap);
 
 static void SnapBuildSnapIncRefcount(Snapshot snap);
 
-static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
+static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid);
 
 static inline bool SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid,
 												 uint32 xinfo);
@@ -859,18 +859,21 @@ SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
 }
 
 /*
- * Add a new Snapshot to all transactions we're decoding that currently are
- * in-progress so they can see new catalog contents made by the transaction
- * that just committed. This is necessary because those in-progress
- * transactions will use the new catalog's contents from here on (at the very
- * least everything they do needs to be compatible with newer catalog
- * contents).
+ * Add a new Snapshot and invalidation messages to all transactions we're
+ * decoding that currently are in-progress so they can see new catalog contents
+ * made by the transaction that just committed. This is necessary because those
+ * in-progress transactions will use the new catalog's contents from here on
+ * (at the very least everything they do needs to be compatible with newer
+ * catalog contents).
  */
 static void
-SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
+SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid)
 {
 	dlist_iter	txn_i;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *curr_txn;
+
+	curr_txn = ReorderBufferTXNByXid(builder->reorder, xid, false, NULL, InvalidXLogRecPtr, false);
 
 	/*
 	 * Iterate through all toplevel transactions. This can include
@@ -913,6 +916,14 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		SnapBuildSnapIncRefcount(builder->snapshot);
 		ReorderBufferAddSnapshot(builder->reorder, txn->xid, lsn,
 								 builder->snapshot);
+
+		/*
+		 * Add invalidation messages to the reorder buffer of inprogress
+		 * transactions except the current committed transaction
+		 */
+		if (txn->xid != xid && curr_txn->ninvalidations > 0)
+			ReorderBufferAddInvalidations(builder->reorder, txn->xid, lsn,
+										  curr_txn->ninvalidations, curr_txn->invalidations);
 	}
 }
 
@@ -1184,8 +1195,11 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/* refcount of the snapshot builder for the new snapshot */
 		SnapBuildSnapIncRefcount(builder->snapshot);
 
-		/* add a new catalog snapshot to all currently running transactions */
-		SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
+		/*
+		 * add a new catalog snapshot and invalidations messages to all
+		 * currently running transactions
+		 */
+		SnapBuildDistributeNewCatalogSnapshot(builder, lsn, xid);
 	}
 }
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index e332635f70..093d21213a 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -743,6 +743,10 @@ extern TransactionId *ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb);
 
 extern void ReorderBufferSetRestartPoint(ReorderBuffer *rb, XLogRecPtr ptr);
 
+extern ReorderBufferTXN *ReorderBufferTXNByXid(ReorderBuffer *rb,
+											   TransactionId xid, bool create, bool *is_new,
+											   XLogRecPtr lsn, bool create_as_top);
+
 extern void StartupReorderBuffer(void);
 
 #endif
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index cb36ca7b16..bdabe53e42 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -487,6 +487,297 @@ $result =
 is( $result, qq(2|f
 3|t), 'check replicated update on subscriber');
 
+# Clean up
+$node_publisher->safe_psql('postgres', qq(DROP  PUBLICATION pub1;));
+$node_subscriber->safe_psql('postgres', qq(DROP  SUBSCRIPTION sub1;));
+
+# The bug was that the incremental data synchronization was being skipped when
+# a new table is added to the publication in presence of a concurrent active
+# transaction performing the DML on the same table.
+
+# Initial setup.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE SCHEMA sch3;
+	CREATE TABLE sch3.tab_conc(a int);
+	CREATE PUBLICATION regress_pub1;
+));
+
+$node_subscriber->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE SCHEMA sch3;
+	CREATE TABLE sch3.tab_conc(a int);
+	CREATE SUBSCRIPTION regress_sub1 CONNECTION '$publisher_connstr' PUBLICATION regress_pub1;
+));
+
+# Bump the query timeout to avoid false negatives on slow test systems.
+my $psql_timeout_secs = 4 * $PostgreSQL::Test::Utils::timeout_default;
+
+# Initiate 3 background sessions.
+my $background_psql1 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+$background_psql1->set_query_timer_restart();
+
+my $background_psql2 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+
+$background_psql2->set_query_timer_restart();
+
+my $background_psql3 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+$background_psql3->set_query_timer_restart();
+
+# Maintain an active transaction with the table that will be added to the
+# publication.
+$background_psql1->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO tab_conc VALUES (1);
+]);
+
+# Maintain an active transaction with a schema table that will be added to the
+# publication.
+$background_psql2->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (1);
+]);
+
+# Add the table to the publication using background_psql, as the alter
+# publication operation will distribute the invalidations to inprogress txns.
+$background_psql3->query_safe(
+	qq[
+	ALTER PUBLICATION regress_pub1 ADD TABLE tab_conc, TABLES IN SCHEMA sch3;
+]);
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe(qq[COMMIT]);
+$background_psql2->query_safe(qq[COMMIT]);
+
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (2);
+	INSERT INTO sch3.tab_conc VALUES (2);
+));
+
+# Refresh the publication.
+$node_subscriber->safe_psql('postgres',
+	'ALTER SUBSCRIPTION regress_sub1 REFRESH PUBLICATION');
+
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'regress_sub1');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2),
+	'Ensure that the data from the tab_conc table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2),
+	'Ensure that the data from the sch3.tab_conc table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+# Perform an insert.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (3);
+	INSERT INTO sch3.tab_conc VALUES (3);
+));
+$node_publisher->wait_for_catchup('regress_sub1');
+
+# Verify that the insert is replicated to the subscriber.
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table sch3.tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+# The bug was that the incremental data synchronization was happening even when
+# tables are dropped from the publication in presence of a concurrent active
+# transaction performing the DML on the same table.
+
+# Maintain an active transaction with the table that will be dropped from the
+# publication.
+$background_psql1->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO tab_conc VALUES (4);
+]);
+
+# Maintain an active transaction with a schema table that will be dropped from the
+# publication.
+$background_psql2->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (4);
+]);
+
+# Drop the table from the publication using background_psql, as the alter
+# publication operation will distribute the invalidations to inprogress txns.
+$background_psql3->query_safe(
+	qq[
+	ALTER PUBLICATION regress_pub1 DROP TABLE tab_conc, TABLES IN SCHEMA sch3;
+]);
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe(qq[COMMIT]);
+$background_psql2->query_safe(qq[COMMIT]);
+
+# Perform an insert.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (5);
+	INSERT INTO sch3.tab_conc VALUES (5);
+));
+
+$node_publisher->wait_for_catchup('regress_sub1');
+
+# Verify that the insert is not replicated to the subscriber.
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3
+4),
+	'Verify that data for table tab_conc are not replicated to subscriber');
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2
+3
+4),
+	'Verify that the incremental data for table sch3.tab_conc are not replicated to subscriber'
+);
+
+# The bug was that the incremental data synchronization was happening even after
+# publication is dropped in a concurrent active transaction.
+
+# Add tables to the publication.
+$background_psql3->query_safe(
+	qq[
+	ALTER PUBLICATION regress_pub1 ADD TABLE tab_conc, TABLES IN SCHEMA sch3;
+]);
+
+# Maintain an active transaction with the table.
+$background_psql1->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO tab_conc VALUES (6);
+]);
+
+# Maintain an active transaction with a schema table.
+$background_psql2->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (6);
+]);
+
+# Drop publication.
+$background_psql3->query_safe(
+	qq[
+	DROP PUBLICATION regress_pub1;
+]);
+
+# Perform an insert.
+$background_psql1->query_safe(
+	qq[
+	INSERT INTO tab_conc VALUES (7);
+]);
+
+$background_psql2->query_safe(
+	qq[
+	INSERT INTO sch3.tab_conc VALUES (7);
+]);
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe(qq[COMMIT]);
+$background_psql2->query_safe(qq[COMMIT]);
+
+# ERROR should appear on subscriber.
+my $offset = -s $node_subscriber->logfile;
+$node_subscriber->wait_for_log(
+	qr/ERROR:  publication "regress_pub1" does not exist/, $offset);
+
+$node_subscriber->safe_psql('postgres', 'DROP SUBSCRIPTION regress_sub1;');
+
+# The bug was that the incremental data synchronization was happening even after
+# publication is renamed in a concurrent active transaction.
+
+# Create publication.
+$background_psql3->query_safe(
+	qq[
+	CREATE PUBLICATION regress_pub1 FOR TABLE tab_conc, TABLES IN SCHEMA sch3;
+]);
+
+# Create subscription.
+$node_subscriber->safe_psql(
+	'postgres', qq(
+	CREATE SUBSCRIPTION regress_sub1 CONNECTION '$publisher_connstr' PUBLICATION regress_pub1;
+));
+
+# Maintain an active transaction with the table.
+$background_psql1->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO tab_conc VALUES (8);
+]);
+
+# Maintain an active transaction with a schema table.
+$background_psql2->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (8);
+]);
+
+# Rename publication.
+$background_psql3->query_safe(
+	qq[
+	ALTER PUBLICATION regress_pub1 RENAME TO regress_pub1_rename;
+]);
+
+# Perform an insert.
+$background_psql1->query_safe(
+	qq[
+	INSERT INTO tab_conc VALUES (9);
+]);
+
+$background_psql2->query_safe(
+	qq[
+	INSERT INTO sch3.tab_conc VALUES (9);
+]);
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe(qq[COMMIT]);
+$background_psql2->query_safe(qq[COMMIT]);
+
+# ERROR should appear on subscriber.
+$offset = -s $node_subscriber->logfile;
+$node_subscriber->wait_for_log(
+	qr/ERROR:  publication "regress_pub1" does not exist/, $offset);
+
+$background_psql1->quit;
+$background_psql2->quit;
+$background_psql3->quit;
+
 $node_publisher->stop('fast');
 $node_subscriber->stop('fast');
 
-- 
2.34.1

v11-0002-Selective-Invalidation-of-Cache.patch (application/octet-stream)
From bda601d01c2cba6a6544e36d51fc62b79d8a3f53 Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Fri, 27 Sep 2024 16:04:54 +0530
Subject: [PATCH v11 2/2] Selective Invalidation of Cache

When we alter a publication, add/drop namespace to/from publication
all the cache for all the tables are invalidated.
With this patch for the above operationns we will invalidate the
cache of only the desired tables.

Added a new callback function 'rel_sync_cache_publicationrel_cb' which
is called when there is any change in pg_publication catalog and it
invalidates the tables present in the publication modified.
---
 src/backend/replication/pgoutput/pgoutput.c |  58 ++++++---
 src/backend/utils/cache/inval.c             | 130 +++++++++++++++++++-
 src/include/storage/sinval.h                |   9 ++
 src/include/utils/inval.h                   |   4 +
 4 files changed, 179 insertions(+), 22 deletions(-)

diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 00e7024563..1d80d27d0f 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -132,6 +132,8 @@ typedef struct RelationSyncEntry
 	List	   *streamed_txns;	/* streamed toplevel transactions with this
 								 * schema */
 
+	List	   *pub_ids;
+
 	/* are we publishing this rel? */
 	PublicationActions pubactions;
 
@@ -216,6 +218,7 @@ static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data,
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
+static void rel_sync_cache_publicationrel_cb(Datum arg, Oid pubid);
 static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
 											TransactionId xid);
 static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
@@ -1134,7 +1137,7 @@ init_tuple_slot(PGOutputData *data, Relation relation,
 	TupleDesc	oldtupdesc;
 	TupleDesc	newtupdesc;
 
-	oldctx = MemoryContextSwitchTo(data->cachectx);
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
 
 	/*
 	 * Create tuple table slots. Create a copy of the TupleDesc as it needs to
@@ -1739,12 +1742,6 @@ static void
 publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 {
 	publications_valid = false;
-
-	/*
-	 * Also invalidate per-relation cache so that next time the filtering info
-	 * is checked it will be updated with the new publication settings.
-	 */
-	rel_sync_cache_publication_cb(arg, cacheid, hashvalue);
 }
 
 /*
@@ -1920,17 +1917,7 @@ init_rel_sync_cache(MemoryContext cachectx)
 								  rel_sync_cache_publication_cb,
 								  (Datum) 0);
 
-	/*
-	 * Flush all cache entries after any publication changes.  (We need no
-	 * callback entry for pg_publication, because publication_invalidation_cb
-	 * will take care of it.)
-	 */
-	CacheRegisterSyscacheCallback(PUBLICATIONRELMAP,
-								  rel_sync_cache_publication_cb,
-								  (Datum) 0);
-	CacheRegisterSyscacheCallback(PUBLICATIONNAMESPACEMAP,
-								  rel_sync_cache_publication_cb,
-								  (Datum) 0);
+	CacheRegisterPubcacheCallback(rel_sync_cache_publicationrel_cb, (Datum) 0);
 
 	relation_callbacks_registered = true;
 }
@@ -2000,6 +1987,7 @@ get_rel_sync_entry(PGOutputData *data, Relation relation)
 		entry->publish_as_relid = InvalidOid;
 		entry->columns = NULL;
 		entry->attrmap = NULL;
+		entry->pub_ids = NIL;
 	}
 
 	/* Validate the entry */
@@ -2044,6 +2032,8 @@ get_rel_sync_entry(PGOutputData *data, Relation relation)
 		entry->schema_sent = false;
 		list_free(entry->streamed_txns);
 		entry->streamed_txns = NIL;
+		list_free(entry->pub_ids);
+		entry->pub_ids = NIL;
 		bms_free(entry->columns);
 		entry->columns = NULL;
 		entry->pubactions.pubinsert = false;
@@ -2108,6 +2098,10 @@ get_rel_sync_entry(PGOutputData *data, Relation relation)
 
 					pub_relid = llast_oid(ancestors);
 					ancestor_level = list_length(ancestors);
+
+					oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+					entry->pub_ids = lappend_oid(entry->pub_ids, pub->oid);
+					MemoryContextSwitchTo(oldctx);
 				}
 			}
 
@@ -2145,7 +2139,12 @@ get_rel_sync_entry(PGOutputData *data, Relation relation)
 				if (list_member_oid(pubids, pub->oid) ||
 					list_member_oid(schemaPubids, pub->oid) ||
 					ancestor_published)
+				{
 					publish = true;
+					oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+					entry->pub_ids = lappend_oid(entry->pub_ids, pub->oid);
+					MemoryContextSwitchTo(oldctx);
+				}
 			}
 
 			/*
@@ -2318,6 +2317,29 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	}
 }
 
+/*
+ * Publication invalidation callback
+ */
+static void
+rel_sync_cache_publicationrel_cb(Datum arg, Oid pubid)
+{
+	HASH_SEQ_STATUS status;
+	RelationSyncEntry *entry;
+
+	if (RelationSyncCache == NULL)
+		return;
+
+	hash_seq_init(&status, RelationSyncCache);
+	while ((entry = (RelationSyncEntry *) hash_seq_search(&status)) != NULL)
+	{
+		if (entry->replicate_valid && list_member_oid(entry->pub_ids, pubid))
+		{
+			entry->replicate_valid = false;
+			entry->pub_ids = NIL;
+		}
+	}
+}
+
 /*
  * Publication relation/schema map syscache invalidation callback
  *
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 603aa4157b..a34be79ee6 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -160,6 +160,9 @@
  */
 #define CatCacheMsgs 0
 #define RelCacheMsgs 1
+#define PubCacheMsgs 2
+
+#define NumberofCache 3
 
 /* Pointers to main arrays in TopTransactionContext */
 typedef struct InvalMessageArray
@@ -168,13 +171,13 @@ typedef struct InvalMessageArray
 	int			maxmsgs;		/* current allocated size of array */
 } InvalMessageArray;
 
-static InvalMessageArray InvalMessageArrays[2];
+static InvalMessageArray InvalMessageArrays[NumberofCache];
 
 /* Control information for one logical group of messages */
 typedef struct InvalidationMsgsGroup
 {
-	int			firstmsg[2];	/* first index in relevant array */
-	int			nextmsg[2];		/* last+1 index */
+	int			firstmsg[NumberofCache];	/* first index in relevant array */
+	int			nextmsg[NumberofCache];		/* last+1 index */
 } InvalidationMsgsGroup;
 
 /* Macros to help preserve InvalidationMsgsGroup abstraction */
@@ -189,6 +192,7 @@ typedef struct InvalidationMsgsGroup
 	do { \
 		SetSubGroupToFollow(targetgroup, priorgroup, CatCacheMsgs); \
 		SetSubGroupToFollow(targetgroup, priorgroup, RelCacheMsgs); \
+		SetSubGroupToFollow(targetgroup, priorgroup, PubCacheMsgs); \
 	} while (0)
 
 #define NumMessagesInSubGroup(group, subgroup) \
@@ -196,7 +200,8 @@ typedef struct InvalidationMsgsGroup
 
 #define NumMessagesInGroup(group) \
 	(NumMessagesInSubGroup(group, CatCacheMsgs) + \
-	 NumMessagesInSubGroup(group, RelCacheMsgs))
+	 NumMessagesInSubGroup(group, RelCacheMsgs) + \
+	 NumMessagesInSubGroup(group, PubCacheMsgs))
 
 
 /*----------------
@@ -251,6 +256,7 @@ int			debug_discard_caches = 0;
 
 #define MAX_SYSCACHE_CALLBACKS 64
 #define MAX_RELCACHE_CALLBACKS 10
+#define MAX_PUBCACHE_CALLBACKS 10
 
 static struct SYSCACHECALLBACK
 {
@@ -272,6 +278,14 @@ static struct RELCACHECALLBACK
 
 static int	relcache_callback_count = 0;
 
+static struct PUBCACHECALLBACK
+{
+	PubcacheCallbackFunction function;
+	Datum		arg;
+}			pubcache_callback_list[MAX_PUBCACHE_CALLBACKS];
+
+static int	pubcache_callback_count = 0;
+
 /* ----------------------------------------------------------------
  *				Invalidation subgroup support functions
  * ----------------------------------------------------------------
@@ -464,6 +478,38 @@ AddRelcacheInvalidationMessage(InvalidationMsgsGroup *group,
 	AddInvalidationMessage(group, RelCacheMsgs, &msg);
 }
 
+/*
+ * Add a publication inval entry
+ */
+static void
+AddPubcacheInvalidationMessage(InvalidationMsgsGroup *group,
+							   Oid dbId, Oid pubId)
+{
+	SharedInvalidationMessage msg;
+
+	/*
+	 * Don't add a duplicate item. We assume dbId need not be checked because
+	 * it will never change. InvalidOid for relId means all relations so we
+	 * don't need to add individual ones when it is present.
+	 */
+
+	ProcessMessageSubGroup(group, PubCacheMsgs,
+						   if (msg->pc.id == SHAREDINVALPUBCACHE_ID &&
+							   (msg->pc.pubId == pubId ||
+								msg->pc.pubId == InvalidOid))
+						   return);
+
+
+	/* OK, add the item */
+	msg.pc.id = SHAREDINVALPUBCACHE_ID;
+	msg.pc.dbId = dbId;
+	msg.pc.pubId = pubId;
+	/* check AddCatcacheInvalidationMessage() for an explanation */
+	VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg));
+
+	AddInvalidationMessage(group, PubCacheMsgs, &msg);
+}
+
 /*
  * Add a snapshot inval entry
  *
@@ -502,6 +548,7 @@ AppendInvalidationMessages(InvalidationMsgsGroup *dest,
 {
 	AppendInvalidationMessageSubGroup(dest, src, CatCacheMsgs);
 	AppendInvalidationMessageSubGroup(dest, src, RelCacheMsgs);
+	AppendInvalidationMessageSubGroup(dest, src, PubCacheMsgs);
 }
 
 /*
@@ -516,6 +563,7 @@ ProcessInvalidationMessages(InvalidationMsgsGroup *group,
 {
 	ProcessMessageSubGroup(group, CatCacheMsgs, func(msg));
 	ProcessMessageSubGroup(group, RelCacheMsgs, func(msg));
+	ProcessMessageSubGroup(group, PubCacheMsgs, func(msg));
 }
 
 /*
@@ -528,6 +576,7 @@ ProcessInvalidationMessagesMulti(InvalidationMsgsGroup *group,
 {
 	ProcessMessageSubGroupMulti(group, CatCacheMsgs, func(msgs, n));
 	ProcessMessageSubGroupMulti(group, RelCacheMsgs, func(msgs, n));
+	ProcessMessageSubGroupMulti(group, PubCacheMsgs, func(msgs, n));
 }
 
 /* ----------------------------------------------------------------
@@ -590,6 +639,18 @@ RegisterRelcacheInvalidation(Oid dbId, Oid relId)
 		transInvalInfo->RelcacheInitFileInval = true;
 }
 
+/*
+ * RegisterPubcacheInvalidation
+ *
+ * As above, but register a publication invalidation event.
+ */
+static void
+RegisterPubcacheInvalidation(Oid dbId, Oid pubId)
+{
+	AddPubcacheInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
+								   dbId, pubId);
+}
+
 /*
  * RegisterSnapshotInvalidation
  *
@@ -660,6 +721,8 @@ PrepareInvalidationState(void)
 		InvalMessageArrays[CatCacheMsgs].maxmsgs = 0;
 		InvalMessageArrays[RelCacheMsgs].msgs = NULL;
 		InvalMessageArrays[RelCacheMsgs].maxmsgs = 0;
+		InvalMessageArrays[PubCacheMsgs].msgs = NULL;
+		InvalMessageArrays[PubCacheMsgs].maxmsgs = 0;
 	}
 
 	transInvalInfo = myInfo;
@@ -773,6 +836,20 @@ LocalExecuteInvalidationMessage(SharedInvalidationMessage *msg)
 		else if (msg->sn.dbId == MyDatabaseId)
 			InvalidateCatalogSnapshot();
 	}
+	else if (msg->id == SHAREDINVALPUBCACHE_ID)
+	{
+		if (msg->pc.dbId == MyDatabaseId || msg->pc.dbId == InvalidOid)
+		{
+			int			i;
+
+			for (i = 0; i < pubcache_callback_count; i++)
+			{
+				struct PUBCACHECALLBACK *pcitem = pubcache_callback_list + i;
+
+				pcitem->function(pcitem->arg, msg->pc.pubId);
+			}
+		}
+	}
 	else
 		elog(FATAL, "unrecognized SI message ID: %d", msg->id);
 }
@@ -944,6 +1021,18 @@ xactGetCommittedInvalidationMessages(SharedInvalidationMessage **msgs,
 										msgs,
 										n * sizeof(SharedInvalidationMessage)),
 								 nmsgs += n));
+	ProcessMessageSubGroupMulti(&transInvalInfo->PriorCmdInvalidMsgs,
+								PubCacheMsgs,
+								(memcpy(msgarray + nmsgs,
+										msgs,
+										n * sizeof(SharedInvalidationMessage)),
+								 nmsgs += n));
+	ProcessMessageSubGroupMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+								PubCacheMsgs,
+								(memcpy(msgarray + nmsgs,
+										msgs,
+										n * sizeof(SharedInvalidationMessage)),
+								 nmsgs += n));
 	Assert(nmsgs == nummsgs);
 
 	return nmsgs;
@@ -1312,6 +1401,17 @@ CacheInvalidateHeapTuple(Relation relation,
 		else
 			return;
 	}
+	else if (tupleRelId == PublicationRelationId)
+	{
+		Form_pg_publication pubtup = (Form_pg_publication) GETSTRUCT(tuple);
+
+		/* get publication id */
+		relationId = pubtup->oid;
+		databaseId = MyDatabaseId;
+
+		RegisterPubcacheInvalidation(databaseId, relationId);
+		return;
+	}
 	else
 		return;
 
@@ -1567,6 +1667,25 @@ CacheRegisterRelcacheCallback(RelcacheCallbackFunction func,
 	++relcache_callback_count;
 }
 
+/*
+ * CacheRegisterPubcacheCallback
+ *		Register the specified function to be called for all future
+ *		publication invalidation events.  The OID of the publication being
+ *		invalidated will be passed to the function.
+ */
+void
+CacheRegisterPubcacheCallback(PubcacheCallbackFunction func,
+							  Datum arg)
+{
+	if (pubcache_callback_count >= MAX_PUBCACHE_CALLBACKS)
+		elog(FATAL, "out of pubcache_callback_list slots");
+
+	pubcache_callback_list[pubcache_callback_count].function = func;
+	pubcache_callback_list[pubcache_callback_count].arg = arg;
+
+	++pubcache_callback_count;
+}
+
 /*
  * CallSyscacheCallbacks
  *
@@ -1629,6 +1748,9 @@ LogLogicalInvalidations(void)
 		ProcessMessageSubGroupMulti(group, RelCacheMsgs,
 									XLogRegisterData((char *) msgs,
 													 n * sizeof(SharedInvalidationMessage)));
+		ProcessMessageSubGroupMulti(group, PubCacheMsgs,
+									XLogRegisterData((char *) msgs,
+													 n * sizeof(SharedInvalidationMessage)));
 		XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
 	}
 }
diff --git a/src/include/storage/sinval.h b/src/include/storage/sinval.h
index 8f5744b21b..9a97268b0a 100644
--- a/src/include/storage/sinval.h
+++ b/src/include/storage/sinval.h
@@ -110,6 +110,14 @@ typedef struct
 	Oid			relId;			/* relation ID */
 } SharedInvalSnapshotMsg;
 
+#define SHAREDINVALPUBCACHE_ID	(-6)
+typedef struct
+{
+	int8		id;				/* type field --- must be first */
+	Oid			dbId;			/* database ID, or 0 if a shared relation */
+	Oid			pubId;			/* publication ID */
+} SharedInvalPubcacheMsg;
+
 typedef union
 {
 	int8		id;				/* type field --- must be first */
@@ -119,6 +127,7 @@ typedef union
 	SharedInvalSmgrMsg sm;
 	SharedInvalRelmapMsg rm;
 	SharedInvalSnapshotMsg sn;
+	SharedInvalPubcacheMsg pc;
 } SharedInvalidationMessage;
 
 
diff --git a/src/include/utils/inval.h b/src/include/utils/inval.h
index 24695facf2..66d27b8bee 100644
--- a/src/include/utils/inval.h
+++ b/src/include/utils/inval.h
@@ -22,6 +22,7 @@ extern PGDLLIMPORT int debug_discard_caches;
 
 typedef void (*SyscacheCallbackFunction) (Datum arg, int cacheid, uint32 hashvalue);
 typedef void (*RelcacheCallbackFunction) (Datum arg, Oid relid);
+typedef void (*PubcacheCallbackFunction) (Datum arg, Oid pubid);
 
 
 extern void AcceptInvalidationMessages(void);
@@ -59,6 +60,9 @@ extern void CacheRegisterSyscacheCallback(int cacheid,
 extern void CacheRegisterRelcacheCallback(RelcacheCallbackFunction func,
 										  Datum arg);
 
+extern void CacheRegisterPubcacheCallback(PubcacheCallbackFunction func,
+										  Datum arg);
+
 extern void CallSyscacheCallbacks(int cacheid, uint32 hashvalue);
 
 extern void InvalidateSystemCaches(void);
-- 
2.34.1

#61Shlok Kyal
shlok.kyal.oss@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#59)
Re: long-standing data loss bug in initial sync of logical replication

On Mon, 30 Sept 2024 at 16:56, Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

2.
Similarly to the above, the relcache won't be invalidated when ALTER PUBLICATION
OWNER TO is executed. This means that privilege checks may be ignored if the entry
is still valid. Not sure, but is there a possibility this causes an inconsistency?

Hmm, IIUC, the attribute pubowner is not used for now. The paragraph
"There are currently no privileges on publications...." [1] may show the current
status. However, to keep the current behavior, I suggest invalidating the relcache
of the publication's relations when the owner is altered.

[1]: https://www.postgresql.org/docs/devel/logical-replication-security.html

I have shared the updated patch [1]. So now, when 'ALTER PUBLICATION ... OWNER TO ...'
is executed, the relcache is invalidated for that specific publication.

[1]: /messages/by-id/CANhcyEWEXL3rxvKH9-Xtx-DgGX0D62EktHpW+nG+MSSaMVUVig@mail.gmail.com

Thanks and Regards,
Shlok Kyal

#62Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Shlok Kyal (#60)
1 attachment(s)
RE: long-standing data loss bug in initial sync of logical replication

Dear Shlok-san,

Thanks for updating the patch. Here are comments.

1.
I feel the name of SnapBuildDistributeNewCatalogSnapshot() should be updated, because it
now distributes two things: the catalog snapshot and the invalidation messages. Do you
have a good one in mind? I considered "SnapBuildDistributeNewCatalogSnapshotAndInValidations"
or "SnapBuildDistributeItems", but neither seems good :-(.

2.
Hmm, it still seems like overengineering to me to add a new type of invalidation message
only for publications. Following the pattern of ExecRenameStmt(), we can implement a
dedicated rename function, as RenameConstraint() and RenameDatabase() do.
Regarding ALTER PUBLICATION ... OWNER TO, I feel adding CacheInvalidateRelcacheAll()
and InvalidatePublicationRels() is enough.

I attached a PoC which implements the above. It passes the tests in my environment.
Could you please take a look and tell me what you think?

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

add_invalidations.diffs (application/octet-stream)
diff --git a/src/backend/commands/alter.c b/src/backend/commands/alter.c
index 4f99ebb447..395fe530b3 100644
--- a/src/backend/commands/alter.c
+++ b/src/backend/commands/alter.c
@@ -399,6 +399,9 @@ ExecRenameStmt(RenameStmt *stmt)
 		case OBJECT_TYPE:
 			return RenameType(stmt);
 
+		case OBJECT_PUBLICATION:
+			return RenamePublication(stmt->subname, stmt->newname);
+
 		case OBJECT_AGGREGATE:
 		case OBJECT_COLLATION:
 		case OBJECT_CONVERSION:
@@ -416,7 +419,6 @@ ExecRenameStmt(RenameStmt *stmt)
 		case OBJECT_TSDICTIONARY:
 		case OBJECT_TSPARSER:
 		case OBJECT_TSTEMPLATE:
-		case OBJECT_PUBLICATION:
 		case OBJECT_SUBSCRIPTION:
 			{
 				ObjectAddress address;
diff --git a/src/backend/commands/publicationcmds.c b/src/backend/commands/publicationcmds.c
index d6ffef374e..86b270d1cf 100644
--- a/src/backend/commands/publicationcmds.c
+++ b/src/backend/commands/publicationcmds.c
@@ -433,6 +433,85 @@ pub_collist_contains_invalid_column(Oid pubid, Relation relation, List *ancestor
 	return result;
 }
 
+/*
+ * Execute ALTER PUBLICATION RENAME
+ */
+ObjectAddress
+RenamePublication(const char *oldname, const char *newname)
+{
+	Relation			rel;
+	HeapTuple			tup;
+	ObjectAddress		address;
+	Form_pg_publication	pubform;
+	bool				replaces[Natts_pg_publication];
+	bool				nulls[Natts_pg_publication];
+	Datum				values[Natts_pg_publication];
+
+	rel = table_open(PublicationRelationId, RowExclusiveLock);
+
+	tup = SearchSysCacheCopy1(PUBLICATIONNAME,
+							  CStringGetDatum(oldname));
+
+	if (!HeapTupleIsValid(tup))
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_OBJECT),
+				 errmsg("publication \"%s\" does not exist",
+						oldname)));
+
+	pubform = (Form_pg_publication) GETSTRUCT(tup);
+
+	/* must be owner */
+	if (!object_ownercheck(PublicationRelationId, pubform->oid, GetUserId()))
+		aclcheck_error(ACLCHECK_NOT_OWNER, OBJECT_PUBLICATION,
+					   NameStr(pubform->pubname));
+
+	/* Everything ok, form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* Only update the pubname */
+	values[Anum_pg_publication_pubname - 1] =
+		DirectFunctionCall1(namein, CStringGetDatum(newname));
+	replaces[Anum_pg_publication_pubname - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel), values, nulls,
+							replaces);
+
+	/* Invalidate the relcache. */
+	if (pubform->puballtables)
+	{
+		CacheInvalidateRelcacheAll();
+	}
+	else
+	{
+		List	   *relids = NIL;
+		List	   *schemarelids = NIL;
+
+		/*
+		 * XXX: all tables in the tree is listed now, but this may be too much.
+		 */
+		relids = GetPublicationRelations(pubform->oid,
+										 PUBLICATION_PART_ALL);
+		schemarelids = GetAllSchemaPublicationRelations(pubform->oid,
+														PUBLICATION_PART_ALL);
+
+		relids = list_concat_unique_oid(relids, schemarelids);
+
+		InvalidatePublicationRels(relids);
+	}
+
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	ObjectAddressSet(address, PublicationRelationId, pubform->oid);
+
+	heap_freetuple(tup);
+
+	table_close(rel, RowExclusiveLock);
+
+	return address;
+}
+
 /* check_functions_in_node callback */
 static bool
 contain_mutable_or_user_functions_checker(Oid func_id, void *context)
@@ -1920,6 +1999,30 @@ AlterPublicationOwner_internal(Relation rel, HeapTuple tup, Oid newOwnerId)
 	}
 
 	form->pubowner = newOwnerId;
+
+	/* Invalidate the relcache. */
+	if (form->puballtables)
+	{
+		CacheInvalidateRelcacheAll();
+	}
+	else
+	{
+		List	   *relids = NIL;
+		List	   *schemarelids = NIL;
+
+		/*
+		 * XXX: all tables in the tree is listed now, but this may be too much.
+		 */
+		relids = GetPublicationRelations(form->oid,
+										 PUBLICATION_PART_ALL);
+		schemarelids = GetAllSchemaPublicationRelations(form->oid,
+														PUBLICATION_PART_ALL);
+
+		relids = list_concat_unique_oid(relids, schemarelids);
+
+		InvalidatePublicationRels(relids);
+	}
+
 	CatalogTupleUpdate(rel, &tup->t_self, tup);
 
 	/* Update owner dependency reference */
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 4aa8646af7..ec10bfdd8c 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -9466,7 +9466,7 @@ RenameStmt: ALTER AGGREGATE aggregate_with_argtypes RENAME TO name
 					RenameStmt *n = makeNode(RenameStmt);
 
 					n->renameType = OBJECT_PUBLICATION;
-					n->object = (Node *) makeString($3);
+					n->subname = $3;
 					n->newname = $6;
 					n->missing_ok = false;
 					$$ = (Node *) n;
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 1d80d27d0f..b280532a3a 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -218,7 +218,6 @@ static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data,
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
-static void rel_sync_cache_publicationrel_cb(Datum arg, Oid pubid);
 static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
 											TransactionId xid);
 static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
@@ -1917,8 +1916,6 @@ init_rel_sync_cache(MemoryContext cachectx)
 								  rel_sync_cache_publication_cb,
 								  (Datum) 0);
 
-	CacheRegisterPubcacheCallback(rel_sync_cache_publicationrel_cb, (Datum) 0);
-
 	relation_callbacks_registered = true;
 }
 
@@ -2317,29 +2314,6 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	}
 }
 
-/*
- * Publication invalidation callback
- */
-static void
-rel_sync_cache_publicationrel_cb(Datum arg, Oid pubid)
-{
-	HASH_SEQ_STATUS status;
-	RelationSyncEntry *entry;
-
-	if (RelationSyncCache == NULL)
-		return;
-
-	hash_seq_init(&status, RelationSyncCache);
-	while ((entry = (RelationSyncEntry *) hash_seq_search(&status)) != NULL)
-	{
-		if (entry->replicate_valid && list_member_oid(entry->pub_ids, pubid))
-		{
-			entry->replicate_valid = false;
-			entry->pub_ids = NIL;
-		}
-	}
-}
-
 /*
  * Publication relation/schema map syscache invalidation callback
  *
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index a34be79ee6..53f00dfd13 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -160,9 +160,6 @@
  */
 #define CatCacheMsgs 0
 #define RelCacheMsgs 1
-#define PubCacheMsgs 2
-
-#define NumberofCache 3
 
 /* Pointers to main arrays in TopTransactionContext */
 typedef struct InvalMessageArray
@@ -171,13 +168,13 @@ typedef struct InvalMessageArray
 	int			maxmsgs;		/* current allocated size of array */
 } InvalMessageArray;
 
-static InvalMessageArray InvalMessageArrays[NumberofCache];
+static InvalMessageArray InvalMessageArrays[2];
 
 /* Control information for one logical group of messages */
 typedef struct InvalidationMsgsGroup
 {
-	int			firstmsg[NumberofCache];	/* first index in relevant array */
-	int			nextmsg[NumberofCache];		/* last+1 index */
+	int			firstmsg[2];	/* first index in relevant array */
+	int			nextmsg[2];		/* last+1 index */
 } InvalidationMsgsGroup;
 
 /* Macros to help preserve InvalidationMsgsGroup abstraction */
@@ -192,7 +189,6 @@ typedef struct InvalidationMsgsGroup
 	do { \
 		SetSubGroupToFollow(targetgroup, priorgroup, CatCacheMsgs); \
 		SetSubGroupToFollow(targetgroup, priorgroup, RelCacheMsgs); \
-		SetSubGroupToFollow(targetgroup, priorgroup, PubCacheMsgs); \
 	} while (0)
 
 #define NumMessagesInSubGroup(group, subgroup) \
@@ -200,9 +196,7 @@ typedef struct InvalidationMsgsGroup
 
 #define NumMessagesInGroup(group) \
 	(NumMessagesInSubGroup(group, CatCacheMsgs) + \
-	 NumMessagesInSubGroup(group, RelCacheMsgs) + \
-	 NumMessagesInSubGroup(group, PubCacheMsgs))
-
+	 NumMessagesInSubGroup(group, RelCacheMsgs))
 
 /*----------------
  * Invalidation messages are divided into two groups:
@@ -256,7 +250,6 @@ int			debug_discard_caches = 0;
 
 #define MAX_SYSCACHE_CALLBACKS 64
 #define MAX_RELCACHE_CALLBACKS 10
-#define MAX_PUBCACHE_CALLBACKS 10
 
 static struct SYSCACHECALLBACK
 {
@@ -278,14 +271,6 @@ static struct RELCACHECALLBACK
 
 static int	relcache_callback_count = 0;
 
-static struct PUBCACHECALLBACK
-{
-	PubcacheCallbackFunction function;
-	Datum		arg;
-}			pubcache_callback_list[MAX_PUBCACHE_CALLBACKS];
-
-static int	pubcache_callback_count = 0;
-
 /* ----------------------------------------------------------------
  *				Invalidation subgroup support functions
  * ----------------------------------------------------------------
@@ -478,38 +463,6 @@ AddRelcacheInvalidationMessage(InvalidationMsgsGroup *group,
 	AddInvalidationMessage(group, RelCacheMsgs, &msg);
 }
 
-/*
- * Add a publication inval entry
- */
-static void
-AddPubcacheInvalidationMessage(InvalidationMsgsGroup *group,
-							   Oid dbId, Oid pubId)
-{
-	SharedInvalidationMessage msg;
-
-	/*
-	 * Don't add a duplicate item. We assume dbId need not be checked because
-	 * it will never change. InvalidOid for relId means all relations so we
-	 * don't need to add individual ones when it is present.
-	 */
-
-	ProcessMessageSubGroup(group, PubCacheMsgs,
-						   if (msg->pc.id == SHAREDINVALPUBCACHE_ID &&
-							   (msg->pc.pubId == pubId ||
-								msg->pc.pubId == InvalidOid))
-						   return);
-
-
-	/* OK, add the item */
-	msg.pc.id = SHAREDINVALPUBCACHE_ID;
-	msg.pc.dbId = dbId;
-	msg.pc.pubId = pubId;
-	/* check AddCatcacheInvalidationMessage() for an explanation */
-	VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg));
-
-	AddInvalidationMessage(group, PubCacheMsgs, &msg);
-}
-
 /*
  * Add a snapshot inval entry
  *
@@ -548,7 +501,6 @@ AppendInvalidationMessages(InvalidationMsgsGroup *dest,
 {
 	AppendInvalidationMessageSubGroup(dest, src, CatCacheMsgs);
 	AppendInvalidationMessageSubGroup(dest, src, RelCacheMsgs);
-	AppendInvalidationMessageSubGroup(dest, src, PubCacheMsgs);
 }
 
 /*
@@ -563,7 +515,6 @@ ProcessInvalidationMessages(InvalidationMsgsGroup *group,
 {
 	ProcessMessageSubGroup(group, CatCacheMsgs, func(msg));
 	ProcessMessageSubGroup(group, RelCacheMsgs, func(msg));
-	ProcessMessageSubGroup(group, PubCacheMsgs, func(msg));
 }
 
 /*
@@ -576,7 +527,6 @@ ProcessInvalidationMessagesMulti(InvalidationMsgsGroup *group,
 {
 	ProcessMessageSubGroupMulti(group, CatCacheMsgs, func(msgs, n));
 	ProcessMessageSubGroupMulti(group, RelCacheMsgs, func(msgs, n));
-	ProcessMessageSubGroupMulti(group, PubCacheMsgs, func(msgs, n));
 }
 
 /* ----------------------------------------------------------------
@@ -639,18 +589,6 @@ RegisterRelcacheInvalidation(Oid dbId, Oid relId)
 		transInvalInfo->RelcacheInitFileInval = true;
 }
 
-/*
- * RegisterPubcacheInvalidation
- *
- * As above, but register a publication invalidation event.
- */
-static void
-RegisterPubcacheInvalidation(Oid dbId, Oid pubId)
-{
-	AddPubcacheInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
-								   dbId, pubId);
-}
-
 /*
  * RegisterSnapshotInvalidation
  *
@@ -721,8 +659,6 @@ PrepareInvalidationState(void)
 		InvalMessageArrays[CatCacheMsgs].maxmsgs = 0;
 		InvalMessageArrays[RelCacheMsgs].msgs = NULL;
 		InvalMessageArrays[RelCacheMsgs].maxmsgs = 0;
-		InvalMessageArrays[PubCacheMsgs].msgs = NULL;
-		InvalMessageArrays[PubCacheMsgs].maxmsgs = 0;
 	}
 
 	transInvalInfo = myInfo;
@@ -836,20 +772,6 @@ LocalExecuteInvalidationMessage(SharedInvalidationMessage *msg)
 		else if (msg->sn.dbId == MyDatabaseId)
 			InvalidateCatalogSnapshot();
 	}
-	else if (msg->id == SHAREDINVALPUBCACHE_ID)
-	{
-		if (msg->pc.dbId == MyDatabaseId || msg->pc.dbId == InvalidOid)
-		{
-			int			i;
-
-			for (i = 0; i < pubcache_callback_count; i++)
-			{
-				struct PUBCACHECALLBACK *pcitem = pubcache_callback_list + i;
-
-				pcitem->function(pcitem->arg, msg->pc.pubId);
-			}
-		}
-	}
 	else
 		elog(FATAL, "unrecognized SI message ID: %d", msg->id);
 }
@@ -1021,18 +943,6 @@ xactGetCommittedInvalidationMessages(SharedInvalidationMessage **msgs,
 										msgs,
 										n * sizeof(SharedInvalidationMessage)),
 								 nmsgs += n));
-	ProcessMessageSubGroupMulti(&transInvalInfo->PriorCmdInvalidMsgs,
-								PubCacheMsgs,
-								(memcpy(msgarray + nmsgs,
-										msgs,
-										n * sizeof(SharedInvalidationMessage)),
-								 nmsgs += n));
-	ProcessMessageSubGroupMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
-								PubCacheMsgs,
-								(memcpy(msgarray + nmsgs,
-										msgs,
-										n * sizeof(SharedInvalidationMessage)),
-								 nmsgs += n));
 	Assert(nmsgs == nummsgs);
 
 	return nmsgs;
@@ -1401,17 +1311,6 @@ CacheInvalidateHeapTuple(Relation relation,
 		else
 			return;
 	}
-	else if (tupleRelId == PublicationRelationId)
-	{
-		Form_pg_publication pubtup = (Form_pg_publication) GETSTRUCT(tuple);
-
-		/* get publication id */
-		relationId = pubtup->oid;
-		databaseId = MyDatabaseId;
-
-		RegisterPubcacheInvalidation(databaseId, relationId);
-		return;
-	}
 	else
 		return;
 
@@ -1667,25 +1566,6 @@ CacheRegisterRelcacheCallback(RelcacheCallbackFunction func,
 	++relcache_callback_count;
 }
 
-/*
- * CacheRegisterPubcacheCallback
- *		Register the specified function to be called for all future
- *		publication invalidation events.  The OID of the publication being
- *		invalidated will be passed to the function.
- */
-void
-CacheRegisterPubcacheCallback(PubcacheCallbackFunction func,
-							  Datum arg)
-{
-	if (pubcache_callback_count >= MAX_PUBCACHE_CALLBACKS)
-		elog(FATAL, "out of pubcache_callback_list slots");
-
-	pubcache_callback_list[pubcache_callback_count].function = func;
-	pubcache_callback_list[pubcache_callback_count].arg = arg;
-
-	++pubcache_callback_count;
-}
-
 /*
  * CallSyscacheCallbacks
  *
@@ -1748,9 +1628,6 @@ LogLogicalInvalidations(void)
 		ProcessMessageSubGroupMulti(group, RelCacheMsgs,
 									XLogRegisterData((char *) msgs,
 													 n * sizeof(SharedInvalidationMessage)));
-		ProcessMessageSubGroupMulti(group, PubCacheMsgs,
-									XLogRegisterData((char *) msgs,
-													 n * sizeof(SharedInvalidationMessage)));
 		XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
 	}
 }
diff --git a/src/include/commands/publicationcmds.h b/src/include/commands/publicationcmds.h
index 5487c571f6..b953193812 100644
--- a/src/include/commands/publicationcmds.h
+++ b/src/include/commands/publicationcmds.h
@@ -35,5 +35,6 @@ extern bool pub_rf_contains_invalid_column(Oid pubid, Relation relation,
 										   List *ancestors, bool pubviaroot);
 extern bool pub_collist_contains_invalid_column(Oid pubid, Relation relation,
 												List *ancestors, bool pubviaroot);
+extern ObjectAddress RenamePublication(const char *oldname, const char *newname);
 
 #endif							/* PUBLICATIONCMDS_H */
diff --git a/src/include/storage/sinval.h b/src/include/storage/sinval.h
index 9a97268b0a..8f5744b21b 100644
--- a/src/include/storage/sinval.h
+++ b/src/include/storage/sinval.h
@@ -110,14 +110,6 @@ typedef struct
 	Oid			relId;			/* relation ID */
 } SharedInvalSnapshotMsg;
 
-#define SHAREDINVALPUBCACHE_ID	(-6)
-typedef struct
-{
-	int8		id;				/* type field --- must be first */
-	Oid			dbId;			/* database ID, or 0 if a shared relation */
-	Oid			pubId;			/* publication ID */
-} SharedInvalPubcacheMsg;
-
 typedef union
 {
 	int8		id;				/* type field --- must be first */
@@ -127,7 +119,6 @@ typedef union
 	SharedInvalSmgrMsg sm;
 	SharedInvalRelmapMsg rm;
 	SharedInvalSnapshotMsg sn;
-	SharedInvalPubcacheMsg pc;
 } SharedInvalidationMessage;
 
 
diff --git a/src/include/utils/inval.h b/src/include/utils/inval.h
index 66d27b8bee..24695facf2 100644
--- a/src/include/utils/inval.h
+++ b/src/include/utils/inval.h
@@ -22,7 +22,6 @@ extern PGDLLIMPORT int debug_discard_caches;
 
 typedef void (*SyscacheCallbackFunction) (Datum arg, int cacheid, uint32 hashvalue);
 typedef void (*RelcacheCallbackFunction) (Datum arg, Oid relid);
-typedef void (*PubcacheCallbackFunction) (Datum arg, Oid pubid);
 
 
 extern void AcceptInvalidationMessages(void);
@@ -60,9 +59,6 @@ extern void CacheRegisterSyscacheCallback(int cacheid,
 extern void CacheRegisterRelcacheCallback(RelcacheCallbackFunction func,
 										  Datum arg);
 
-extern void CacheRegisterPubcacheCallback(PubcacheCallbackFunction func,
-										  Datum arg);
-
 extern void CallSyscacheCallbacks(int cacheid, uint32 hashvalue);
 
 extern void InvalidateSystemCaches(void);
#63Shlok Kyal
shlok.kyal.oss@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#62)
2 attachment(s)
Re: long-standing data loss bug in initial sync of logical replication

Hi Kuroda-san,

Thanks for reviewing the patch.

1.
I feel the name of SnapBuildDistributeNewCatalogSnapshot() should be updated, because it
now distributes two things: the catalog snapshot and the invalidation messages. Do you
have a good one in mind? I considered "SnapBuildDistributeNewCatalogSnapshotAndInValidations"
or "SnapBuildDistributeItems", but neither seems good :-(.

I have renamed the function to 'SnapBuildDistributeSnapshotAndInval'. Thoughts?

2.
Hmm, it still seems like overengineering to me to add a new type of invalidation message
only for publications. Following the pattern of ExecRenameStmt(), we can implement a
dedicated rename function, as RenameConstraint() and RenameDatabase() do.
Regarding ALTER PUBLICATION ... OWNER TO, I feel adding CacheInvalidateRelcacheAll()
and InvalidatePublicationRels() is enough.

I agree with you.

I attached a PoC which implements the above. It passes the tests in my environment.
Could you please take a look and tell me what you think?

I have tested the PoC and it works as expected. The changes look fine to me, so I
have created a patch based on it.
Currently, we pass 'PUBLICATION_PART_ALL' as an argument to the functions
'GetPublicationRelations' and 'GetAllSchemaPublicationRelations'. We need to check
whether we can use 'PUBLICATION_PART_ROOT' or 'PUBLICATION_PART_LEAF' instead,
depending on the 'publish_via_partition_root' option. I will test and address this
in the next version of the patch; for now, I have added a TODO.
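
Just to sketch the direction I have in mind for that TODO (untested, not part of the
attached patch; it assumes pubform->pubviaroot is in scope as in the RenamePublication()
hunk, and whether ROOT/LEAF is really sufficient here is exactly what still needs to
be verified):

	/* Hypothetical sketch: choose the partition lookup mode from pubviaroot. */
	PublicationPartOpt part_opt;

	part_opt = pubform->pubviaroot ? PUBLICATION_PART_ROOT
								   : PUBLICATION_PART_LEAF;

	relids = GetPublicationRelations(pubform->oid, part_opt);
	schemarelids = GetAllSchemaPublicationRelations(pubform->oid, part_opt);

	relids = list_concat_unique_oid(relids, schemarelids);
	InvalidatePublicationRels(relids);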

Thanks and Regards,
Shlok Kyal

Attachments:

v12-0001-Distribute-invalidatons-if-change-in-catalog-tab.patch (application/octet-stream)
From 2969ff336fbd1bd764770a9e8314fcb62a6f1eab Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Fri, 23 Aug 2024 14:02:20 +0530
Subject: [PATCH v12 1/2] Distribute invalidatons if change in catalog tables

Distribute invalidations to inprogress transactions if the current
committed transaction change any catalog table.
---
 .../replication/logical/reorderbuffer.c       |   5 +-
 src/backend/replication/logical/snapbuild.c   |  34 +-
 src/include/replication/reorderbuffer.h       |   4 +
 src/test/subscription/t/100_bugs.pl           | 291 ++++++++++++++++++
 4 files changed, 320 insertions(+), 14 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 22bcf171ff..c5dfc1ab06 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -221,9 +221,6 @@ int			debug_logical_replication_streaming = DEBUG_LOGICAL_REP_STREAMING_BUFFERED
  */
 static ReorderBufferTXN *ReorderBufferGetTXN(ReorderBuffer *rb);
 static void ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static ReorderBufferTXN *ReorderBufferTXNByXid(ReorderBuffer *rb,
-											   TransactionId xid, bool create, bool *is_new,
-											   XLogRecPtr lsn, bool create_as_top);
 static void ReorderBufferTransferSnapToParent(ReorderBufferTXN *txn,
 											  ReorderBufferTXN *subtxn);
 
@@ -622,7 +619,7 @@ ReorderBufferReturnRelids(ReorderBuffer *rb, Oid *relids)
  * (with the given LSN, and as top transaction if that's specified);
  * when this happens, is_new is set to true.
  */
-static ReorderBufferTXN *
+ReorderBufferTXN *
 ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 					  bool *is_new, XLogRecPtr lsn, bool create_as_top)
 {
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 0450f94ba8..1f7c24cad0 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -300,7 +300,7 @@ static void SnapBuildFreeSnapshot(Snapshot snap);
 
 static void SnapBuildSnapIncRefcount(Snapshot snap);
 
-static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
+static void SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid);
 
 static inline bool SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid,
 												 uint32 xinfo);
@@ -859,18 +859,21 @@ SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
 }
 
 /*
- * Add a new Snapshot to all transactions we're decoding that currently are
- * in-progress so they can see new catalog contents made by the transaction
- * that just committed. This is necessary because those in-progress
- * transactions will use the new catalog's contents from here on (at the very
- * least everything they do needs to be compatible with newer catalog
- * contents).
+ * Add a new Snapshot and invalidation messages to all transactions we're
+ * decoding that currently are in-progress so they can see new catalog contents
+ * made by the transaction that just committed. This is necessary because those
+ * in-progress transactions will use the new catalog's contents from here on
+ * (at the very least everything they do needs to be compatible with newer
+ * catalog contents).
  */
 static void
-SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
+SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid)
 {
 	dlist_iter	txn_i;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *curr_txn;
+
+	curr_txn = ReorderBufferTXNByXid(builder->reorder, xid, false, NULL, InvalidXLogRecPtr, false);
 
 	/*
 	 * Iterate through all toplevel transactions. This can include
@@ -913,6 +916,14 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		SnapBuildSnapIncRefcount(builder->snapshot);
 		ReorderBufferAddSnapshot(builder->reorder, txn->xid, lsn,
 								 builder->snapshot);
+
+		/*
+		 * Add invalidation messages to the reorder buffer of inprogress
+		 * transactions except the current committed transaction
+		 */
+		if (txn->xid != xid && curr_txn->ninvalidations > 0)
+			ReorderBufferAddInvalidations(builder->reorder, txn->xid, lsn,
+										  curr_txn->ninvalidations, curr_txn->invalidations);
 	}
 }
 
@@ -1184,8 +1195,11 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/* refcount of the snapshot builder for the new snapshot */
 		SnapBuildSnapIncRefcount(builder->snapshot);
 
-		/* add a new catalog snapshot to all currently running transactions */
-		SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
+		/*
+		 * add a new catalog snapshot and invalidations messages to all
+		 * currently running transactions
+		 */
+		SnapBuildDistributeSnapshotAndInval(builder, lsn, xid);
 	}
 }
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index e332635f70..093d21213a 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -743,6 +743,10 @@ extern TransactionId *ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb);
 
 extern void ReorderBufferSetRestartPoint(ReorderBuffer *rb, XLogRecPtr ptr);
 
+extern ReorderBufferTXN *ReorderBufferTXNByXid(ReorderBuffer *rb,
+											   TransactionId xid, bool create, bool *is_new,
+											   XLogRecPtr lsn, bool create_as_top);
+
 extern void StartupReorderBuffer(void);
 
 #endif
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index cb36ca7b16..bdabe53e42 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -487,6 +487,297 @@ $result =
 is( $result, qq(2|f
 3|t), 'check replicated update on subscriber');
 
+# Clean up
+$node_publisher->safe_psql('postgres', qq(DROP  PUBLICATION pub1;));
+$node_subscriber->safe_psql('postgres', qq(DROP  SUBSCRIPTION sub1;));
+
+# The bug was that the incremental data synchronization was being skipped when
+# a new table is added to the publication in presence of a concurrent active
+# transaction performing the DML on the same table.
+
+# Initial setup.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE SCHEMA sch3;
+	CREATE TABLE sch3.tab_conc(a int);
+	CREATE PUBLICATION regress_pub1;
+));
+
+$node_subscriber->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE SCHEMA sch3;
+	CREATE TABLE sch3.tab_conc(a int);
+	CREATE SUBSCRIPTION regress_sub1 CONNECTION '$publisher_connstr' PUBLICATION regress_pub1;
+));
+
+# Bump the query timeout to avoid false negatives on slow test systems.
+my $psql_timeout_secs = 4 * $PostgreSQL::Test::Utils::timeout_default;
+
+# Initiate 3 background sessions.
+my $background_psql1 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+$background_psql1->set_query_timer_restart();
+
+my $background_psql2 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+
+$background_psql2->set_query_timer_restart();
+
+my $background_psql3 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+$background_psql3->set_query_timer_restart();
+
+# Maintain an active transaction with the table that will be added to the
+# publication.
+$background_psql1->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO tab_conc VALUES (1);
+]);
+
+# Maintain an active transaction with a schema table that will be added to the
+# publication.
+$background_psql2->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (1);
+]);
+
+# Add the table to the publication using background_psql, as the alter
+# publication operation will distribute the invalidations to inprogress txns.
+$background_psql3->query_safe(
+	qq[
+	ALTER PUBLICATION regress_pub1 ADD TABLE tab_conc, TABLES IN SCHEMA sch3;
+]);
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe(qq[COMMIT]);
+$background_psql2->query_safe(qq[COMMIT]);
+
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (2);
+	INSERT INTO sch3.tab_conc VALUES (2);
+));
+
+# Refresh the publication.
+$node_subscriber->safe_psql('postgres',
+	'ALTER SUBSCRIPTION regress_sub1 REFRESH PUBLICATION');
+
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'regress_sub1');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2),
+	'Ensure that the data from the tab_conc table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2),
+	'Ensure that the data from the sch3.tab_conc table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+# Perform an insert.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (3);
+	INSERT INTO sch3.tab_conc VALUES (3);
+));
+$node_publisher->wait_for_catchup('regress_sub1');
+
+# Verify that the insert is replicated to the subscriber.
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table sch3.tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+# The bug was that the incremental data synchronization was happening even when
+# tables are dropped from the publication in presence of a concurrent active
+# transaction performing the DML on the same table.
+
+# Maintain an active transaction with the table that will be dropped from the
+# publication.
+$background_psql1->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO tab_conc VALUES (4);
+]);
+
+# Maintain an active transaction with a schema table that will be dropped from the
+# publication.
+$background_psql2->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (4);
+]);
+
+# Drop the table from the publication using background_psql, as the alter
+# publication operation will distribute the invalidations to inprogress txns.
+$background_psql3->query_safe(
+	qq[
+	ALTER PUBLICATION regress_pub1 DROP TABLE tab_conc, TABLES IN SCHEMA sch3;
+]);
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe(qq[COMMIT]);
+$background_psql2->query_safe(qq[COMMIT]);
+
+# Perform an insert.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (5);
+	INSERT INTO sch3.tab_conc VALUES (5);
+));
+
+$node_publisher->wait_for_catchup('regress_sub1');
+
+# Verify that the insert is not replicated to the subscriber.
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3
+4),
+	'Verify that data for table tab_conc are not replicated to subscriber');
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2
+3
+4),
+	'Verify that the incremental data for table sch3.tab_conc are not replicated to subscriber'
+);
+
+# The bug was that the incremental data synchronization was happening even after
+# publication is dropped in a concurrent active transaction.
+
+# Add tables to the publication.
+$background_psql3->query_safe(
+	qq[
+	ALTER PUBLICATION regress_pub1 ADD TABLE tab_conc, TABLES IN SCHEMA sch3;
+]);
+
+# Maintain an active transaction with the table.
+$background_psql1->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO tab_conc VALUES (6);
+]);
+
+# Maintain an active transaction with a schema table.
+$background_psql2->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (6);
+]);
+
+# Drop publication.
+$background_psql3->query_safe(
+	qq[
+	DROP PUBLICATION regress_pub1;
+]);
+
+# Perform an insert.
+$background_psql1->query_safe(
+	qq[
+	INSERT INTO tab_conc VALUES (7);
+]);
+
+$background_psql2->query_safe(
+	qq[
+	INSERT INTO sch3.tab_conc VALUES (7);
+]);
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe(qq[COMMIT]);
+$background_psql2->query_safe(qq[COMMIT]);
+
+# ERROR should appear on subscriber.
+my $offset = -s $node_subscriber->logfile;
+$node_subscriber->wait_for_log(
+	qr/ERROR:  publication "regress_pub1" does not exist/, $offset);
+
+$node_subscriber->safe_psql('postgres', 'DROP SUBSCRIPTION regress_sub1;');
+
+# The bug was that the incremental data synchronization was happening even after
+# publication is renamed in a concurrent active transaction.
+
+# Create publication.
+$background_psql3->query_safe(
+	qq[
+	CREATE PUBLICATION regress_pub1 FOR TABLE tab_conc, TABLES IN SCHEMA sch3;
+]);
+
+# Create subscription.
+$node_subscriber->safe_psql(
+	'postgres', qq(
+	CREATE SUBSCRIPTION regress_sub1 CONNECTION '$publisher_connstr' PUBLICATION regress_pub1;
+));
+
+# Maintain an active transaction with the table.
+$background_psql1->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO tab_conc VALUES (8);
+]);
+
+# Maintain an active transaction with a schema table.
+$background_psql2->query_safe(
+	qq[
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (8);
+]);
+
+# Rename publication.
+$background_psql3->query_safe(
+	qq[
+	ALTER PUBLICATION regress_pub1 RENAME TO regress_pub1_rename;
+]);
+
+# Perform an insert.
+$background_psql1->query_safe(
+	qq[
+	INSERT INTO tab_conc VALUES (9);
+]);
+
+$background_psql2->query_safe(
+	qq[
+	INSERT INTO sch3.tab_conc VALUES (9);
+]);
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe(qq[COMMIT]);
+$background_psql2->query_safe(qq[COMMIT]);
+
+# ERROR should appear on subscriber.
+$offset = -s $node_subscriber->logfile;
+$node_subscriber->wait_for_log(
+	qr/ERROR:  publication "regress_pub1" does not exist/, $offset);
+
+$background_psql1->quit;
+$background_psql2->quit;
+$background_psql3->quit;
+
 $node_publisher->stop('fast');
 $node_subscriber->stop('fast');
 
-- 
2.34.1

v12-0002-Selective-Invalidation-of-Cache.patch (application/octet-stream)
From edd4564e5750623b2f965da0ed5130cc09ad5388 Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Fri, 4 Oct 2024 12:25:31 +0530
Subject: [PATCH v12 2/2] Selective Invalidation of Cache

When we alter a publication (add/drop a namespace to/from the
publication), the cache entries for all tables are invalidated.
With this patch, for the above operations we invalidate the
cache entries of only the affected tables.
---
 src/backend/commands/alter.c                |   4 +-
 src/backend/commands/publicationcmds.c      | 105 ++++++++++++++++++++
 src/backend/parser/gram.y                   |   2 +-
 src/backend/replication/pgoutput/pgoutput.c |  18 ----
 src/include/commands/publicationcmds.h      |   1 +
 5 files changed, 110 insertions(+), 20 deletions(-)

diff --git a/src/backend/commands/alter.c b/src/backend/commands/alter.c
index 4f99ebb447..395fe530b3 100644
--- a/src/backend/commands/alter.c
+++ b/src/backend/commands/alter.c
@@ -399,6 +399,9 @@ ExecRenameStmt(RenameStmt *stmt)
 		case OBJECT_TYPE:
 			return RenameType(stmt);
 
+		case OBJECT_PUBLICATION:
+			return RenamePublication(stmt->subname, stmt->newname);
+
 		case OBJECT_AGGREGATE:
 		case OBJECT_COLLATION:
 		case OBJECT_CONVERSION:
@@ -416,7 +419,6 @@ ExecRenameStmt(RenameStmt *stmt)
 		case OBJECT_TSDICTIONARY:
 		case OBJECT_TSPARSER:
 		case OBJECT_TSTEMPLATE:
-		case OBJECT_PUBLICATION:
 		case OBJECT_SUBSCRIPTION:
 			{
 				ObjectAddress address;
diff --git a/src/backend/commands/publicationcmds.c b/src/backend/commands/publicationcmds.c
index d6ffef374e..ae06aa254b 100644
--- a/src/backend/commands/publicationcmds.c
+++ b/src/backend/commands/publicationcmds.c
@@ -433,6 +433,86 @@ pub_collist_contains_invalid_column(Oid pubid, Relation relation, List *ancestor
 	return result;
 }
 
+/*
+ * Execute ALTER PUBLICATION RENAME
+ */
+ObjectAddress
+RenamePublication(const char *oldname, const char *newname)
+{
+	Relation			rel;
+	HeapTuple			tup;
+	ObjectAddress		address;
+	Form_pg_publication	pubform;
+	bool				replaces[Natts_pg_publication];
+	bool				nulls[Natts_pg_publication];
+	Datum				values[Natts_pg_publication];
+
+	rel = table_open(PublicationRelationId, RowExclusiveLock);
+
+	tup = SearchSysCacheCopy1(PUBLICATIONNAME,
+							  CStringGetDatum(oldname));
+
+	if (!HeapTupleIsValid(tup))
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_OBJECT),
+				 errmsg("publication \"%s\" does not exist",
+						oldname)));
+
+	pubform = (Form_pg_publication) GETSTRUCT(tup);
+
+	/* must be owner */
+	if (!object_ownercheck(PublicationRelationId, pubform->oid, GetUserId()))
+		aclcheck_error(ACLCHECK_NOT_OWNER, OBJECT_PUBLICATION,
+					   NameStr(pubform->pubname));
+
+	/* Everything ok, form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* Only update the pubname */
+	values[Anum_pg_publication_pubname - 1] =
+		DirectFunctionCall1(namein, CStringGetDatum(newname));
+	replaces[Anum_pg_publication_pubname - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel), values, nulls,
+							replaces);
+
+	/* Invalidate the relcache. */
+	if (pubform->puballtables)
+	{
+		CacheInvalidateRelcacheAll();
+	}
+	else
+	{
+		List	   *relids = NIL;
+		List	   *schemarelids = NIL;
+
+		/*
+		 * XXX: Should we pass different pub_partops depending on
+		 * 'publish_via_partition_root'
+		 */
+		relids = GetPublicationRelations(pubform->oid,
+										 PUBLICATION_PART_ALL);
+		schemarelids = GetAllSchemaPublicationRelations(pubform->oid,
+														PUBLICATION_PART_ALL);
+
+		relids = list_concat_unique_oid(relids, schemarelids);
+
+		InvalidatePublicationRels(relids);
+	}
+
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	ObjectAddressSet(address, PublicationRelationId, pubform->oid);
+
+	heap_freetuple(tup);
+
+	table_close(rel, RowExclusiveLock);
+
+	return address;
+}
+
 /* check_functions_in_node callback */
 static bool
 contain_mutable_or_user_functions_checker(Oid func_id, void *context)
@@ -1920,6 +2000,31 @@ AlterPublicationOwner_internal(Relation rel, HeapTuple tup, Oid newOwnerId)
 	}
 
 	form->pubowner = newOwnerId;
+
+	/* Invalidate the relcache. */
+	if (form->puballtables)
+	{
+		CacheInvalidateRelcacheAll();
+	}
+	else
+	{
+		List	   *relids = NIL;
+		List	   *schemarelids = NIL;
+
+		/*
+		 * XXX: Should we pass different pub_partops depending on
+		 * 'publish_via_partition_root'
+		 */
+		relids = GetPublicationRelations(form->oid,
+										 PUBLICATION_PART_ALL);
+		schemarelids = GetAllSchemaPublicationRelations(form->oid,
+														PUBLICATION_PART_ALL);
+
+		relids = list_concat_unique_oid(relids, schemarelids);
+
+		InvalidatePublicationRels(relids);
+	}
+
 	CatalogTupleUpdate(rel, &tup->t_self, tup);
 
 	/* Update owner dependency reference */
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 4aa8646af7..ec10bfdd8c 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -9466,7 +9466,7 @@ RenameStmt: ALTER AGGREGATE aggregate_with_argtypes RENAME TO name
 					RenameStmt *n = makeNode(RenameStmt);
 
 					n->renameType = OBJECT_PUBLICATION;
-					n->object = (Node *) makeString($3);
+					n->subname = $3;
 					n->newname = $6;
 					n->missing_ok = false;
 					$$ = (Node *) n;
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 00e7024563..b8429be8cf 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -1739,12 +1739,6 @@ static void
 publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 {
 	publications_valid = false;
-
-	/*
-	 * Also invalidate per-relation cache so that next time the filtering info
-	 * is checked it will be updated with the new publication settings.
-	 */
-	rel_sync_cache_publication_cb(arg, cacheid, hashvalue);
 }
 
 /*
@@ -1920,18 +1914,6 @@ init_rel_sync_cache(MemoryContext cachectx)
 								  rel_sync_cache_publication_cb,
 								  (Datum) 0);
 
-	/*
-	 * Flush all cache entries after any publication changes.  (We need no
-	 * callback entry for pg_publication, because publication_invalidation_cb
-	 * will take care of it.)
-	 */
-	CacheRegisterSyscacheCallback(PUBLICATIONRELMAP,
-								  rel_sync_cache_publication_cb,
-								  (Datum) 0);
-	CacheRegisterSyscacheCallback(PUBLICATIONNAMESPACEMAP,
-								  rel_sync_cache_publication_cb,
-								  (Datum) 0);
-
 	relation_callbacks_registered = true;
 }
 
diff --git a/src/include/commands/publicationcmds.h b/src/include/commands/publicationcmds.h
index 5487c571f6..b953193812 100644
--- a/src/include/commands/publicationcmds.h
+++ b/src/include/commands/publicationcmds.h
@@ -35,5 +35,6 @@ extern bool pub_rf_contains_invalid_column(Oid pubid, Relation relation,
 										   List *ancestors, bool pubviaroot);
 extern bool pub_collist_contains_invalid_column(Oid pubid, Relation relation,
 												List *ancestors, bool pubviaroot);
+extern ObjectAddress RenamePublication(const char *oldname, const char *newname);
 
 #endif							/* PUBLICATIONCMDS_H */
-- 
2.34.1

#64Shlok Kyal
shlok.kyal.oss@gmail.com
In reply to: Shlok Kyal (#63)
2 attachment(s)
Re: long-standing data loss bug in initial sync of logical replication

On Fri, 4 Oct 2024 at 12:52, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:

Hi Kuroda-san,

Thanks for reviewing the patch.

1.
I feel the name of SnapBuildDistributeNewCatalogSnapshot() should be updated because it
now distributes two things: the catalog snapshot and the invalidation messages. Do you
have a good name in mind? I considered "SnapBuildDistributeNewCatalogSnapshotAndInValidations"
or "SnapBuildDistributeItems", but neither seems good :-(.

I have renamed the function to 'SnapBuildDistributeSnapshotAndInval'. Thoughts?

2.
Hmm, still, it seems like overengineering to me to add a new type of invalidation
message only for publications. Following the pattern in ExecRenameStmt(), we can
implement a dedicated rename function, like RenameConstraint() and RenameDatabase().
Regarding ALTER PUBLICATION OWNER TO, I feel adding CacheInvalidateRelcacheAll()
and InvalidatePublicationRels() is enough.

I agree with you.
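
To illustrate the scope of that approach, here is a rough SQL sketch (the object
names are made up and this is not taken from the patches):

    CREATE TABLE t1 (a int);
    CREATE PUBLICATION p_sel FOR TABLE t1;

    -- With the selective approach, a rename or owner change should invalidate
    -- only the cached entries of the relations published by p_sel (here t1) ...
    ALTER PUBLICATION p_sel RENAME TO p_sel_new;
    ALTER PUBLICATION p_sel_new OWNER TO CURRENT_USER;

    -- ... while a FOR ALL TABLES publication still needs the broad
    -- CacheInvalidateRelcacheAll() path on rename or owner change.
    CREATE PUBLICATION p_all FOR ALL TABLES;
    ALTER PUBLICATION p_all RENAME TO p_all_new;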

I attached a PoC which implements the above. It passes the tests on my environment.
Could you please take a look and tell me what you think?

I have tested the POC and it is working as expected. The changes look
fine to me. I have created a patch for the same.
Currently, we are passing 'PUBLICATION_PART_ALL' as an argument to
function 'GetPublicationRelations' and
'GetAllSchemaPublicationRelations'. Need to check if we can use
'PUBLICATION_PART_ROOT' or 'PUBLICATION_PART_LEAF' depending on the
'publish_via_partition_root' option. Will test and address this in the
next version of the patch. For now, I have added a TODO.

I have tested this part. I observed that, whenever we insert data into a
partitioned table, the function 'get_rel_sync_entry' is called and a
hash entry is created for the corresponding leaf relid. So I feel that,
while invalidating here, we can specify 'PUBLICATION_PART_LEAF'. I
have made the corresponding changes in the 0002 patch.
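
To make that concrete, here is a small illustrative SQL sketch (it is not taken
from the patches, and the table and publication names are made up):

    CREATE TABLE measurements (id int, val int) PARTITION BY RANGE (id);
    CREATE TABLE measurements_p1 PARTITION OF measurements
        FOR VALUES FROM (0) TO (1000);
    CREATE PUBLICATION p_part FOR TABLE measurements;

    INSERT INTO measurements VALUES (1, 42);

    -- The row physically lives in the leaf partition, so decoding (and
    -- therefore get_rel_sync_entry) works with measurements_p1 rather
    -- than the root table measurements.
    SELECT tableoid::regclass FROM measurements;  -- returns measurements_p1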

I have also modified the tests in the 0001 patch. These changes are only
related to the syntax used to write the tests.

Thanks and Regards,
Shlok Kyal

Attachments:

v13-0002-Selective-Invalidation-of-Cache.patch (application/octet-stream)
From 0e422f51be8da9cc9f91233a5f7ff1321d515b40 Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Fri, 4 Oct 2024 12:25:31 +0530
Subject: [PATCH v13 2/2] Selective Invalidation of Cache

When we alter a publication (add/drop a namespace to/from the
publication), the cache entries for all tables are invalidated.
With this patch, for the above operations we invalidate the
cache entries of only the affected tables.
---
 src/backend/commands/alter.c                |   4 +-
 src/backend/commands/publicationcmds.c      | 107 ++++++++++++++++++++
 src/backend/parser/gram.y                   |   2 +-
 src/backend/replication/pgoutput/pgoutput.c |  18 ----
 src/include/commands/publicationcmds.h      |   1 +
 5 files changed, 112 insertions(+), 20 deletions(-)

diff --git a/src/backend/commands/alter.c b/src/backend/commands/alter.c
index 4f99ebb447..395fe530b3 100644
--- a/src/backend/commands/alter.c
+++ b/src/backend/commands/alter.c
@@ -399,6 +399,9 @@ ExecRenameStmt(RenameStmt *stmt)
 		case OBJECT_TYPE:
 			return RenameType(stmt);
 
+		case OBJECT_PUBLICATION:
+			return RenamePublication(stmt->subname, stmt->newname);
+
 		case OBJECT_AGGREGATE:
 		case OBJECT_COLLATION:
 		case OBJECT_CONVERSION:
@@ -416,7 +419,6 @@ ExecRenameStmt(RenameStmt *stmt)
 		case OBJECT_TSDICTIONARY:
 		case OBJECT_TSPARSER:
 		case OBJECT_TSTEMPLATE:
-		case OBJECT_PUBLICATION:
 		case OBJECT_SUBSCRIPTION:
 			{
 				ObjectAddress address;
diff --git a/src/backend/commands/publicationcmds.c b/src/backend/commands/publicationcmds.c
index d6ffef374e..b70091a3c6 100644
--- a/src/backend/commands/publicationcmds.c
+++ b/src/backend/commands/publicationcmds.c
@@ -433,6 +433,87 @@ pub_collist_contains_invalid_column(Oid pubid, Relation relation, List *ancestor
 	return result;
 }
 
+/*
+ * Execute ALTER PUBLICATION RENAME
+ */
+ObjectAddress
+RenamePublication(const char *oldname, const char *newname)
+{
+	Relation			rel;
+	HeapTuple			tup;
+	ObjectAddress		address;
+	Form_pg_publication	pubform;
+	bool				replaces[Natts_pg_publication];
+	bool				nulls[Natts_pg_publication];
+	Datum				values[Natts_pg_publication];
+
+	rel = table_open(PublicationRelationId, RowExclusiveLock);
+
+	tup = SearchSysCacheCopy1(PUBLICATIONNAME,
+							  CStringGetDatum(oldname));
+
+	if (!HeapTupleIsValid(tup))
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_OBJECT),
+				 errmsg("publication \"%s\" does not exist",
+						oldname)));
+
+	pubform = (Form_pg_publication) GETSTRUCT(tup);
+
+	/* must be owner */
+	if (!object_ownercheck(PublicationRelationId, pubform->oid, GetUserId()))
+		aclcheck_error(ACLCHECK_NOT_OWNER, OBJECT_PUBLICATION,
+					   NameStr(pubform->pubname));
+
+	/* Everything ok, form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* Only update the pubname */
+	values[Anum_pg_publication_pubname - 1] =
+		DirectFunctionCall1(namein, CStringGetDatum(newname));
+	replaces[Anum_pg_publication_pubname - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel), values, nulls,
+							replaces);
+
+	/* Invalidate the relcache. */
+	if (pubform->puballtables)
+	{
+		CacheInvalidateRelcacheAll();
+	}
+	else
+	{
+		List	   *relids = NIL;
+		List	   *schemarelids = NIL;
+
+		/*
+		 * For partition table, when we insert data, get_rel_sync_entry is called and
+		 * a hash entry is created for the corresponding leaf table. So invalidating
+		 * the leaf nodes would be sufficient here.
+		 */
+		relids = GetPublicationRelations(pubform->oid,
+										 PUBLICATION_PART_LEAF);
+		schemarelids = GetAllSchemaPublicationRelations(pubform->oid,
+														PUBLICATION_PART_LEAF);
+
+		relids = list_concat_unique_oid(relids, schemarelids);
+
+		InvalidatePublicationRels(relids);
+	}
+
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	ObjectAddressSet(address, PublicationRelationId, pubform->oid);
+
+	heap_freetuple(tup);
+
+	table_close(rel, RowExclusiveLock);
+
+	return address;
+}
+
 /* check_functions_in_node callback */
 static bool
 contain_mutable_or_user_functions_checker(Oid func_id, void *context)
@@ -1920,6 +2001,32 @@ AlterPublicationOwner_internal(Relation rel, HeapTuple tup, Oid newOwnerId)
 	}
 
 	form->pubowner = newOwnerId;
+
+	/* Invalidate the relcache. */
+	if (form->puballtables)
+	{
+		CacheInvalidateRelcacheAll();
+	}
+	else
+	{
+		List	   *relids = NIL;
+		List	   *schemarelids = NIL;
+
+		/*
+		 * For partition table, when we insert data, get_rel_sync_entry is called and
+		 * a hash entry is created for the corresponding leaf table. So invalidating
+		 * the leaf nodes would be sufficient here.
+		 */
+		relids = GetPublicationRelations(form->oid,
+										 PUBLICATION_PART_LEAF);
+		schemarelids = GetAllSchemaPublicationRelations(form->oid,
+														PUBLICATION_PART_LEAF);
+
+		relids = list_concat_unique_oid(relids, schemarelids);
+
+		InvalidatePublicationRels(relids);
+	}
+
 	CatalogTupleUpdate(rel, &tup->t_self, tup);
 
 	/* Update owner dependency reference */
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 4aa8646af7..ec10bfdd8c 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -9466,7 +9466,7 @@ RenameStmt: ALTER AGGREGATE aggregate_with_argtypes RENAME TO name
 					RenameStmt *n = makeNode(RenameStmt);
 
 					n->renameType = OBJECT_PUBLICATION;
-					n->object = (Node *) makeString($3);
+					n->subname = $3;
 					n->newname = $6;
 					n->missing_ok = false;
 					$$ = (Node *) n;
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 00e7024563..b8429be8cf 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -1739,12 +1739,6 @@ static void
 publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 {
 	publications_valid = false;
-
-	/*
-	 * Also invalidate per-relation cache so that next time the filtering info
-	 * is checked it will be updated with the new publication settings.
-	 */
-	rel_sync_cache_publication_cb(arg, cacheid, hashvalue);
 }
 
 /*
@@ -1920,18 +1914,6 @@ init_rel_sync_cache(MemoryContext cachectx)
 								  rel_sync_cache_publication_cb,
 								  (Datum) 0);
 
-	/*
-	 * Flush all cache entries after any publication changes.  (We need no
-	 * callback entry for pg_publication, because publication_invalidation_cb
-	 * will take care of it.)
-	 */
-	CacheRegisterSyscacheCallback(PUBLICATIONRELMAP,
-								  rel_sync_cache_publication_cb,
-								  (Datum) 0);
-	CacheRegisterSyscacheCallback(PUBLICATIONNAMESPACEMAP,
-								  rel_sync_cache_publication_cb,
-								  (Datum) 0);
-
 	relation_callbacks_registered = true;
 }
 
diff --git a/src/include/commands/publicationcmds.h b/src/include/commands/publicationcmds.h
index 5487c571f6..b953193812 100644
--- a/src/include/commands/publicationcmds.h
+++ b/src/include/commands/publicationcmds.h
@@ -35,5 +35,6 @@ extern bool pub_rf_contains_invalid_column(Oid pubid, Relation relation,
 										   List *ancestors, bool pubviaroot);
 extern bool pub_collist_contains_invalid_column(Oid pubid, Relation relation,
 												List *ancestors, bool pubviaroot);
+extern ObjectAddress RenamePublication(const char *oldname, const char *newname);
 
 #endif							/* PUBLICATIONCMDS_H */
-- 
2.34.1

v13-0001-Distribute-invalidatons-if-change-in-catalog-tab.patch (application/octet-stream)
From 6c3c7500cf457bc71e66261c3569b150486d626f Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Fri, 23 Aug 2024 14:02:20 +0530
Subject: [PATCH v13 1/2] Distribute invalidatons if change in catalog tables

Distribute invalidations to in-progress transactions if the
committed transaction has changed any catalog table.
---
 .../replication/logical/reorderbuffer.c       |   5 +-
 src/backend/replication/logical/snapbuild.c   |  34 ++-
 src/include/replication/reorderbuffer.h       |   4 +
 src/test/subscription/t/100_bugs.pl           | 267 ++++++++++++++++++
 4 files changed, 296 insertions(+), 14 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 22bcf171ff..c5dfc1ab06 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -221,9 +221,6 @@ int			debug_logical_replication_streaming = DEBUG_LOGICAL_REP_STREAMING_BUFFERED
  */
 static ReorderBufferTXN *ReorderBufferGetTXN(ReorderBuffer *rb);
 static void ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static ReorderBufferTXN *ReorderBufferTXNByXid(ReorderBuffer *rb,
-											   TransactionId xid, bool create, bool *is_new,
-											   XLogRecPtr lsn, bool create_as_top);
 static void ReorderBufferTransferSnapToParent(ReorderBufferTXN *txn,
 											  ReorderBufferTXN *subtxn);
 
@@ -622,7 +619,7 @@ ReorderBufferReturnRelids(ReorderBuffer *rb, Oid *relids)
  * (with the given LSN, and as top transaction if that's specified);
  * when this happens, is_new is set to true.
  */
-static ReorderBufferTXN *
+ReorderBufferTXN *
 ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 					  bool *is_new, XLogRecPtr lsn, bool create_as_top)
 {
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 0450f94ba8..1f7c24cad0 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -300,7 +300,7 @@ static void SnapBuildFreeSnapshot(Snapshot snap);
 
 static void SnapBuildSnapIncRefcount(Snapshot snap);
 
-static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
+static void SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid);
 
 static inline bool SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid,
 												 uint32 xinfo);
@@ -859,18 +859,21 @@ SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
 }
 
 /*
- * Add a new Snapshot to all transactions we're decoding that currently are
- * in-progress so they can see new catalog contents made by the transaction
- * that just committed. This is necessary because those in-progress
- * transactions will use the new catalog's contents from here on (at the very
- * least everything they do needs to be compatible with newer catalog
- * contents).
+ * Add a new Snapshot and invalidation messages to all transactions we're
+ * decoding that currently are in-progress so they can see new catalog contents
+ * made by the transaction that just committed. This is necessary because those
+ * in-progress transactions will use the new catalog's contents from here on
+ * (at the very least everything they do needs to be compatible with newer
+ * catalog contents).
  */
 static void
-SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
+SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid)
 {
 	dlist_iter	txn_i;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *curr_txn;
+
+	curr_txn = ReorderBufferTXNByXid(builder->reorder, xid, false, NULL, InvalidXLogRecPtr, false);
 
 	/*
 	 * Iterate through all toplevel transactions. This can include
@@ -913,6 +916,14 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		SnapBuildSnapIncRefcount(builder->snapshot);
 		ReorderBufferAddSnapshot(builder->reorder, txn->xid, lsn,
 								 builder->snapshot);
+
+		/*
+		 * Add invalidation messages to the reorder buffer of inprogress
+		 * transactions except the current committed transaction
+		 */
+		if (txn->xid != xid && curr_txn->ninvalidations > 0)
+			ReorderBufferAddInvalidations(builder->reorder, txn->xid, lsn,
+										  curr_txn->ninvalidations, curr_txn->invalidations);
 	}
 }
 
@@ -1184,8 +1195,11 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/* refcount of the snapshot builder for the new snapshot */
 		SnapBuildSnapIncRefcount(builder->snapshot);
 
-		/* add a new catalog snapshot to all currently running transactions */
-		SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
+		/*
+		 * add a new catalog snapshot and invalidations messages to all
+		 * currently running transactions
+		 */
+		SnapBuildDistributeSnapshotAndInval(builder, lsn, xid);
 	}
 }
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index e332635f70..093d21213a 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -743,6 +743,10 @@ extern TransactionId *ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb);
 
 extern void ReorderBufferSetRestartPoint(ReorderBuffer *rb, XLogRecPtr ptr);
 
+extern ReorderBufferTXN *ReorderBufferTXNByXid(ReorderBuffer *rb,
+											   TransactionId xid, bool create, bool *is_new,
+											   XLogRecPtr lsn, bool create_as_top);
+
 extern void StartupReorderBuffer(void);
 
 #endif
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index cb36ca7b16..c581e28261 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -487,6 +487,273 @@ $result =
 is( $result, qq(2|f
 3|t), 'check replicated update on subscriber');
 
+# Clean up
+$node_publisher->safe_psql('postgres', "DROP  PUBLICATION pub1");
+$node_subscriber->safe_psql('postgres', "DROP  SUBSCRIPTION sub1");
+
+# The bug was that the incremental data synchronization was being skipped when
+# a new table is added to the publication in presence of a concurrent active
+# transaction performing the DML on the same table.
+
+# Initial setup.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE SCHEMA sch3;
+	CREATE TABLE sch3.tab_conc(a int);
+	CREATE PUBLICATION regress_pub1;
+));
+
+$node_subscriber->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE SCHEMA sch3;
+	CREATE TABLE sch3.tab_conc(a int);
+	CREATE SUBSCRIPTION regress_sub1 CONNECTION '$publisher_connstr' PUBLICATION regress_pub1;
+));
+
+# Bump the query timeout to avoid false negatives on slow test systems.
+my $psql_timeout_secs = 4 * $PostgreSQL::Test::Utils::timeout_default;
+
+# Initiate 3 background sessions.
+my $background_psql1 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+$background_psql1->set_query_timer_restart();
+
+my $background_psql2 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+
+$background_psql2->set_query_timer_restart();
+
+my $background_psql3 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+$background_psql3->set_query_timer_restart();
+
+# Maintain an active transaction with the table that will be added to the
+# publication.
+$background_psql1->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO tab_conc VALUES (1);
+));
+
+# Maintain an active transaction with a schema table that will be added to the
+# publication.
+$background_psql2->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (1);
+));
+
+# Add the table to the publication using background_psql, as the alter
+# publication operation will distribute the invalidations to inprogress txns.
+$background_psql3->query_safe(
+	"ALTER PUBLICATION regress_pub1 ADD TABLE tab_conc, TABLES IN SCHEMA sch3"
+);
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe("COMMIT");
+$background_psql2->query_safe("COMMIT");
+
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (2);
+	INSERT INTO sch3.tab_conc VALUES (2);
+));
+
+# Refresh the publication.
+$node_subscriber->safe_psql('postgres',
+	"ALTER SUBSCRIPTION regress_sub1 REFRESH PUBLICATION");
+
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'regress_sub1');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2),
+	'Ensure that the data from the tab_conc table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2),
+	'Ensure that the data from the sch3.tab_conc table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+# Perform an insert.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (3);
+	INSERT INTO sch3.tab_conc VALUES (3);
+));
+$node_publisher->wait_for_catchup('regress_sub1');
+
+# Verify that the insert is replicated to the subscriber.
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table sch3.tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+# The bug was that the incremental data synchronization was happening even when
+# tables are dropped from the publication in presence of a concurrent active
+# transaction performing the DML on the same table.
+
+# Maintain an active transaction with the table that will be dropped from the
+# publication.
+$background_psql1->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO tab_conc VALUES (4);
+));
+
+# Maintain an active transaction with a schema table that will be dropped from the
+# publication.
+$background_psql2->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (4);
+));
+
+# Drop the table from the publication using background_psql, as the alter
+# publication operation will distribute the invalidations to inprogress txns.
+$background_psql3->query_safe(
+	"ALTER PUBLICATION regress_pub1 DROP TABLE tab_conc, TABLES IN SCHEMA sch3"
+);
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe("COMMIT");
+$background_psql2->query_safe("COMMIT");
+
+# Perform an insert.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (5);
+	INSERT INTO sch3.tab_conc VALUES (5);
+));
+
+$node_publisher->wait_for_catchup('regress_sub1');
+
+# Verify that the insert is not replicated to the subscriber.
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3
+4),
+	'Verify that data for table tab_conc are not replicated to subscriber');
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2
+3
+4),
+	'Verify that the incremental data for table sch3.tab_conc are not replicated to subscriber'
+);
+
+# The bug was that the incremental data synchronization was happening even after
+# publication is dropped in a concurrent active transaction.
+
+# Add tables to the publication.
+$background_psql3->query_safe(
+	"ALTER PUBLICATION regress_pub1 ADD TABLE tab_conc, TABLES IN SCHEMA sch3"
+);
+
+# Maintain an active transaction with the table.
+$background_psql1->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO tab_conc VALUES (6);
+));
+
+# Maintain an active transaction with a schema table.
+$background_psql2->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (6);
+));
+
+# Drop publication.
+$background_psql3->query_safe("DROP PUBLICATION regress_pub1");
+
+# Perform an insert.
+$background_psql1->query_safe("INSERT INTO tab_conc VALUES (7)");
+$background_psql2->query_safe("INSERT INTO sch3.tab_conc VALUES (7)");
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe("COMMIT");
+$background_psql2->query_safe("COMMIT");
+
+# ERROR should appear on subscriber.
+my $offset = -s $node_subscriber->logfile;
+$node_subscriber->wait_for_log(
+	qr/ERROR:  publication "regress_pub1" does not exist/, $offset);
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION regress_sub1");
+
+# The bug was that the incremental data synchronization was happening even after
+# publication is renamed in a concurrent active transaction.
+
+# Create publication.
+$background_psql3->query_safe(
+	"CREATE PUBLICATION regress_pub1 FOR TABLE tab_conc, TABLES IN SCHEMA sch3"
+);
+
+# Create subscription.
+$node_subscriber->safe_psql('postgres',
+	"CREATE SUBSCRIPTION regress_sub1 CONNECTION '$publisher_connstr' PUBLICATION regress_pub1"
+);
+
+# Maintain an active transaction with the table.
+$background_psql1->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO tab_conc VALUES (8);
+));
+
+# Maintain an active transaction with a schema table.
+$background_psql2->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (8);
+));
+
+# Rename publication.
+$background_psql3->query_safe(
+	"ALTER PUBLICATION regress_pub1 RENAME TO regress_pub1_rename");
+
+# Perform an insert.
+$background_psql1->query_safe("INSERT INTO tab_conc VALUES (9)");
+$background_psql2->query_safe("INSERT INTO sch3.tab_conc VALUES (9)");
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe("COMMIT");
+$background_psql2->query_safe("COMMIT");
+
+# ERROR should appear on subscriber.
+$offset = -s $node_subscriber->logfile;
+$node_subscriber->wait_for_log(
+	qr/ERROR:  publication "regress_pub1" does not exist/, $offset);
+
+$background_psql1->quit;
+$background_psql2->quit;
+$background_psql3->quit;
+
 $node_publisher->stop('fast');
 $node_subscriber->stop('fast');
 
-- 
2.34.1

#65Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Shlok Kyal (#64)
1 attachment(s)
RE: long-standing data loss bug in initial sync of logical replication

Dear Shlok,

I have tested this part. I observed that, whenever we insert data into a
partitioned table, the function 'get_rel_sync_entry' is called and a
hash entry is created for the corresponding leaf relid. So I feel that,
while invalidating here, we can specify 'PUBLICATION_PART_LEAF'. I
have made the corresponding changes in the 0002 patch.

I also verified this and it seems true. The root table is a virtual table and the
actual changes are recorded in the leaf tables. The same holds at the WAL layer:
logical decoding obtains its information from WAL records, so the leaf tables are
what get passed to the pgoutput layer as "relation". I.e., I think it is enough to
invalidate the relcache entries of the leaf tables.

I have also modified the tests in the 0001 patch. These changes are only
related to the syntax used to write the tests.

LGTM. I found some small improvements; please find them attached.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

minor_fix.diffs (application/octet-stream)
diff --git a/src/backend/commands/publicationcmds.c b/src/backend/commands/publicationcmds.c
index b70091a3c6..ab380c60be 100644
--- a/src/backend/commands/publicationcmds.c
+++ b/src/backend/commands/publicationcmds.c
@@ -489,9 +489,9 @@ RenamePublication(const char *oldname, const char *newname)
 		List	   *schemarelids = NIL;
 
 		/*
-		 * For partition table, when we insert data, get_rel_sync_entry is called and
-		 * a hash entry is created for the corresponding leaf table. So invalidating
-		 * the leaf nodes would be sufficient here.
+		 * For partition table, when we insert data, get_rel_sync_entry is
+		 * called and a hash entry is created for the corresponding leaf table.
+		 * So invalidating the leaf nodes would be sufficient here.
 		 */
 		relids = GetPublicationRelations(pubform->oid,
 										 PUBLICATION_PART_LEAF);
@@ -2013,9 +2013,9 @@ AlterPublicationOwner_internal(Relation rel, HeapTuple tup, Oid newOwnerId)
 		List	   *schemarelids = NIL;
 
 		/*
-		 * For partition table, when we insert data, get_rel_sync_entry is called and
-		 * a hash entry is created for the corresponding leaf table. So invalidating
-		 * the leaf nodes would be sufficient here.
+		 * For partition table, when we insert data, get_rel_sync_entry is
+		 * called and a hash entry is created for the corresponding leaf table.
+		 * So invalidating the leaf nodes would be sufficient here.
 		 */
 		relids = GetPublicationRelations(form->oid,
 										 PUBLICATION_PART_LEAF);
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 1f7c24cad0..d0a5e7d026 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -867,13 +867,15 @@ SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
  * catalog contents).
  */
 static void
-SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid)
+SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn,
+									TransactionId xid)
 {
 	dlist_iter	txn_i;
 	ReorderBufferTXN *txn;
 	ReorderBufferTXN *curr_txn;
 
-	curr_txn = ReorderBufferTXNByXid(builder->reorder, xid, false, NULL, InvalidXLogRecPtr, false);
+	curr_txn = ReorderBufferTXNByXid(builder->reorder, xid, false, NULL,
+									 InvalidXLogRecPtr, false);
 
 	/*
 	 * Iterate through all toplevel transactions. This can include
@@ -923,7 +925,8 @@ SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, Transact
 		 */
 		if (txn->xid != xid && curr_txn->ninvalidations > 0)
 			ReorderBufferAddInvalidations(builder->reorder, txn->xid, lsn,
-										  curr_txn->ninvalidations, curr_txn->invalidations);
+										  curr_txn->ninvalidations,
+										  curr_txn->invalidations);
 	}
 }
 
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index c581e28261..72aaaae272 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -488,8 +488,8 @@ is( $result, qq(2|f
 3|t), 'check replicated update on subscriber');
 
 # Clean up
-$node_publisher->safe_psql('postgres', "DROP  PUBLICATION pub1");
-$node_subscriber->safe_psql('postgres', "DROP  SUBSCRIPTION sub1");
+$node_publisher->safe_psql('postgres', "DROP PUBLICATION pub1");
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION sub1");
 
 # The bug was that the incremental data synchronization was being skipped when
 # a new table is added to the publication in presence of a concurrent active
#66Shlok Kyal
shlok.kyal.oss@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#65)
2 attachment(s)
Re: long-standing data loss bug in initial sync of logical replication

Hi Kuroda-san,

I have also modified the tests in the 0001 patch. These changes are only
related to the syntax used to write the tests.

LGTM. I found some small improvements; please find them attached.

I have applied the changes and updated the patch.

Thanks & Regards,
Shlok Kyal

Attachments:

v14-0001-Distribute-invalidatons-if-change-in-catalog-tab.patch (application/x-patch)
From 07f94de76be177d0e39762cb2bd36a4bc04a7993 Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Fri, 23 Aug 2024 14:02:20 +0530
Subject: [PATCH v14 1/2] Distribute invalidatons if change in catalog tables

Distribute invalidations to in-progress transactions if the
committed transaction has changed any catalog table.
---
 .../replication/logical/reorderbuffer.c       |   5 +-
 src/backend/replication/logical/snapbuild.c   |  34 ++-
 src/include/replication/reorderbuffer.h       |   4 +
 src/test/subscription/t/100_bugs.pl           | 267 ++++++++++++++++++
 4 files changed, 296 insertions(+), 14 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 22bcf171ff..c5dfc1ab06 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -221,9 +221,6 @@ int			debug_logical_replication_streaming = DEBUG_LOGICAL_REP_STREAMING_BUFFERED
  */
 static ReorderBufferTXN *ReorderBufferGetTXN(ReorderBuffer *rb);
 static void ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static ReorderBufferTXN *ReorderBufferTXNByXid(ReorderBuffer *rb,
-											   TransactionId xid, bool create, bool *is_new,
-											   XLogRecPtr lsn, bool create_as_top);
 static void ReorderBufferTransferSnapToParent(ReorderBufferTXN *txn,
 											  ReorderBufferTXN *subtxn);
 
@@ -622,7 +619,7 @@ ReorderBufferReturnRelids(ReorderBuffer *rb, Oid *relids)
  * (with the given LSN, and as top transaction if that's specified);
  * when this happens, is_new is set to true.
  */
-static ReorderBufferTXN *
+ReorderBufferTXN *
 ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 					  bool *is_new, XLogRecPtr lsn, bool create_as_top)
 {
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 0450f94ba8..1f7c24cad0 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -300,7 +300,7 @@ static void SnapBuildFreeSnapshot(Snapshot snap);
 
 static void SnapBuildSnapIncRefcount(Snapshot snap);
 
-static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
+static void SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid);
 
 static inline bool SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid,
 												 uint32 xinfo);
@@ -859,18 +859,21 @@ SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
 }
 
 /*
- * Add a new Snapshot to all transactions we're decoding that currently are
- * in-progress so they can see new catalog contents made by the transaction
- * that just committed. This is necessary because those in-progress
- * transactions will use the new catalog's contents from here on (at the very
- * least everything they do needs to be compatible with newer catalog
- * contents).
+ * Add a new Snapshot and invalidation messages to all transactions we're
+ * decoding that currently are in-progress so they can see new catalog contents
+ * made by the transaction that just committed. This is necessary because those
+ * in-progress transactions will use the new catalog's contents from here on
+ * (at the very least everything they do needs to be compatible with newer
+ * catalog contents).
  */
 static void
-SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
+SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid)
 {
 	dlist_iter	txn_i;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *curr_txn;
+
+	curr_txn = ReorderBufferTXNByXid(builder->reorder, xid, false, NULL, InvalidXLogRecPtr, false);
 
 	/*
 	 * Iterate through all toplevel transactions. This can include
@@ -913,6 +916,14 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		SnapBuildSnapIncRefcount(builder->snapshot);
 		ReorderBufferAddSnapshot(builder->reorder, txn->xid, lsn,
 								 builder->snapshot);
+
+		/*
+		 * Add invalidation messages to the reorder buffer of inprogress
+		 * transactions except the current committed transaction
+		 */
+		if (txn->xid != xid && curr_txn->ninvalidations > 0)
+			ReorderBufferAddInvalidations(builder->reorder, txn->xid, lsn,
+										  curr_txn->ninvalidations, curr_txn->invalidations);
 	}
 }
 
@@ -1184,8 +1195,11 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/* refcount of the snapshot builder for the new snapshot */
 		SnapBuildSnapIncRefcount(builder->snapshot);
 
-		/* add a new catalog snapshot to all currently running transactions */
-		SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
+		/*
+		 * add a new catalog snapshot and invalidations messages to all
+		 * currently running transactions
+		 */
+		SnapBuildDistributeSnapshotAndInval(builder, lsn, xid);
 	}
 }
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index e332635f70..093d21213a 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -743,6 +743,10 @@ extern TransactionId *ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb);
 
 extern void ReorderBufferSetRestartPoint(ReorderBuffer *rb, XLogRecPtr ptr);
 
+extern ReorderBufferTXN *ReorderBufferTXNByXid(ReorderBuffer *rb,
+											   TransactionId xid, bool create, bool *is_new,
+											   XLogRecPtr lsn, bool create_as_top);
+
 extern void StartupReorderBuffer(void);
 
 #endif
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index cb36ca7b16..72aaaae272 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -487,6 +487,273 @@ $result =
 is( $result, qq(2|f
 3|t), 'check replicated update on subscriber');
 
+# Clean up
+$node_publisher->safe_psql('postgres', "DROP PUBLICATION pub1");
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION sub1");
+
+# The bug was that the incremental data synchronization was being skipped when
+# a new table is added to the publication in presence of a concurrent active
+# transaction performing the DML on the same table.
+
+# Initial setup.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE SCHEMA sch3;
+	CREATE TABLE sch3.tab_conc(a int);
+	CREATE PUBLICATION regress_pub1;
+));
+
+$node_subscriber->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE SCHEMA sch3;
+	CREATE TABLE sch3.tab_conc(a int);
+	CREATE SUBSCRIPTION regress_sub1 CONNECTION '$publisher_connstr' PUBLICATION regress_pub1;
+));
+
+# Bump the query timeout to avoid false negatives on slow test systems.
+my $psql_timeout_secs = 4 * $PostgreSQL::Test::Utils::timeout_default;
+
+# Initiate 3 background sessions.
+my $background_psql1 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+$background_psql1->set_query_timer_restart();
+
+my $background_psql2 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+
+$background_psql2->set_query_timer_restart();
+
+my $background_psql3 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+$background_psql3->set_query_timer_restart();
+
+# Maintain an active transaction with the table that will be added to the
+# publication.
+$background_psql1->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO tab_conc VALUES (1);
+));
+
+# Maintain an active transaction with a schema table that will be added to the
+# publication.
+$background_psql2->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (1);
+));
+
+# Add the table to the publication using background_psql, as the alter
+# publication operation will distribute the invalidations to inprogress txns.
+$background_psql3->query_safe(
+	"ALTER PUBLICATION regress_pub1 ADD TABLE tab_conc, TABLES IN SCHEMA sch3"
+);
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe("COMMIT");
+$background_psql2->query_safe("COMMIT");
+
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (2);
+	INSERT INTO sch3.tab_conc VALUES (2);
+));
+
+# Refresh the publication.
+$node_subscriber->safe_psql('postgres',
+	"ALTER SUBSCRIPTION regress_sub1 REFRESH PUBLICATION");
+
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'regress_sub1');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2),
+	'Ensure that the data from the tab_conc table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2),
+	'Ensure that the data from the sch3.tab_conc table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+# Perform an insert.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (3);
+	INSERT INTO sch3.tab_conc VALUES (3);
+));
+$node_publisher->wait_for_catchup('regress_sub1');
+
+# Verify that the insert is replicated to the subscriber.
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table sch3.tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+# The bug was that the incremental data synchronization was happening even when
+# tables are dropped from the publication in presence of a concurrent active
+# transaction performing the DML on the same table.
+
+# Maintain an active transaction with the table that will be dropped from the
+# publication.
+$background_psql1->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO tab_conc VALUES (4);
+));
+
+# Maintain an active transaction with a schema table that will be dropped from the
+# publication.
+$background_psql2->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (4);
+));
+
+# Drop the table from the publication using background_psql, as the alter
+# publication operation will distribute the invalidations to inprogress txns.
+$background_psql3->query_safe(
+	"ALTER PUBLICATION regress_pub1 DROP TABLE tab_conc, TABLES IN SCHEMA sch3"
+);
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe("COMMIT");
+$background_psql2->query_safe("COMMIT");
+
+# Perform an insert.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (5);
+	INSERT INTO sch3.tab_conc VALUES (5);
+));
+
+$node_publisher->wait_for_catchup('regress_sub1');
+
+# Verify that the insert is not replicated to the subscriber.
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3
+4),
+	'Verify that data for table tab_conc are not replicated to subscriber');
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2
+3
+4),
+	'Verify that the incremental data for table sch3.tab_conc are not replicated to subscriber'
+);
+
+# The bug was that the incremental data synchronization was happening even after
+# publication is dropped in a concurrent active transaction.
+
+# Add tables to the publication.
+$background_psql3->query_safe(
+	"ALTER PUBLICATION regress_pub1 ADD TABLE tab_conc, TABLES IN SCHEMA sch3"
+);
+
+# Maintain an active transaction with the table.
+$background_psql1->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO tab_conc VALUES (6);
+));
+
+# Maintain an active transaction with a schema table.
+$background_psql2->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (6);
+));
+
+# Drop publication.
+$background_psql3->query_safe("DROP PUBLICATION regress_pub1");
+
+# Perform an insert.
+$background_psql1->query_safe("INSERT INTO tab_conc VALUES (7)");
+$background_psql2->query_safe("INSERT INTO sch3.tab_conc VALUES (7)");
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe("COMMIT");
+$background_psql2->query_safe("COMMIT");
+
+# ERROR should appear on subscriber.
+my $offset = -s $node_subscriber->logfile;
+$node_subscriber->wait_for_log(
+	qr/ERROR:  publication "regress_pub1" does not exist/, $offset);
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION regress_sub1");
+
+# The bug was that the incremental data synchronization was happening even after
+# publication is renamed in a concurrent active transaction.
+
+# Create publication.
+$background_psql3->query_safe(
+	"CREATE PUBLICATION regress_pub1 FOR TABLE tab_conc, TABLES IN SCHEMA sch3"
+);
+
+# Create subscription.
+$node_subscriber->safe_psql('postgres',
+	"CREATE SUBSCRIPTION regress_sub1 CONNECTION '$publisher_connstr' PUBLICATION regress_pub1"
+);
+
+# Maintain an active transaction with the table.
+$background_psql1->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO tab_conc VALUES (8);
+));
+
+# Maintain an active transaction with a schema table.
+$background_psql2->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (8);
+));
+
+# Rename publication.
+$background_psql3->query_safe(
+	"ALTER PUBLICATION regress_pub1 RENAME TO regress_pub1_rename");
+
+# Perform an insert.
+$background_psql1->query_safe("INSERT INTO tab_conc VALUES (9)");
+$background_psql2->query_safe("INSERT INTO sch3.tab_conc VALUES (9)");
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe("COMMIT");
+$background_psql2->query_safe("COMMIT");
+
+# ERROR should appear on subscriber.
+$offset = -s $node_subscriber->logfile;
+$node_subscriber->wait_for_log(
+	qr/ERROR:  publication "regress_pub1" does not exist/, $offset);
+
+$background_psql1->quit;
+$background_psql2->quit;
+$background_psql3->quit;
+
 $node_publisher->stop('fast');
 $node_subscriber->stop('fast');
 
-- 
2.34.1

v14-0002-Selective-Invalidation-of-Cache.patch (application/x-patch)
From a134f762eec24dbacf1f9b94a8b777cfb58655c7 Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Fri, 4 Oct 2024 12:25:31 +0530
Subject: [PATCH v14 2/2] Selective Invalidation of Cache

When we alter a publication (add/drop a namespace to/from the
publication), the cache entries for all tables are invalidated.
With this patch, for the above operations we invalidate the
cache entries of only the affected tables.
---
 src/backend/commands/alter.c                |   4 +-
 src/backend/commands/publicationcmds.c      | 107 ++++++++++++++++++++
 src/backend/parser/gram.y                   |   2 +-
 src/backend/replication/logical/snapbuild.c |   9 +-
 src/backend/replication/pgoutput/pgoutput.c |  18 ----
 src/include/commands/publicationcmds.h      |   1 +
 6 files changed, 118 insertions(+), 23 deletions(-)

diff --git a/src/backend/commands/alter.c b/src/backend/commands/alter.c
index 4f99ebb447..395fe530b3 100644
--- a/src/backend/commands/alter.c
+++ b/src/backend/commands/alter.c
@@ -399,6 +399,9 @@ ExecRenameStmt(RenameStmt *stmt)
 		case OBJECT_TYPE:
 			return RenameType(stmt);
 
+		case OBJECT_PUBLICATION:
+			return RenamePublication(stmt->subname, stmt->newname);
+
 		case OBJECT_AGGREGATE:
 		case OBJECT_COLLATION:
 		case OBJECT_CONVERSION:
@@ -416,7 +419,6 @@ ExecRenameStmt(RenameStmt *stmt)
 		case OBJECT_TSDICTIONARY:
 		case OBJECT_TSPARSER:
 		case OBJECT_TSTEMPLATE:
-		case OBJECT_PUBLICATION:
 		case OBJECT_SUBSCRIPTION:
 			{
 				ObjectAddress address;
diff --git a/src/backend/commands/publicationcmds.c b/src/backend/commands/publicationcmds.c
index d6ffef374e..ab380c60be 100644
--- a/src/backend/commands/publicationcmds.c
+++ b/src/backend/commands/publicationcmds.c
@@ -433,6 +433,87 @@ pub_collist_contains_invalid_column(Oid pubid, Relation relation, List *ancestor
 	return result;
 }
 
+/*
+ * Execute ALTER PUBLICATION RENAME
+ */
+ObjectAddress
+RenamePublication(const char *oldname, const char *newname)
+{
+	Relation			rel;
+	HeapTuple			tup;
+	ObjectAddress		address;
+	Form_pg_publication	pubform;
+	bool				replaces[Natts_pg_publication];
+	bool				nulls[Natts_pg_publication];
+	Datum				values[Natts_pg_publication];
+
+	rel = table_open(PublicationRelationId, RowExclusiveLock);
+
+	tup = SearchSysCacheCopy1(PUBLICATIONNAME,
+							  CStringGetDatum(oldname));
+
+	if (!HeapTupleIsValid(tup))
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_OBJECT),
+				 errmsg("publication \"%s\" does not exist",
+						oldname)));
+
+	pubform = (Form_pg_publication) GETSTRUCT(tup);
+
+	/* must be owner */
+	if (!object_ownercheck(PublicationRelationId, pubform->oid, GetUserId()))
+		aclcheck_error(ACLCHECK_NOT_OWNER, OBJECT_PUBLICATION,
+					   NameStr(pubform->pubname));
+
+	/* Everything ok, form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* Only update the pubname */
+	values[Anum_pg_publication_pubname - 1] =
+		DirectFunctionCall1(namein, CStringGetDatum(newname));
+	replaces[Anum_pg_publication_pubname - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel), values, nulls,
+							replaces);
+
+	/* Invalidate the relcache. */
+	if (pubform->puballtables)
+	{
+		CacheInvalidateRelcacheAll();
+	}
+	else
+	{
+		List	   *relids = NIL;
+		List	   *schemarelids = NIL;
+
+		/*
+		 * For partition table, when we insert data, get_rel_sync_entry is
+		 * called and a hash entry is created for the corresponding leaf table.
+		 * So invalidating the leaf nodes would be sufficient here.
+		 */
+		relids = GetPublicationRelations(pubform->oid,
+										 PUBLICATION_PART_LEAF);
+		schemarelids = GetAllSchemaPublicationRelations(pubform->oid,
+														PUBLICATION_PART_LEAF);
+
+		relids = list_concat_unique_oid(relids, schemarelids);
+
+		InvalidatePublicationRels(relids);
+	}
+
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	ObjectAddressSet(address, PublicationRelationId, pubform->oid);
+
+	heap_freetuple(tup);
+
+	table_close(rel, RowExclusiveLock);
+
+	return address;
+}
+
 /* check_functions_in_node callback */
 static bool
 contain_mutable_or_user_functions_checker(Oid func_id, void *context)
@@ -1920,6 +2001,32 @@ AlterPublicationOwner_internal(Relation rel, HeapTuple tup, Oid newOwnerId)
 	}
 
 	form->pubowner = newOwnerId;
+
+	/* Invalidate the relcache. */
+	if (form->puballtables)
+	{
+		CacheInvalidateRelcacheAll();
+	}
+	else
+	{
+		List	   *relids = NIL;
+		List	   *schemarelids = NIL;
+
+		/*
+		 * For partition table, when we insert data, get_rel_sync_entry is
+		 * called and a hash entry is created for the corresponding leaf table.
+		 * So invalidating the leaf nodes would be sufficient here.
+		 */
+		relids = GetPublicationRelations(form->oid,
+										 PUBLICATION_PART_LEAF);
+		schemarelids = GetAllSchemaPublicationRelations(form->oid,
+														PUBLICATION_PART_LEAF);
+
+		relids = list_concat_unique_oid(relids, schemarelids);
+
+		InvalidatePublicationRels(relids);
+	}
+
 	CatalogTupleUpdate(rel, &tup->t_self, tup);
 
 	/* Update owner dependency reference */
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 4aa8646af7..ec10bfdd8c 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -9466,7 +9466,7 @@ RenameStmt: ALTER AGGREGATE aggregate_with_argtypes RENAME TO name
 					RenameStmt *n = makeNode(RenameStmt);
 
 					n->renameType = OBJECT_PUBLICATION;
-					n->object = (Node *) makeString($3);
+					n->subname = $3;
 					n->newname = $6;
 					n->missing_ok = false;
 					$$ = (Node *) n;
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 1f7c24cad0..d0a5e7d026 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -867,13 +867,15 @@ SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
  * catalog contents).
  */
 static void
-SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid)
+SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn,
+									TransactionId xid)
 {
 	dlist_iter	txn_i;
 	ReorderBufferTXN *txn;
 	ReorderBufferTXN *curr_txn;
 
-	curr_txn = ReorderBufferTXNByXid(builder->reorder, xid, false, NULL, InvalidXLogRecPtr, false);
+	curr_txn = ReorderBufferTXNByXid(builder->reorder, xid, false, NULL,
+									 InvalidXLogRecPtr, false);
 
 	/*
 	 * Iterate through all toplevel transactions. This can include
@@ -923,7 +925,8 @@ SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, Transact
 		 */
 		if (txn->xid != xid && curr_txn->ninvalidations > 0)
 			ReorderBufferAddInvalidations(builder->reorder, txn->xid, lsn,
-										  curr_txn->ninvalidations, curr_txn->invalidations);
+										  curr_txn->ninvalidations,
+										  curr_txn->invalidations);
 	}
 }
 
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 00e7024563..b8429be8cf 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -1739,12 +1739,6 @@ static void
 publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 {
 	publications_valid = false;
-
-	/*
-	 * Also invalidate per-relation cache so that next time the filtering info
-	 * is checked it will be updated with the new publication settings.
-	 */
-	rel_sync_cache_publication_cb(arg, cacheid, hashvalue);
 }
 
 /*
@@ -1920,18 +1914,6 @@ init_rel_sync_cache(MemoryContext cachectx)
 								  rel_sync_cache_publication_cb,
 								  (Datum) 0);
 
-	/*
-	 * Flush all cache entries after any publication changes.  (We need no
-	 * callback entry for pg_publication, because publication_invalidation_cb
-	 * will take care of it.)
-	 */
-	CacheRegisterSyscacheCallback(PUBLICATIONRELMAP,
-								  rel_sync_cache_publication_cb,
-								  (Datum) 0);
-	CacheRegisterSyscacheCallback(PUBLICATIONNAMESPACEMAP,
-								  rel_sync_cache_publication_cb,
-								  (Datum) 0);
-
 	relation_callbacks_registered = true;
 }
 
diff --git a/src/include/commands/publicationcmds.h b/src/include/commands/publicationcmds.h
index 5487c571f6..b953193812 100644
--- a/src/include/commands/publicationcmds.h
+++ b/src/include/commands/publicationcmds.h
@@ -35,5 +35,6 @@ extern bool pub_rf_contains_invalid_column(Oid pubid, Relation relation,
 										   List *ancestors, bool pubviaroot);
 extern bool pub_collist_contains_invalid_column(Oid pubid, Relation relation,
 												List *ancestors, bool pubviaroot);
+extern ObjectAddress RenamePublication(const char *oldname, const char *newname);
 
 #endif							/* PUBLICATIONCMDS_H */
-- 
2.34.1

#67Shlok Kyal
shlok.kyal.oss@gmail.com
In reply to: Masahiko Sawada (#37)
Re: long-standing data loss bug in initial sync of logical replication

On Wed, 31 Jul 2024 at 03:27, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Jul 24, 2024 at 9:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jul 17, 2024 at 5:25 PM vignesh C <vignesh21@gmail.com> wrote:

On Wed, 17 Jul 2024 at 11:54, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jul 16, 2024 at 6:54 PM vignesh C <vignesh21@gmail.com> wrote:

BTW, I noticed that we don't take any table-level locks for Create
Publication .. For ALL TABLES (and Drop Publication). Can that create
a similar problem? I haven't tested so not sure but even if there is a
problem for the Create case, it should lead to some ERROR like missing
publication.

I tested these scenarios, and as you expected, it throws an error for
the create publication case:
2024-07-17 14:50:01.145 IST [481526] 481526 ERROR: could not receive
data from WAL stream: ERROR: publication "pub1" does not exist
CONTEXT: slot "sub1", output plugin "pgoutput", in the change
callback, associated LSN 0/1510CD8
2024-07-17 14:50:01.147 IST [481450] 481450 LOG: background worker
"logical replication apply worker" (PID 481526) exited with exit code
1

The steps for this process are as follows:
1) Create tables in both the publisher and subscriber.
2) On the publisher: Create a replication slot.
3) On the subscriber: Create a subscription using the slot created by
the publisher.
4) On the publisher:
4.a) Session 1: BEGIN; INSERT INTO T1;
4.b) Session 2: CREATE PUBLICATION FOR ALL TABLES
4.c) Session 1: COMMIT;

Since we are throwing out a "publication does not exist" error, there
is no inconsistency issue here.
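
To make the race concrete, here is a minimal psql sketch of steps
4.a-4.c (it assumes the setup from steps 1-3, i.e. table T1 on both
nodes and a subscription on slot "sub1"; names match the log above):

-- Session 1 on the publisher
BEGIN;
INSERT INTO T1 VALUES (1);

-- Session 2 on the publisher, while session 1 is still open
CREATE PUBLICATION pub1 FOR ALL TABLES;

-- Session 1 on the publisher
COMMIT;

-- Session 1's transaction is decoded using a catalog snapshot taken
-- before pub1 was created, so pgoutput raises
--   ERROR:  publication "pub1" does not exist
-- and the apply worker exits, as in the log above.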

However, an issue persists when dropping an ALL TABLES publication:
data continues to replicate even after the publication is dropped.
This happens because the open transaction consumes the invalidation,
causing the publications to be revalidated using the old snapshot. As
a result, both the open transaction and the subsequent transactions
keep getting replicated.

We can reproduce this issue by following these steps in a logical
replication setup with an "ALL TABLES" publication:
On the publisher:
Session 1: BEGIN; INSERT INTO T1 VALUES (val1);
In another session on the publisher:
Session 2: DROP PUBLICATION
Back in Session 1 on the publisher:
COMMIT;
Finally, in Session 1 on the publisher:
INSERT INTO T1 VALUES (val2);

Even after dropping the publication, both val1 and val2 are still
being replicated to the subscriber. This means that both the
in-progress concurrent transaction and the subsequent transactions are
being replicated.
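
Written out as a psql sketch (same placeholders as above; pub1 is an
ALL TABLES publication that the subscription is already using):

-- Session 1 on the publisher
BEGIN;
INSERT INTO T1 VALUES (1);    -- "val1"

-- Session 2 on the publisher, while session 1 is still open
DROP PUBLICATION pub1;

-- Session 1 on the publisher
COMMIT;
INSERT INTO T1 VALUES (2);    -- "val2"

-- Unlike the CREATE case above, nothing errors out: the walsender keeps
-- using the stale publication information, so both rows arrive on the
-- subscriber even though the publication is gone.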

I don't think locking all tables is a viable solution in this case, as
it would require asking the user to refrain from performing any
operations on any of the tables in the database while creating a
publication.

Indeed, locking all tables in the database to prevent concurrent DMLs
for this scenario also looks odd to me. The other alternative
previously suggested by Andres is to distribute the invalidations of
catalog-modifying transactions to all concurrent in-progress
transactions [1], but as mentioned this could add overhead. One
possibility to reduce the overhead is to selectively distribute
invalidations only for publication-related catalogs, but I haven't
analyzed the feasibility.

We need more opinions to decide here, so let me summarize the problem
and solutions discussed. As explained with an example in an email [1],
the problem related to logical decoding is that it doesn't process
invalidations corresponding to DDLs for the already in-progress
transactions. We discussed preventing DMLs in the first place when
concurrent DDLs like ALTER PUBLICATION ... ADD TABLE ... are in
progress. The solution discussed was to acquire
ShareUpdateExclusiveLock for all the tables being added via such
commands. Further analysis revealed that the same handling is required
for ALTER PUBLICATION ... ADD TABLES IN SCHEMA which means locking all
the tables in the specified schemas. Then DROP PUBLICATION also seems
to have similar symptoms which means in the worst case (where
publication is for ALL TABLES) we have to lock all the tables in the
database. We are not sure if that is good so the other alternative we
can pursue is to distribute invalidations in logical decoding
infrastructure [1] which has its downsides.

Thoughts?

Thank you for summarizing the problem and solutions!

I think it's worth trying the idea of distributing invalidation
messages, and we will see if there could be overheads or any further
obstacles. IIUC this approach would resolve another issue we discussed
before, too [1].

Regards,

[1] /messages/by-id/CAD21AoAenVqiMjpN-PvGHL1N9DWnHSq673bfgr6phmBUzx=kLQ@mail.gmail.com

Hi Sawada-san,

I have tested the scenario shared by you on the thread [1]. And I
confirm that the latest patch [2] fixes this issue.

[1]: /messages/by-id/CAD21AoAenVqiMjpN-PvGHL1N9DWnHSq673bfgr6phmBUzx=kLQ@mail.gmail.com
[2]: /messages/by-id/CANhcyEWfqdUvn2d2KOdvkhebBi5VO6O8J+C6+OwsPNwCTM=akQ@mail.gmail.com

Thanks and Regards,
Shlok Kyal

#68Michael Paquier
michael@paquier.xyz
In reply to: Shlok Kyal (#67)
Re: long-standing data loss bug in initial sync of logical replication

On Tue, Oct 08, 2024 at 03:21:38PM +0530, Shlok Kyal wrote:

I have tested the scenario shared by you on the thread [1]. And I
confirm that the latest patch [2] fixes this issue.

[1] /messages/by-id/CAD21AoAenVqiMjpN-PvGHL1N9DWnHSq673bfgr6phmBUzx=kLQ@mail.gmail.com
[2] /messages/by-id/CANhcyEWfqdUvn2d2KOdvkhebBi5VO6O8J+C6+OwsPNwCTM=akQ@mail.gmail.com

Sawada-san, are you planning to look at that? It looks like this
thread is waiting for your input.
--
Michael

#69Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Michael Paquier (#68)
Re: long-standing data loss bug in initial sync of logical replication

On Mon, Dec 9, 2024 at 10:27 PM Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Oct 08, 2024 at 03:21:38PM +0530, Shlok Kyal wrote:

I have tested the scenario shared by you on the thread [1]. And I
confirm that the latest patch [2] fixes this issue.

[1] /messages/by-id/CAD21AoAenVqiMjpN-PvGHL1N9DWnHSq673bfgr6phmBUzx=kLQ@mail.gmail.com
[2] /messages/by-id/CANhcyEWfqdUvn2d2KOdvkhebBi5VO6O8J+C6+OwsPNwCTM=akQ@mail.gmail.com

Sawada-san, are you planning to look at that? It looks like this
thread is waiting for your input.

Sorry I lost track of this thread. I'll check the test results and patch soon.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#70Michael Paquier
michael@paquier.xyz
In reply to: Masahiko Sawada (#69)
Re: long-standing data loss bug in initial sync of logical replication

On Tue, Dec 10, 2024 at 01:50:16PM -0800, Masahiko Sawada wrote:

Sorry I lost track of this thread. I'll check the test results and
patch soon.

Thanks.
--
Michael

#71Shlok Kyal
shlok.kyal.oss@gmail.com
In reply to: Shlok Kyal (#66)
2 attachment(s)
Re: long-standing data loss bug in initial sync of logical replication

On Tue, 8 Oct 2024 at 11:11, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:

Hi Kuroda-san,

I have also modified the tests in 0001 patch. These changes are only
related to syntax of writing tests.

LGTM. I found small improvements, please find the attached.

I have applied the changes and updated the patch.

The patches needed a rebase. Attached are the rebased patches.

Thanks and Regards,
Shlok Kyal

Attachments:

v15-0001-Distribute-invalidatons-if-change-in-catalog-tab.patch (application/octet-stream)
From 8fa004c5e2e5ec985314870e0844daef50883f04 Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Fri, 23 Aug 2024 14:02:20 +0530
Subject: [PATCH v15 1/2] Distribute invalidatons if change in catalog tables

Distribute invalidations to in-progress transactions if the current
committed transaction changes any catalog table.
---
 .../replication/logical/reorderbuffer.c       |   5 +-
 src/backend/replication/logical/snapbuild.c   |  34 ++-
 src/include/replication/reorderbuffer.h       |   4 +
 src/test/subscription/t/100_bugs.pl           | 267 ++++++++++++++++++
 4 files changed, 296 insertions(+), 14 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index e3a5c7b660..c0ef3d429b 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -221,9 +221,6 @@ int			debug_logical_replication_streaming = DEBUG_LOGICAL_REP_STREAMING_BUFFERED
  */
 static ReorderBufferTXN *ReorderBufferGetTXN(ReorderBuffer *rb);
 static void ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static ReorderBufferTXN *ReorderBufferTXNByXid(ReorderBuffer *rb,
-											   TransactionId xid, bool create, bool *is_new,
-											   XLogRecPtr lsn, bool create_as_top);
 static void ReorderBufferTransferSnapToParent(ReorderBufferTXN *txn,
 											  ReorderBufferTXN *subtxn);
 
@@ -627,7 +624,7 @@ ReorderBufferReturnRelids(ReorderBuffer *rb, Oid *relids)
  * (with the given LSN, and as top transaction if that's specified);
  * when this happens, is_new is set to true.
  */
-static ReorderBufferTXN *
+ReorderBufferTXN *
 ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 					  bool *is_new, XLogRecPtr lsn, bool create_as_top)
 {
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index a6a4da3266..28a685dba0 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -161,7 +161,7 @@ static void SnapBuildFreeSnapshot(Snapshot snap);
 
 static void SnapBuildSnapIncRefcount(Snapshot snap);
 
-static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
+static void SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid);
 
 static inline bool SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid,
 												 uint32 xinfo);
@@ -720,18 +720,21 @@ SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
 }
 
 /*
- * Add a new Snapshot to all transactions we're decoding that currently are
- * in-progress so they can see new catalog contents made by the transaction
- * that just committed. This is necessary because those in-progress
- * transactions will use the new catalog's contents from here on (at the very
- * least everything they do needs to be compatible with newer catalog
- * contents).
+ * Add a new Snapshot and invalidation messages to all transactions we're
+ * decoding that currently are in-progress so they can see new catalog contents
+ * made by the transaction that just committed. This is necessary because those
+ * in-progress transactions will use the new catalog's contents from here on
+ * (at the very least everything they do needs to be compatible with newer
+ * catalog contents).
  */
 static void
-SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
+SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid)
 {
 	dlist_iter	txn_i;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *curr_txn;
+
+	curr_txn = ReorderBufferTXNByXid(builder->reorder, xid, false, NULL, InvalidXLogRecPtr, false);
 
 	/*
 	 * Iterate through all toplevel transactions. This can include
@@ -774,6 +777,14 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		SnapBuildSnapIncRefcount(builder->snapshot);
 		ReorderBufferAddSnapshot(builder->reorder, txn->xid, lsn,
 								 builder->snapshot);
+
+		/*
+		 * Add invalidation messages to the reorder buffer of inprogress
+		 * transactions except the current committed transaction
+		 */
+		if (txn->xid != xid && curr_txn->ninvalidations > 0)
+			ReorderBufferAddInvalidations(builder->reorder, txn->xid, lsn,
+										  curr_txn->ninvalidations, curr_txn->invalidations);
 	}
 }
 
@@ -1045,8 +1056,11 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/* refcount of the snapshot builder for the new snapshot */
 		SnapBuildSnapIncRefcount(builder->snapshot);
 
-		/* add a new catalog snapshot to all currently running transactions */
-		SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
+		/*
+		 * add a new catalog snapshot and invalidations messages to all
+		 * currently running transactions
+		 */
+		SnapBuildDistributeSnapshotAndInval(builder, lsn, xid);
 	}
 }
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 3bc365a7b0..12668849b7 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -733,6 +733,10 @@ extern TransactionId *ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb);
 
 extern void ReorderBufferSetRestartPoint(ReorderBuffer *rb, XLogRecPtr ptr);
 
+extern ReorderBufferTXN *ReorderBufferTXNByXid(ReorderBuffer *rb,
+											   TransactionId xid, bool create, bool *is_new,
+											   XLogRecPtr lsn, bool create_as_top);
+
 extern void StartupReorderBuffer(void);
 
 #endif
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 794b928f50..24817128ea 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -477,6 +477,273 @@ $result =
 is( $result, qq(2|f
 3|t), 'check replicated update on subscriber');
 
+# Clean up
+$node_publisher->safe_psql('postgres', "DROP PUBLICATION pub1");
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION sub1");
+
+# The bug was that the incremental data synchronization was being skipped when
+# a new table is added to the publication in presence of a concurrent active
+# transaction performing the DML on the same table.
+
+# Initial setup.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE SCHEMA sch3;
+	CREATE TABLE sch3.tab_conc(a int);
+	CREATE PUBLICATION regress_pub1;
+));
+
+$node_subscriber->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE SCHEMA sch3;
+	CREATE TABLE sch3.tab_conc(a int);
+	CREATE SUBSCRIPTION regress_sub1 CONNECTION '$publisher_connstr' PUBLICATION regress_pub1;
+));
+
+# Bump the query timeout to avoid false negatives on slow test systems.
+my $psql_timeout_secs = 4 * $PostgreSQL::Test::Utils::timeout_default;
+
+# Initiate 3 background sessions.
+my $background_psql1 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+$background_psql1->set_query_timer_restart();
+
+my $background_psql2 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+
+$background_psql2->set_query_timer_restart();
+
+my $background_psql3 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+$background_psql3->set_query_timer_restart();
+
+# Maintain an active transaction with the table that will be added to the
+# publication.
+$background_psql1->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO tab_conc VALUES (1);
+));
+
+# Maintain an active transaction with a schema table that will be added to the
+# publication.
+$background_psql2->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (1);
+));
+
+# Add the table to the publication using background_psql, as the alter
+# publication operation will distribute the invalidations to inprogress txns.
+$background_psql3->query_safe(
+	"ALTER PUBLICATION regress_pub1 ADD TABLE tab_conc, TABLES IN SCHEMA sch3"
+);
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe("COMMIT");
+$background_psql2->query_safe("COMMIT");
+
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (2);
+	INSERT INTO sch3.tab_conc VALUES (2);
+));
+
+# Refresh the publication.
+$node_subscriber->safe_psql('postgres',
+	"ALTER SUBSCRIPTION regress_sub1 REFRESH PUBLICATION");
+
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'regress_sub1');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2),
+	'Ensure that the data from the tab_conc table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2),
+	'Ensure that the data from the sch3.tab_conc table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+# Perform an insert.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (3);
+	INSERT INTO sch3.tab_conc VALUES (3);
+));
+$node_publisher->wait_for_catchup('regress_sub1');
+
+# Verify that the insert is replicated to the subscriber.
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table sch3.tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+# The bug was that the incremental data synchronization was happening even when
+# tables are dropped from the publication in presence of a concurrent active
+# transaction performing the DML on the same table.
+
+# Maintain an active transaction with the table that will be dropped from the
+# publication.
+$background_psql1->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO tab_conc VALUES (4);
+));
+
+# Maintain an active transaction with a schema table that will be dropped from the
+# publication.
+$background_psql2->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (4);
+));
+
+# Drop the table from the publication using background_psql, as the alter
+# publication operation will distribute the invalidations to inprogress txns.
+$background_psql3->query_safe(
+	"ALTER PUBLICATION regress_pub1 DROP TABLE tab_conc, TABLES IN SCHEMA sch3"
+);
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe("COMMIT");
+$background_psql2->query_safe("COMMIT");
+
+# Perform an insert.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (5);
+	INSERT INTO sch3.tab_conc VALUES (5);
+));
+
+$node_publisher->wait_for_catchup('regress_sub1');
+
+# Verify that the insert is not replicated to the subscriber.
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3
+4),
+	'Verify that data for table tab_conc are not replicated to subscriber');
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2
+3
+4),
+	'Verify that the incremental data for table sch3.tab_conc are not replicated to subscriber'
+);
+
+# The bug was that the incremental data synchronization was happening even after
+# publication is dropped in a concurrent active transaction.
+
+# Add tables to the publication.
+$background_psql3->query_safe(
+	"ALTER PUBLICATION regress_pub1 ADD TABLE tab_conc, TABLES IN SCHEMA sch3"
+);
+
+# Maintain an active transaction with the table.
+$background_psql1->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO tab_conc VALUES (6);
+));
+
+# Maintain an active transaction with a schema table.
+$background_psql2->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (6);
+));
+
+# Drop publication.
+$background_psql3->query_safe("DROP PUBLICATION regress_pub1");
+
+# Perform an insert.
+$background_psql1->query_safe("INSERT INTO tab_conc VALUES (7)");
+$background_psql2->query_safe("INSERT INTO sch3.tab_conc VALUES (7)");
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe("COMMIT");
+$background_psql2->query_safe("COMMIT");
+
+# ERROR should appear on subscriber.
+my $offset = -s $node_subscriber->logfile;
+$node_subscriber->wait_for_log(
+	qr/ERROR:  publication "regress_pub1" does not exist/, $offset);
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION regress_sub1");
+
+# The bug was that the incremental data synchronization was happening even after
+# publication is renamed in a concurrent active transaction.
+
+# Create publication.
+$background_psql3->query_safe(
+	"CREATE PUBLICATION regress_pub1 FOR TABLE tab_conc, TABLES IN SCHEMA sch3"
+);
+
+# Create subscription.
+$node_subscriber->safe_psql('postgres',
+	"CREATE SUBSCRIPTION regress_sub1 CONNECTION '$publisher_connstr' PUBLICATION regress_pub1"
+);
+
+# Maintain an active transaction with the table.
+$background_psql1->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO tab_conc VALUES (8);
+));
+
+# Maintain an active transaction with a schema table.
+$background_psql2->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (8);
+));
+
+# Rename publication.
+$background_psql3->query_safe(
+	"ALTER PUBLICATION regress_pub1 RENAME TO regress_pub1_rename");
+
+# Perform an insert.
+$background_psql1->query_safe("INSERT INTO tab_conc VALUES (9)");
+$background_psql2->query_safe("INSERT INTO sch3.tab_conc VALUES (9)");
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe("COMMIT");
+$background_psql2->query_safe("COMMIT");
+
+# ERROR should appear on subscriber.
+$offset = -s $node_subscriber->logfile;
+$node_subscriber->wait_for_log(
+	qr/ERROR:  publication "regress_pub1" does not exist/, $offset);
+
+$background_psql1->quit;
+$background_psql2->quit;
+$background_psql3->quit;
+
 $node_publisher->stop('fast');
 $node_subscriber->stop('fast');
 
-- 
2.34.1

v15-0002-Selective-Invalidation-of-Cache.patch (application/octet-stream)
From 46e5e3b85a528009a7235f2bcec670cd4fbfc47c Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Wed, 11 Dec 2024 08:48:55 +0530
Subject: [PATCH v15 2/2] Selective Invalidation of Cache

When we alter a publication or add/drop a namespace to/from a
publication, the cache for all tables is invalidated.
With this patch, for the above operations we invalidate the cache
of only the affected tables.
---
 src/backend/commands/alter.c                |   4 +-
 src/backend/commands/publicationcmds.c      | 107 ++++++++++++++++++++
 src/backend/parser/gram.y                   |   2 +-
 src/backend/replication/logical/snapbuild.c |   9 +-
 src/backend/replication/pgoutput/pgoutput.c |  18 ----
 src/include/commands/publicationcmds.h      |   1 +
 6 files changed, 118 insertions(+), 23 deletions(-)

diff --git a/src/backend/commands/alter.c b/src/backend/commands/alter.c
index a45f3bb6b8..79bd6b7cef 100644
--- a/src/backend/commands/alter.c
+++ b/src/backend/commands/alter.c
@@ -399,6 +399,9 @@ ExecRenameStmt(RenameStmt *stmt)
 		case OBJECT_TYPE:
 			return RenameType(stmt);
 
+		case OBJECT_PUBLICATION:
+			return RenamePublication(stmt->subname, stmt->newname);
+
 		case OBJECT_AGGREGATE:
 		case OBJECT_COLLATION:
 		case OBJECT_CONVERSION:
@@ -416,7 +419,6 @@ ExecRenameStmt(RenameStmt *stmt)
 		case OBJECT_TSDICTIONARY:
 		case OBJECT_TSPARSER:
 		case OBJECT_TSTEMPLATE:
-		case OBJECT_PUBLICATION:
 		case OBJECT_SUBSCRIPTION:
 			{
 				ObjectAddress address;
diff --git a/src/backend/commands/publicationcmds.c b/src/backend/commands/publicationcmds.c
index 5050057a7e..402cd44c5f 100644
--- a/src/backend/commands/publicationcmds.c
+++ b/src/backend/commands/publicationcmds.c
@@ -470,6 +470,87 @@ pub_contains_invalid_column(Oid pubid, Relation relation, List *ancestors,
 	return *invalid_column_list || *invalid_gen_col;
 }
 
+/*
+ * Execute ALTER PUBLICATION RENAME
+ */
+ObjectAddress
+RenamePublication(const char *oldname, const char *newname)
+{
+	Relation			rel;
+	HeapTuple			tup;
+	ObjectAddress		address;
+	Form_pg_publication	pubform;
+	bool				replaces[Natts_pg_publication];
+	bool				nulls[Natts_pg_publication];
+	Datum				values[Natts_pg_publication];
+
+	rel = table_open(PublicationRelationId, RowExclusiveLock);
+
+	tup = SearchSysCacheCopy1(PUBLICATIONNAME,
+							  CStringGetDatum(oldname));
+
+	if (!HeapTupleIsValid(tup))
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_OBJECT),
+				 errmsg("publication \"%s\" does not exist",
+						oldname)));
+
+	pubform = (Form_pg_publication) GETSTRUCT(tup);
+
+	/* must be owner */
+	if (!object_ownercheck(PublicationRelationId, pubform->oid, GetUserId()))
+		aclcheck_error(ACLCHECK_NOT_OWNER, OBJECT_PUBLICATION,
+					   NameStr(pubform->pubname));
+
+	/* Everything ok, form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* Only update the pubname */
+	values[Anum_pg_publication_pubname - 1] =
+		DirectFunctionCall1(namein, CStringGetDatum(newname));
+	replaces[Anum_pg_publication_pubname - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel), values, nulls,
+							replaces);
+
+	/* Invalidate the relcache. */
+	if (pubform->puballtables)
+	{
+		CacheInvalidateRelcacheAll();
+	}
+	else
+	{
+		List	   *relids = NIL;
+		List	   *schemarelids = NIL;
+
+		/*
+		 * For partition table, when we insert data, get_rel_sync_entry is
+		 * called and a hash entry is created for the corresponding leaf table.
+		 * So invalidating the leaf nodes would be sufficient here.
+		 */
+		relids = GetPublicationRelations(pubform->oid,
+										 PUBLICATION_PART_LEAF);
+		schemarelids = GetAllSchemaPublicationRelations(pubform->oid,
+														PUBLICATION_PART_LEAF);
+
+		relids = list_concat_unique_oid(relids, schemarelids);
+
+		InvalidatePublicationRels(relids);
+	}
+
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	ObjectAddressSet(address, PublicationRelationId, pubform->oid);
+
+	heap_freetuple(tup);
+
+	table_close(rel, RowExclusiveLock);
+
+	return address;
+}
+
 /* check_functions_in_node callback */
 static bool
 contain_mutable_or_user_functions_checker(Oid func_id, void *context)
@@ -1973,6 +2054,32 @@ AlterPublicationOwner_internal(Relation rel, HeapTuple tup, Oid newOwnerId)
 	}
 
 	form->pubowner = newOwnerId;
+
+	/* Invalidate the relcache. */
+	if (form->puballtables)
+	{
+		CacheInvalidateRelcacheAll();
+	}
+	else
+	{
+		List	   *relids = NIL;
+		List	   *schemarelids = NIL;
+
+		/*
+		 * For partition table, when we insert data, get_rel_sync_entry is
+		 * called and a hash entry is created for the corresponding leaf table.
+		 * So invalidating the leaf nodes would be sufficient here.
+		 */
+		relids = GetPublicationRelations(form->oid,
+										 PUBLICATION_PART_LEAF);
+		schemarelids = GetAllSchemaPublicationRelations(form->oid,
+														PUBLICATION_PART_LEAF);
+
+		relids = list_concat_unique_oid(relids, schemarelids);
+
+		InvalidatePublicationRels(relids);
+	}
+
 	CatalogTupleUpdate(rel, &tup->t_self, tup);
 
 	/* Update owner dependency reference */
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 67eb96396a..bc2da291ca 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -9475,7 +9475,7 @@ RenameStmt: ALTER AGGREGATE aggregate_with_argtypes RENAME TO name
 					RenameStmt *n = makeNode(RenameStmt);
 
 					n->renameType = OBJECT_PUBLICATION;
-					n->object = (Node *) makeString($3);
+					n->subname = $3;
 					n->newname = $6;
 					n->missing_ok = false;
 					$$ = (Node *) n;
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 28a685dba0..f00465b737 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -728,13 +728,15 @@ SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
  * catalog contents).
  */
 static void
-SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid)
+SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn,
+									TransactionId xid)
 {
 	dlist_iter	txn_i;
 	ReorderBufferTXN *txn;
 	ReorderBufferTXN *curr_txn;
 
-	curr_txn = ReorderBufferTXNByXid(builder->reorder, xid, false, NULL, InvalidXLogRecPtr, false);
+	curr_txn = ReorderBufferTXNByXid(builder->reorder, xid, false, NULL,
+									 InvalidXLogRecPtr, false);
 
 	/*
 	 * Iterate through all toplevel transactions. This can include
@@ -784,7 +786,8 @@ SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, Transact
 		 */
 		if (txn->xid != xid && curr_txn->ninvalidations > 0)
 			ReorderBufferAddInvalidations(builder->reorder, txn->xid, lsn,
-										  curr_txn->ninvalidations, curr_txn->invalidations);
+										  curr_txn->ninvalidations,
+										  curr_txn->invalidations);
 	}
 }
 
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index b50b3d62e3..c68955fa85 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -1781,12 +1781,6 @@ static void
 publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 {
 	publications_valid = false;
-
-	/*
-	 * Also invalidate per-relation cache so that next time the filtering info
-	 * is checked it will be updated with the new publication settings.
-	 */
-	rel_sync_cache_publication_cb(arg, cacheid, hashvalue);
 }
 
 /*
@@ -1962,18 +1956,6 @@ init_rel_sync_cache(MemoryContext cachectx)
 								  rel_sync_cache_publication_cb,
 								  (Datum) 0);
 
-	/*
-	 * Flush all cache entries after any publication changes.  (We need no
-	 * callback entry for pg_publication, because publication_invalidation_cb
-	 * will take care of it.)
-	 */
-	CacheRegisterSyscacheCallback(PUBLICATIONRELMAP,
-								  rel_sync_cache_publication_cb,
-								  (Datum) 0);
-	CacheRegisterSyscacheCallback(PUBLICATIONNAMESPACEMAP,
-								  rel_sync_cache_publication_cb,
-								  (Datum) 0);
-
 	relation_callbacks_registered = true;
 }
 
diff --git a/src/include/commands/publicationcmds.h b/src/include/commands/publicationcmds.h
index 19037518e8..b812d6d19f 100644
--- a/src/include/commands/publicationcmds.h
+++ b/src/include/commands/publicationcmds.h
@@ -38,5 +38,6 @@ extern bool pub_contains_invalid_column(Oid pubid, Relation relation,
 										bool pubgencols,
 										bool *invalid_column_list,
 										bool *invalid_gen_col);
+extern ObjectAddress RenamePublication(const char *oldname, const char *newname);
 
 #endif							/* PUBLICATIONCMDS_H */
-- 
2.34.1

#72Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Shlok Kyal (#67)
Re: long-standing data loss bug in initial sync of logical replication

On Tue, Oct 8, 2024 at 2:51 AM Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:

On Wed, 31 Jul 2024 at 03:27, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Jul 24, 2024 at 9:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jul 17, 2024 at 5:25 PM vignesh C <vignesh21@gmail.com> wrote:

On Wed, 17 Jul 2024 at 11:54, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jul 16, 2024 at 6:54 PM vignesh C <vignesh21@gmail.com> wrote:

BTW, I noticed that we don't take any table-level locks for Create
Publication .. For ALL TABLES (and Drop Publication). Can that create
a similar problem? I haven't tested so not sure but even if there is a
problem for the Create case, it should lead to some ERROR like missing
publication.

I tested these scenarios, and as you expected, it throws an error for
the create publication case:
2024-07-17 14:50:01.145 IST [481526] 481526 ERROR: could not receive
data from WAL stream: ERROR: publication "pub1" does not exist
CONTEXT: slot "sub1", output plugin "pgoutput", in the change
callback, associated LSN 0/1510CD8
2024-07-17 14:50:01.147 IST [481450] 481450 LOG: background worker
"logical replication apply worker" (PID 481526) exited with exit code
1

The steps for this process are as follows:
1) Create tables in both the publisher and subscriber.
2) On the publisher: Create a replication slot.
3) On the subscriber: Create a subscription using the slot created by
the publisher.
4) On the publisher:
4.a) Session 1: BEGIN; INSERT INTO T1;
4.b) Session 2: CREATE PUBLICATION FOR ALL TABLES
4.c) Session 1: COMMIT;

Since we are throwing out a "publication does not exist" error, there
is no inconsistency issue here.

However, an issue persists when dropping an ALL TABLES publication:
data continues to replicate even after the publication is dropped.
This happens because the open transaction consumes the invalidation,
causing the publications to be revalidated using the old snapshot. As
a result, both the open transaction and the subsequent transactions
keep getting replicated.

We can reproduce this issue by following these steps in a logical
replication setup with an "ALL TABLES" publication:
On the publisher:
Session 1: BEGIN; INSERT INTO T1 VALUES (val1);
In another session on the publisher:
Session 2: DROP PUBLICATION
Back in Session 1 on the publisher:
COMMIT;
Finally, in Session 1 on the publisher:
INSERT INTO T1 VALUES (val2);

Even after dropping the publication, both val1 and val2 are still
being replicated to the subscriber. This means that both the
in-progress concurrent transaction and the subsequent transactions are
being replicated.

I don't think locking all tables is a viable solution in this case, as
it would require asking the user to refrain from performing any
operations on any of the tables in the database while creating a
publication.

Indeed, locking all tables in the database to prevent concurrent DMLs
for this scenario also looks odd to me. The other alternative
previously suggested by Andres is to distribute the invalidations of
catalog-modifying transactions to all concurrent in-progress
transactions [1], but as mentioned this could add overhead. One
possibility to reduce the overhead is to selectively distribute
invalidations only for publication-related catalogs, but I haven't
analyzed the feasibility.

We need more opinions to decide here, so let me summarize the problem
and solutions discussed. As explained with an example in an email [1],
the problem related to logical decoding is that it doesn't process
invalidations corresponding to DDLs for the already in-progress
transactions. We discussed preventing DMLs in the first place when
concurrent DDLs like ALTER PUBLICATION ... ADD TABLE ... are in
progress. The solution discussed was to acquire
ShareUpdateExclusiveLock for all the tables being added via such
commands. Further analysis revealed that the same handling is required
for ALTER PUBLICATION ... ADD TABLES IN SCHEMA which means locking all
the tables in the specified schemas. Then DROP PUBLICATION also seems
to have similar symptoms which means in the worst case (where
publication is for ALL TABLES) we have to lock all the tables in the
database. We are not sure if that is good so the other alternative we
can pursue is to distribute invalidations in logical decoding
infrastructure [1] which has its downsides.

Thoughts?

Thank you for summarizing the problem and solutions!

I think it's worth trying the idea of distributing invalidation
messages, and we will see if there could be overheads or any further
obstacles. IIUC this approach would resolve another issue we discussed
before, too [1].

Regards,

[1] /messages/by-id/CAD21AoAenVqiMjpN-PvGHL1N9DWnHSq673bfgr6phmBUzx=kLQ@mail.gmail.com

Hi Sawada-san,

I have tested the scenario shared by you on the thread [1]. And I
confirm that the latest patch [2] fixes this issue.

I confirmed that the proposed patch fixes these issues. I have one
question about the patch:

In the main loop in SnapBuildDistributeSnapshotAndInval(), we have the
following code:

/*
 * If we don't have a base snapshot yet, there are no changes in this
 * transaction which in turn implies we don't yet need a snapshot at
 * all. We'll add a snapshot when the first change gets queued.
 *
 * NB: This works correctly even for subtransactions because
 * ReorderBufferAssignChild() takes care to transfer the base snapshot
 * to the top-level transaction, and while iterating the changequeue
 * we'll get the change from the subtxn.
 */
if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, txn->xid))
    continue;

Is there any case where we need to distribute inval messages to
transactions that don't have the base snapshot yet but eventually need
the inval messages?
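
To make the case concrete, I think it would be something like the
following (a hypothetical sequence; names are taken from the test only
for illustration, and whether the decoder really lacks a base snapshot
at that point is exactly what we would need to check):

-- Session 1 on the publisher: start a transaction but queue no
-- decodable change yet
BEGIN;

-- Session 2 on the publisher: commit the catalog change
ALTER PUBLICATION regress_pub1 ADD TABLE tab_conc;

-- Session 1: the first change is queued only now, after the ALTER has
-- already committed; only at this point does the transaction get its
-- base snapshot
INSERT INTO tab_conc VALUES (1);
COMMIT;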

Overall, with this idea, we distribute invalidation messages to all
concurrently decoded transactions. It could introduce performance
regressions for several reasons. For example, we could end up
invalidating RelationSyncCache entries in more cases. While this is
addressed by your selective cache invalidation patch, there is still a
5% regression. We might need to accept a certain amount of regression
for making it correct, but it would be better to figure out where these
regressions come from. Other than that, I think a performance
regression could happen due to the cost of distributing invalidation
messages. You've already observed a 1~3% performance regression in
cases where we distribute a large amount of invalidation messages to
one concurrently decoded transaction [1]. I guess that the selective
cache invalidation idea would not help this case. Also, I think we
might want to test other cases, like where we distribute a small
amount of invalidation messages to many concurrently decoded
transactions.
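
For that last case, a rough outline of such a test could look like this
(a sketch only; the number of sessions and the added table are made up):

-- Sessions 1..N on the publisher: each keeps a transaction open with a
-- pending change, so each becomes a concurrently decoded transaction
BEGIN;
INSERT INTO tab_conc VALUES (1);

-- One more session on the publisher: a DDL that produces only a few
-- invalidation messages, which get distributed to all N open txns
ALTER PUBLICATION regress_pub1 ADD TABLE tab_conc2;

-- Sessions 1..N: commit, then compare decoding time / walsender CPU
-- with and without the patch
COMMIT;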

Regards,

[1]: /messages/by-id/CANhcyEX+C3G68W51myHWfbpAdmSXDwHdMsWUa+zHBF_QKKvZMw@mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#73Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#72)
Re: long-standing data loss bug in initial sync of logical replication

On Wed, Dec 11, 2024 at 12:37 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I confirmed that the proposed patch fixes these issues. I have one
question about the patch:

In the main loop in SnapBuildDistributeSnapshotAndInval(), we have the
following code:

/*
 * If we don't have a base snapshot yet, there are no changes in this
 * transaction which in turn implies we don't yet need a snapshot at
 * all. We'll add a snapshot when the first change gets queued.
 *
 * NB: This works correctly even for subtransactions because
 * ReorderBufferAssignChild() takes care to transfer the base snapshot
 * to the top-level transaction, and while iterating the changequeue
 * we'll get the change from the subtxn.
 */
if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, txn->xid))
    continue;

Is there any case where we need to distribute inval messages to
transactions that don't have the base snapshot yet but eventually need
the inval messages?

Good point. It is mentioned that for snapshots: "We'll add a snapshot
when the first change gets queued." I think we achieve this via the
builder->committed.xip array, such that when we set a base snapshot
for a transaction, we use that array to form the snapshot. However, I
don't see any such consideration for invalidations. Now, we could
either always add invalidations to xacts that don't have a
base_snapshot yet, or have a mechanism similar to the committed.xip
array. But it is better to first reproduce the problem.

Overall, with this idea, we distribute invalidation messages to all
concurrently decoded transactions. It could introduce performance
regressions for several reasons. For example, we could end up
invalidating RelationSyncCache entries in more cases. While this is
addressed by your selective cache invalidation patch, there is still a
5% regression. We might need to accept a certain amount of regression
for making it correct, but it would be better to figure out where these
regressions come from. Other than that, I think a performance
regression could happen due to the cost of distributing invalidation
messages. You've already observed a 1~3% performance regression in
cases where we distribute a large amount of invalidation messages to
one concurrently decoded transaction [1]. I guess that the selective
cache invalidation idea would not help this case. Also, I think we
might want to test other cases, like where we distribute a small
amount of invalidation messages to many concurrently decoded
transactions.

+1.

--
With Regards,
Amit Kapila.

#74Shlok Kyal
shlok.kyal.oss@gmail.com
In reply to: Shlok Kyal (#71)
2 attachment(s)
Re: long-standing data loss bug in initial sync of logical replication

On Wed, 11 Dec 2024 at 09:13, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:

On Tue, 8 Oct 2024 at 11:11, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:

Hi Kuroda-san,

I have also modified the tests in 0001 patch. These changes are only
related to syntax of writing tests.

LGTM. I found small improvements, please find the attached.

I have applied the changes and updated the patch.

The patches needed a rebase. Attached are the rebased patches.

The patches need a rebase. Attached are the rebased patches.

Thanks and regards,
Shlok Kyal

Attachments:

v16-0001-Distribute-invalidatons-if-change-in-catalog-tab.patch (application/octet-stream)
From c6f01f4b335b724024292e6f69e7498d84055429 Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Fri, 23 Aug 2024 14:02:20 +0530
Subject: [PATCH v16 1/2] Distribute invalidatons if change in catalog tables

Distribute invalidations to in-progress transactions if the current
committed transaction changes any catalog table.
---
 .../replication/logical/reorderbuffer.c       |   5 +-
 src/backend/replication/logical/snapbuild.c   |  34 ++-
 src/include/replication/reorderbuffer.h       |   4 +
 src/test/subscription/t/100_bugs.pl           | 267 ++++++++++++++++++
 4 files changed, 296 insertions(+), 14 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b42f4002ba8..340fdd50f8a 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -222,9 +222,6 @@ int			debug_logical_replication_streaming = DEBUG_LOGICAL_REP_STREAMING_BUFFERED
  */
 static ReorderBufferTXN *ReorderBufferGetTXN(ReorderBuffer *rb);
 static void ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static ReorderBufferTXN *ReorderBufferTXNByXid(ReorderBuffer *rb,
-											   TransactionId xid, bool create, bool *is_new,
-											   XLogRecPtr lsn, bool create_as_top);
 static void ReorderBufferTransferSnapToParent(ReorderBufferTXN *txn,
 											  ReorderBufferTXN *subtxn);
 
@@ -630,7 +627,7 @@ ReorderBufferReturnRelids(ReorderBuffer *rb, Oid *relids)
  * (with the given LSN, and as top transaction if that's specified);
  * when this happens, is_new is set to true.
  */
-static ReorderBufferTXN *
+ReorderBufferTXN *
 ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 					  bool *is_new, XLogRecPtr lsn, bool create_as_top)
 {
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index bd0680dcbe5..1716340799b 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -161,7 +161,7 @@ static void SnapBuildFreeSnapshot(Snapshot snap);
 
 static void SnapBuildSnapIncRefcount(Snapshot snap);
 
-static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
+static void SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid);
 
 static inline bool SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid,
 												 uint32 xinfo);
@@ -720,18 +720,21 @@ SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
 }
 
 /*
- * Add a new Snapshot to all transactions we're decoding that currently are
- * in-progress so they can see new catalog contents made by the transaction
- * that just committed. This is necessary because those in-progress
- * transactions will use the new catalog's contents from here on (at the very
- * least everything they do needs to be compatible with newer catalog
- * contents).
+ * Add a new Snapshot and invalidation messages to all transactions we're
+ * decoding that currently are in-progress so they can see new catalog contents
+ * made by the transaction that just committed. This is necessary because those
+ * in-progress transactions will use the new catalog's contents from here on
+ * (at the very least everything they do needs to be compatible with newer
+ * catalog contents).
  */
 static void
-SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
+SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid)
 {
 	dlist_iter	txn_i;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *curr_txn;
+
+	curr_txn = ReorderBufferTXNByXid(builder->reorder, xid, false, NULL, InvalidXLogRecPtr, false);
 
 	/*
 	 * Iterate through all toplevel transactions. This can include
@@ -774,6 +777,14 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		SnapBuildSnapIncRefcount(builder->snapshot);
 		ReorderBufferAddSnapshot(builder->reorder, txn->xid, lsn,
 								 builder->snapshot);
+
+		/*
+		 * Add invalidation messages to the reorder buffer of inprogress
+		 * transactions except the current committed transaction
+		 */
+		if (txn->xid != xid && curr_txn->ninvalidations > 0)
+			ReorderBufferAddInvalidations(builder->reorder, txn->xid, lsn,
+										  curr_txn->ninvalidations, curr_txn->invalidations);
 	}
 }
 
@@ -1045,8 +1056,11 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/* refcount of the snapshot builder for the new snapshot */
 		SnapBuildSnapIncRefcount(builder->snapshot);
 
-		/* add a new catalog snapshot to all currently running transactions */
-		SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
+		/*
+		 * add a new catalog snapshot and invalidations messages to all
+		 * currently running transactions
+		 */
+		SnapBuildDistributeSnapshotAndInval(builder, lsn, xid);
 	}
 }
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 517a8e3634f..4f3ceef0092 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -759,6 +759,10 @@ extern TransactionId *ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb);
 
 extern void ReorderBufferSetRestartPoint(ReorderBuffer *rb, XLogRecPtr ptr);
 
+extern ReorderBufferTXN *ReorderBufferTXNByXid(ReorderBuffer *rb,
+											   TransactionId xid, bool create, bool *is_new,
+											   XLogRecPtr lsn, bool create_as_top);
+
 extern void StartupReorderBuffer(void);
 
 #endif
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 83120f1cb6f..d6dbeebed54 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -477,6 +477,273 @@ $result =
 is( $result, qq(2|f
 3|t), 'check replicated update on subscriber');
 
+# Clean up
+$node_publisher->safe_psql('postgres', "DROP PUBLICATION pub1");
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION sub1");
+
+# The bug was that the incremental data synchronization was being skipped when
+# a new table is added to the publication in presence of a concurrent active
+# transaction performing the DML on the same table.
+
+# Initial setup.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE SCHEMA sch3;
+	CREATE TABLE sch3.tab_conc(a int);
+	CREATE PUBLICATION regress_pub1;
+));
+
+$node_subscriber->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE SCHEMA sch3;
+	CREATE TABLE sch3.tab_conc(a int);
+	CREATE SUBSCRIPTION regress_sub1 CONNECTION '$publisher_connstr' PUBLICATION regress_pub1;
+));
+
+# Bump the query timeout to avoid false negatives on slow test systems.
+my $psql_timeout_secs = 4 * $PostgreSQL::Test::Utils::timeout_default;
+
+# Initiate 3 background sessions.
+my $background_psql1 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+$background_psql1->set_query_timer_restart();
+
+my $background_psql2 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+
+$background_psql2->set_query_timer_restart();
+
+my $background_psql3 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+$background_psql3->set_query_timer_restart();
+
+# Maintain an active transaction with the table that will be added to the
+# publication.
+$background_psql1->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO tab_conc VALUES (1);
+));
+
+# Maintain an active transaction with a schema table that will be added to the
+# publication.
+$background_psql2->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (1);
+));
+
+# Add the table to the publication using background_psql, as the alter
+# publication operation will distribute the invalidations to inprogress txns.
+$background_psql3->query_safe(
+	"ALTER PUBLICATION regress_pub1 ADD TABLE tab_conc, TABLES IN SCHEMA sch3"
+);
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe("COMMIT");
+$background_psql2->query_safe("COMMIT");
+
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (2);
+	INSERT INTO sch3.tab_conc VALUES (2);
+));
+
+# Refresh the publication.
+$node_subscriber->safe_psql('postgres',
+	"ALTER SUBSCRIPTION regress_sub1 REFRESH PUBLICATION");
+
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'regress_sub1');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2),
+	'Ensure that the data from the tab_conc table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2),
+	'Ensure that the data from the sch3.tab_conc table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+# Perform an insert.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (3);
+	INSERT INTO sch3.tab_conc VALUES (3);
+));
+$node_publisher->wait_for_catchup('regress_sub1');
+
+# Verify that the insert is replicated to the subscriber.
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table sch3.tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+# The bug was that the incremental data synchronization was happening even when
+# tables are dropped from the publication in presence of a concurrent active
+# transaction performing the DML on the same table.
+
+# Maintain an active transaction with the table that will be dropped from the
+# publication.
+$background_psql1->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO tab_conc VALUES (4);
+));
+
+# Maintain an active transaction with a schema table that will be dropped from the
+# publication.
+$background_psql2->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (4);
+));
+
+# Drop the table from the publication using background_psql, as the alter
+# publication operation will distribute the invalidations to inprogress txns.
+$background_psql3->query_safe(
+	"ALTER PUBLICATION regress_pub1 DROP TABLE tab_conc, TABLES IN SCHEMA sch3"
+);
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe("COMMIT");
+$background_psql2->query_safe("COMMIT");
+
+# Perform an insert.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (5);
+	INSERT INTO sch3.tab_conc VALUES (5);
+));
+
+$node_publisher->wait_for_catchup('regress_sub1');
+
+# Verify that the insert is not replicated to the subscriber.
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3
+4),
+	'Verify that data for table tab_conc are not replicated to subscriber');
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2
+3
+4),
+	'Verify that the incremental data for table sch3.tab_conc are not replicated to subscriber'
+);
+
+# The bug was that the incremental data synchronization was happening even after
+# publication is dropped in a concurrent active transaction.
+
+# Add tables to the publication.
+$background_psql3->query_safe(
+	"ALTER PUBLICATION regress_pub1 ADD TABLE tab_conc, TABLES IN SCHEMA sch3"
+);
+
+# Maintain an active transaction with the table.
+$background_psql1->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO tab_conc VALUES (6);
+));
+
+# Maintain an active transaction with a schema table.
+$background_psql2->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (6);
+));
+
+# Drop publication.
+$background_psql3->query_safe("DROP PUBLICATION regress_pub1");
+
+# Perform an insert.
+$background_psql1->query_safe("INSERT INTO tab_conc VALUES (7)");
+$background_psql2->query_safe("INSERT INTO sch3.tab_conc VALUES (7)");
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe("COMMIT");
+$background_psql2->query_safe("COMMIT");
+
+# ERROR should appear on subscriber.
+my $offset = -s $node_subscriber->logfile;
+$node_subscriber->wait_for_log(
+	qr/ERROR:  publication "regress_pub1" does not exist/, $offset);
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION regress_sub1");
+
+# The bug was that the incremental data synchronization was happening even after
+# publication is renamed in a concurrent active transaction.
+
+# Create publication.
+$background_psql3->query_safe(
+	"CREATE PUBLICATION regress_pub1 FOR TABLE tab_conc, TABLES IN SCHEMA sch3"
+);
+
+# Create subscription.
+$node_subscriber->safe_psql('postgres',
+	"CREATE SUBSCRIPTION regress_sub1 CONNECTION '$publisher_connstr' PUBLICATION regress_pub1"
+);
+
+# Maintain an active transaction with the table.
+$background_psql1->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO tab_conc VALUES (8);
+));
+
+# Maintain an active transaction with a schema table.
+$background_psql2->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (8);
+));
+
+# Rename publication.
+$background_psql3->query_safe(
+	"ALTER PUBLICATION regress_pub1 RENAME TO regress_pub1_rename");
+
+# Perform an insert.
+$background_psql1->query_safe("INSERT INTO tab_conc VALUES (9)");
+$background_psql2->query_safe("INSERT INTO sch3.tab_conc VALUES (9)");
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe("COMMIT");
+$background_psql2->query_safe("COMMIT");
+
+# ERROR should appear on subscriber.
+$offset = -s $node_subscriber->logfile;
+$node_subscriber->wait_for_log(
+	qr/ERROR:  publication "regress_pub1" does not exist/, $offset);
+
+$background_psql1->quit;
+$background_psql2->quit;
+$background_psql3->quit;
+
 $node_publisher->stop('fast');
 $node_subscriber->stop('fast');
 
-- 
2.41.0.windows.3

v16-0002-Selective-Invalidation-of-Cache.patch (application/octet-stream)
From 65c97c89c984475c0e28a26796c983ec9acc5e46 Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Mon, 24 Feb 2025 15:34:31 +0530
Subject: [PATCH v16 2/2] Selective Invalidation of Cache

When we alter a publication, or add/drop a namespace to/from a
publication, the caches for all the tables are invalidated.
With this patch, for the above operations we invalidate the
caches of only the desired tables.
---
 src/backend/commands/alter.c                |   4 +-
 src/backend/commands/publicationcmds.c      | 107 ++++++++++++++++++++
 src/backend/parser/gram.y                   |   2 +-
 src/backend/replication/pgoutput/pgoutput.c |  18 ----
 src/include/commands/publicationcmds.h      |   1 +
 5 files changed, 112 insertions(+), 20 deletions(-)

diff --git a/src/backend/commands/alter.c b/src/backend/commands/alter.c
index 78c1d4e1b84..a79329acc1f 100644
--- a/src/backend/commands/alter.c
+++ b/src/backend/commands/alter.c
@@ -400,6 +400,9 @@ ExecRenameStmt(RenameStmt *stmt)
 		case OBJECT_TYPE:
 			return RenameType(stmt);
 
+		case OBJECT_PUBLICATION:
+			return RenamePublication(stmt->subname, stmt->newname);
+
 		case OBJECT_AGGREGATE:
 		case OBJECT_COLLATION:
 		case OBJECT_CONVERSION:
@@ -417,7 +420,6 @@ ExecRenameStmt(RenameStmt *stmt)
 		case OBJECT_TSDICTIONARY:
 		case OBJECT_TSPARSER:
 		case OBJECT_TSTEMPLATE:
-		case OBJECT_PUBLICATION:
 		case OBJECT_SUBSCRIPTION:
 			{
 				ObjectAddress address;
diff --git a/src/backend/commands/publicationcmds.c b/src/backend/commands/publicationcmds.c
index 150a768d16f..182d2187f1c 100644
--- a/src/backend/commands/publicationcmds.c
+++ b/src/backend/commands/publicationcmds.c
@@ -491,6 +491,87 @@ pub_contains_invalid_column(Oid pubid, Relation relation, List *ancestors,
 	return *invalid_column_list || *invalid_gen_col;
 }
 
+/*
+ * Execute ALTER PUBLICATION RENAME
+ */
+ObjectAddress
+RenamePublication(const char *oldname, const char *newname)
+{
+	Relation			rel;
+	HeapTuple			tup;
+	ObjectAddress		address;
+	Form_pg_publication	pubform;
+	bool				replaces[Natts_pg_publication];
+	bool				nulls[Natts_pg_publication];
+	Datum				values[Natts_pg_publication];
+
+	rel = table_open(PublicationRelationId, RowExclusiveLock);
+
+	tup = SearchSysCacheCopy1(PUBLICATIONNAME,
+							  CStringGetDatum(oldname));
+
+	if (!HeapTupleIsValid(tup))
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_OBJECT),
+				 errmsg("publication \"%s\" does not exist",
+						oldname)));
+
+	pubform = (Form_pg_publication) GETSTRUCT(tup);
+
+	/* must be owner */
+	if (!object_ownercheck(PublicationRelationId, pubform->oid, GetUserId()))
+		aclcheck_error(ACLCHECK_NOT_OWNER, OBJECT_PUBLICATION,
+					   NameStr(pubform->pubname));
+
+	/* Everything ok, form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* Only update the pubname */
+	values[Anum_pg_publication_pubname - 1] =
+		DirectFunctionCall1(namein, CStringGetDatum(newname));
+	replaces[Anum_pg_publication_pubname - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel), values, nulls,
+							replaces);
+
+	/* Invalidate the relcache. */
+	if (pubform->puballtables)
+	{
+		CacheInvalidateRelcacheAll();
+	}
+	else
+	{
+		List	   *relids = NIL;
+		List	   *schemarelids = NIL;
+
+		/*
+		 * For partition table, when we insert data, get_rel_sync_entry is
+		 * called and a hash entry is created for the corresponding leaf table.
+		 * So invalidating the leaf nodes would be sufficient here.
+		 */
+		relids = GetPublicationRelations(pubform->oid,
+										 PUBLICATION_PART_LEAF);
+		schemarelids = GetAllSchemaPublicationRelations(pubform->oid,
+														PUBLICATION_PART_LEAF);
+
+		relids = list_concat_unique_oid(relids, schemarelids);
+
+		InvalidatePublicationRels(relids);
+	}
+
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	ObjectAddressSet(address, PublicationRelationId, pubform->oid);
+
+	heap_freetuple(tup);
+
+	table_close(rel, RowExclusiveLock);
+
+	return address;
+}
+
 /* check_functions_in_node callback */
 static bool
 contain_mutable_or_user_functions_checker(Oid func_id, void *context)
@@ -1996,6 +2077,32 @@ AlterPublicationOwner_internal(Relation rel, HeapTuple tup, Oid newOwnerId)
 	}
 
 	form->pubowner = newOwnerId;
+
+	/* Invalidate the relcache. */
+	if (form->puballtables)
+	{
+		CacheInvalidateRelcacheAll();
+	}
+	else
+	{
+		List	   *relids = NIL;
+		List	   *schemarelids = NIL;
+
+		/*
+		 * For partition table, when we insert data, get_rel_sync_entry is
+		 * called and a hash entry is created for the corresponding leaf table.
+		 * So invalidating the leaf nodes would be sufficient here.
+		 */
+		relids = GetPublicationRelations(form->oid,
+										 PUBLICATION_PART_LEAF);
+		schemarelids = GetAllSchemaPublicationRelations(form->oid,
+														PUBLICATION_PART_LEAF);
+
+		relids = list_concat_unique_oid(relids, schemarelids);
+
+		InvalidatePublicationRels(relids);
+	}
+
 	CatalogTupleUpdate(rel, &tup->t_self, tup);
 
 	/* Update owner dependency reference */
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 7d99c9355c6..49fe0567c57 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -9513,7 +9513,7 @@ RenameStmt: ALTER AGGREGATE aggregate_with_argtypes RENAME TO name
 					RenameStmt *n = makeNode(RenameStmt);
 
 					n->renameType = OBJECT_PUBLICATION;
-					n->object = (Node *) makeString($3);
+					n->subname = $3;
 					n->newname = $6;
 					n->missing_ok = false;
 					$$ = (Node *) n;
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 7d464f656aa..b28ce636d50 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -1789,12 +1789,6 @@ static void
 publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 {
 	publications_valid = false;
-
-	/*
-	 * Also invalidate per-relation cache so that next time the filtering info
-	 * is checked it will be updated with the new publication settings.
-	 */
-	rel_sync_cache_publication_cb(arg, cacheid, hashvalue);
 }
 
 /*
@@ -1970,18 +1964,6 @@ init_rel_sync_cache(MemoryContext cachectx)
 								  rel_sync_cache_publication_cb,
 								  (Datum) 0);
 
-	/*
-	 * Flush all cache entries after any publication changes.  (We need no
-	 * callback entry for pg_publication, because publication_invalidation_cb
-	 * will take care of it.)
-	 */
-	CacheRegisterSyscacheCallback(PUBLICATIONRELMAP,
-								  rel_sync_cache_publication_cb,
-								  (Datum) 0);
-	CacheRegisterSyscacheCallback(PUBLICATIONNAMESPACEMAP,
-								  rel_sync_cache_publication_cb,
-								  (Datum) 0);
-
 	relation_callbacks_registered = true;
 }
 
diff --git a/src/include/commands/publicationcmds.h b/src/include/commands/publicationcmds.h
index e11a942ea0f..3dfceef70f9 100644
--- a/src/include/commands/publicationcmds.h
+++ b/src/include/commands/publicationcmds.h
@@ -38,5 +38,6 @@ extern bool pub_contains_invalid_column(Oid pubid, Relation relation,
 										char pubgencols_type,
 										bool *invalid_column_list,
 										bool *invalid_gen_col);
+extern ObjectAddress RenamePublication(const char *oldname, const char *newname);
 
 #endif							/* PUBLICATIONCMDS_H */
-- 
2.41.0.windows.3

#75Benoit Lobréau
benoit.lobreau@dalibo.com
In reply to: Masahiko Sawada (#72)
Re: long-standing data loss bug in initial sync of logical replication

Hi,

After reading the thread and doing a bit of testing, the problem seems
significant and is still present. The fact that it's probably not well
known makes it more concerning, in my opinion. I was wondering what
could be done to help move this topic forward (given my limited abilities)?

--
Benoit Lobréau
Consultant
http://dalibo.com

#76Amit Kapila
amit.kapila16@gmail.com
In reply to: Benoit Lobréau (#75)
Re: long-standing data loss bug in initial sync of logical replication

On Tue, Feb 25, 2025 at 3:56 PM Benoit Lobréau
<benoit.lobreau@dalibo.com> wrote:

After reading the thread and doing a bit of testing, the problem seems
significant and is still present. The fact that it's probably not well
known makes it more concerning, in my opinion. I was wondering what
could be done to help move this topic forward (given my limited abilities)?

You can help with the review/testing of the proposed patch and, if
possible, with evaluating its performance impact. Shlok has done some
performance testing of the patch, which you could repeat independently,
and Sawada-San has asked for more performance tests in his last
email (1), which you could also help with.

(1) - /messages/by-id/CAD21AoDoWc8MWTyKtmNF_606bcW6J0gV==r=VmPXKUN-e3o9ew@mail.gmail.com

--
With Regards,
Amit Kapila.

#77Shlok Kyal
shlok.kyal.oss@gmail.com
In reply to: Masahiko Sawada (#72)
1 attachment(s)
Re: long-standing data loss bug in initial sync of logical replication

On Wed, 11 Dec 2024 at 12:37, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Oct 8, 2024 at 2:51 AM Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:

On Wed, 31 Jul 2024 at 03:27, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Jul 24, 2024 at 9:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jul 17, 2024 at 5:25 PM vignesh C <vignesh21@gmail.com> wrote:

On Wed, 17 Jul 2024 at 11:54, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jul 16, 2024 at 6:54 PM vignesh C <vignesh21@gmail.com> wrote:

BTW, I noticed that we don't take any table-level locks for Create
Publication .. For ALL TABLES (and Drop Publication). Can that create
a similar problem? I haven't tested so not sure but even if there is a
problem for the Create case, it should lead to some ERROR like missing
publication.

I tested these scenarios, and as you expected, it throws an error for
the create publication case:
2024-07-17 14:50:01.145 IST [481526] 481526 ERROR: could not receive
data from WAL stream: ERROR: publication "pub1" does not exist
CONTEXT: slot "sub1", output plugin "pgoutput", in the change
callback, associated LSN 0/1510CD8
2024-07-17 14:50:01.147 IST [481450] 481450 LOG: background worker
"logical replication apply worker" (PID 481526) exited with exit code
1

The steps for this process are as follows:
1) Create tables in both the publisher and subscriber.
2) On the publisher: Create a replication slot.
3) On the subscriber: Create a subscription using the slot created by
the publisher.
4) On the publisher:
4.a) Session 1: BEGIN; INSERT INTO T1;
4.b) Session 2: CREATE PUBLICATION FOR ALL TABLES
4.c) Session 1: COMMIT;

Since we are throwing out a "publication does not exist" error, there
is no inconsistency issue here.

However, an issue persists with DROP ALL TABLES publication, where
data continues to replicate even after the publication is dropped.
This happens because the open transaction consumes the invalidation,
causing the publications to be revalidated using old snapshot. As a
result, both the open transactions and the subsequent transactions are
getting replicated.

We can reproduce this issue by following these steps in a logical
replication setup with an "ALL TABLES" publication:
On the publisher:
Session 1: BEGIN; INSERT INTO T1 VALUES (val1);
In another session on the publisher:
Session 2: DROP PUBLICATION
Back in Session 1 on the publisher:
COMMIT;
Finally, in Session 1 on the publisher:
INSERT INTO T1 VALUES (val2);

Even after dropping the publication, both val1 and val2 are still
being replicated to the subscriber. This means that both the
in-progress concurrent transaction and the subsequent transactions are
being replicated.
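
To make that concrete, the same sequence as a psql sketch (the values
are only illustrative; t1 is assumed to be covered by the ALL TABLES
publication pub1):

-- Session 1 (publisher)
BEGIN;
INSERT INTO t1 VALUES (1);   -- val1

-- Session 2 (publisher)
DROP PUBLICATION pub1;

-- Session 1 (publisher)
COMMIT;
INSERT INTO t1 VALUES (2);   -- val2, in a new transaction

-- Observed: both rows still arrive on the subscriber, even though the
-- publication had already been dropped when they were committed.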

I don't think locking all tables is a viable solution in this case, as
it would require asking the user to refrain from performing any
operations on any of the tables in the database while creating a
publication.

Indeed, locking all tables in the database to prevent concurrent DMLs
for this scenario also looks odd to me. The other alternative
previously suggested by Andres is to distribute catalog modifying
transactions to all concurrent in-progress transactions [1] but as
mentioned this could add an overhead. One possibility to reduce
overhead is that we selectively distribute invalidations for
catalogs-related publications but I haven't analyzed the feasibility.

We need more opinions to decide here, so let me summarize the problem
and solutions discussed. As explained with an example in an email [1],
the problem related to logical decoding is that it doesn't process
invalidations corresponding to DDLs for the already in-progress
transactions. We discussed preventing DMLs in the first place when
concurrent DDLs like ALTER PUBLICATION ... ADD TABLE ... are in
progress. The solution discussed was to acquire
ShareUpdateExclusiveLock for all the tables being added via such
commands. Further analysis revealed that the same handling is required
for ALTER PUBLICATION ... ADD TABLES IN SCHEMA which means locking all
the tables in the specified schemas. Then DROP PUBLICATION also seems
to have similar symptoms which means in the worst case (where
publication is for ALL TABLES) we have to lock all the tables in the
database. We are not sure if that is good so the other alternative we
can pursue is to distribute invalidations in logical decoding
infrastructure [1] which has its downsides.

Thoughts?

Thank you for summarizing the problem and solutions!

I think it's worth trying the idea of distributing invalidation
messages, and we will see if there could be overheads or any further
obstacles. IIUC this approach would resolve another issue we discussed
before too[1].

Regards,

[1] /messages/by-id/CAD21AoAenVqiMjpN-PvGHL1N9DWnHSq673bfgr6phmBUzx=kLQ@mail.gmail.com

Hi Sawada-san,

I have tested the scenario shared by you on the thread [1]. And I
confirm that the latest patch [2] fixes this issue.

I confirmed that the proposed patch fixes these issues. I have one
question about the patch:

In the main loop in SnapBuildDistributeSnapshotAndInval(), we have the
following code:

/*
* If we don't have a base snapshot yet, there are no changes in this
* transaction which in turn implies we don't yet need a snapshot at
* all. We'll add a snapshot when the first change gets queued.
*
* NB: This works correctly even for subtransactions because
* ReorderBufferAssignChild() takes care to transfer the base snapshot
* to the top-level transaction, and while iterating the changequeue
* we'll get the change from the subtxn.
*/
if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, txn->xid))
continue;

Is there any case where we need to distribute inval messages to
transactions that don't have the base snapshot yet but eventually need
the inval messages?

Overall, with this idea, we distribute invalidation messages to all
concurrent decoded transactions. It could introduce performance
regressions by several causes. For example, we could end up
invalidating RelationSyncCache entries in more cases. While this is
addressed by your selectively cache invalidation patch, there is still
5% regression. We might need to accept a certain amount of regressions
for making it correct but it would be better to figure out where these
regressions come from. Other than that, I think the performance
regression could happen due to the costs of distributing invalidation
messages. You've already observed there is 1~3% performance regression
in cases where we distribute a large amount of invalidation messages
to one concurrently decoded transaction[1]. I guess that the
selectively cache invalidation idea would not help this case. Also, I
think we might want to test other cases like where we distribute a
small amount of invalidation messages to many concurrently decoded
transactions.

Hi Sawada-san,

I have done performance testing for the case where we distribute a
small number of invalidation messages to many concurrently decoded
transactions.
Here are the results:

Concurrent Txn | Head (sec) | Patch (sec) | Degradation in %
---------------------------------------------------------------
            50 | 0.2627734  | 0.2654608   | 1.022706256
           100 | 0.4801048  | 0.4869254   | 1.420648158
           500 | 2.2170336  | 2.2438656   | 1.210265825
          1000 | 4.4957402  | 4.5282574   | 0.723289126
          2000 | 9.2013082  | 9.21164     | 0.112286207

The steps I followed are:
1. Initially, logical replication is set up.
2. Then we start 'n' concurrent transactions.
Each txn looks like:
BEGIN;
INSERT INTO t1 VALUES (11);
3. Now we add two invalidations, which will be distributed to each
transaction, by running the commands:
ALTER PUBLICATION regress_pub1 DROP TABLE t1;
ALTER PUBLICATION regress_pub1 ADD TABLE t1;
4. Then we run an insert in each txn, which builds the relation cache
in each txn.
5. Commit each transaction.
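
Put together, one run of the workload looks roughly like this (just a
sketch; the exact driver is the attached 101_tests.pl, and the value
inserted in step 4 is arbitrary):

-- each of the N concurrent sessions holds a transaction open
BEGIN;
INSERT INTO t1 VALUES (11);

-- a separate session then generates the invalidations to distribute
ALTER PUBLICATION regress_pub1 DROP TABLE t1;
ALTER PUBLICATION regress_pub1 ADD TABLE t1;

-- back in each concurrent session: another insert, which rebuilds the
-- relation cache entry during decoding, and then the commit
INSERT INTO t1 VALUES (12);
COMMIT;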

I have also attached the script.

Thanks and Regards,
Shlok Kyal

Attachments:

101_tests.pl (application/octet-stream)
#78Amit Kapila
amit.kapila16@gmail.com
In reply to: Shlok Kyal (#77)
Re: long-standing data loss bug in initial sync of logical replication

On Wed, Feb 26, 2025 at 9:21 AM Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:

I have done the performance testing for cases where we distribute a
small amount of invalidation messages to many concurrently decoded
transactions.
Here are results:

Concurrent Txn | Head (sec) | Patch (sec) | Degradation in %
---------------------------------------------------------------------------------------------
50 | 0.2627734 | 0.2654608 | 1.022706256
100 | 0.4801048 | 0.4869254 | 1.420648158
500 | 2.2170336 | 2.2438656 | 1.210265825
1000 | 4.4957402 | 4.5282574 | 0.723289126
2000 | 9.2013082 | 9.21164 | 0.112286207

The steps I followed is:
1. Initially logical replication is setup.
2. Then we start 'n' number of concurrent transactions.
Each txn look like:
BEGIN;
Insert into t1 values(11);
3. Now we add two invalidation which will be distributed each
transaction by running command:
ALTER PUBLICATION regress_pub1 DROP TABLE t1
ALTER PUBLICATION regress_pub1 ADD TABLE t1
4. Then run an insert for each txn. It will build cache for relation
in each txn.
5. Commit Each transaction.

I have also attached the script.

The tests were done using a pub-sub setup, which carries some logical
replication overhead as well. Can we try this test by fetching the
changes via the SQL API, using pgoutput as the plugin, to see the impact?
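
For instance, a decoding-only measurement could look something like the
following (a rough sketch; the slot and publication names are
placeholders, and proto_version may need to match the server version):

-- create a slot that uses pgoutput directly, with no subscriber attached
SELECT pg_create_logical_replication_slot('regress_slot', 'pgoutput');

-- ... run the concurrent-transaction workload ...

-- time the decoding itself via the SQL interface; pgoutput emits binary
-- output, so the *_binary_* variant of the function is required
SELECT count(*)
FROM pg_logical_slot_get_binary_changes('regress_slot', NULL, NULL,
       'proto_version', '1', 'publication_names', 'regress_pub1');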

--
With Regards,
Amit Kapila.

#79Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Amit Kapila (#73)
1 attachment(s)
RE: long-standing data loss bug in initial sync of logical replication

On Monday, February 24, 2025 5:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Dec 11, 2024 at 12:37 PM Masahiko Sawada
<sawada.mshk@gmail.com> wrote:

I confirmed that the proposed patch fixes these issues. I have one
question about the patch:

In the main loop in SnapBuildDistributeSnapshotAndInval(), we have the
following code:

/*
* If we don't have a base snapshot yet, there are no changes in this
* transaction which in turn implies we don't yet need a snapshot at
* all. We'll add a snapshot when the first change gets queued.
*
* NB: This works correctly even for subtransactions because
* ReorderBufferAssignChild() takes care to transfer the base

snapshot

* to the top-level transaction, and while iterating the changequeue
* we'll get the change from the subtxn.
*/
if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, txn->xid))
continue;

Is there any case where we need to distribute inval messages to
transactions that don't have the base snapshot yet but eventually need
the inval messages?

Good point. It is mentioned that for snapshots: "We'll add a snapshot
when the first change gets queued.". I think we achieve this via
builder->committed.xip array such that when we set a base snapshot for
a transaction, we use that array to form a snapshot. However, I don't
see any such consideration for invalidations. Now, we could either
always add invalidations to xacts that don't have base_snapshot yet or
have a mechanism similar committed.xid array. But it is better to
first reproduce the problem.

I think distributing invalidations to a transaction that has not yet built a
base snapshot is unnecessary. This is because, during the process of building
its base snapshot, such a transaction will have already recorded the XID of the
transaction that altered the publication information into its array of
committed XIDs. Consequently, it will reflect the latest changes in the catalog
from the beginning. In the context of logical decoding, this scenario is
analogous to decoding a new transaction initiated after the catalog-change
transaction has been committed.
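
To illustrate with a hypothetical timeline (publication and table names
are made up):

-- Session 1: BEGIN;                              -- no change queued yet, so
--                                                -- no base snapshot in decoding
-- Session 2: ALTER PUBLICATION pub ADD TABLE t;  -- catalog change commits
-- Session 1: INSERT INTO t VALUES (1);           -- first change; the base
--                                                -- snapshot built here already
--                                                -- treats session 2's XID as
--                                                -- committed
-- Session 1: COMMIT;

Decoding of session 1 therefore starts out with the new catalog contents,
so no extra invalidation distribution is needed for it.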

The original issue arises because the catalog cache was constructed using an
outdated snapshot that could not reflect the latest catalog changes. However,
this is not a problem in cases without a base snapshot. Since the existing
catalog cache should have been invalidated upon decoding the committed
catalog-change transaction, the subsequent transactions will construct a new
cache with the latest snapshot.

I also considered the scenario where only a sub-transaction has a base snapshot
that has not yet been transferred to its top-level transaction. However, I
think this is not problematic because a sub-transaction transfers its snapshot
immediately upon building it (see ReorderBufferSetBaseSnapshot). The only
exception is if the sub-transaction is independent (i.e., not yet associated
with its top-level transaction). In such a case, the sub-transaction is treated
as a top-level transaction, and invalidations will be distributed to this
sub-transaction after applying the patch which is sufficient to resolve the
issue.

Considering the complexity of this topic, I think it would be better to add
some comments along the lines of the attached patch.

Best Regards,
Hou zj

Attachments:

0001-add-comments-for-txns-without-base-snapshot.patch (application/octet-stream)
From 8e26f5f06124aba8f12d89e549e47ada7aa6329e Mon Sep 17 00:00:00 2001
From: Hou Zhijie <houzj.fnst@cn.fujitsu.com>
Date: Thu, 27 Feb 2025 16:06:11 +0800
Subject: [PATCH] add comments for txns without base snapshot

---
 src/backend/replication/logical/snapbuild.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 1716340799b..6917ca27181 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -752,6 +752,15 @@ SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, Transact
 		 * transaction which in turn implies we don't yet need a snapshot at
 		 * all. We'll add a snapshot when the first change gets queued.
 		 *
+		 * Moreover, distributing invalidations to this transaction at this
+		 * stage is unnecessary. Once a base snapshot is built, it will
+		 * naturally include the xids of committed transactions that have
+		 * modified the catalog, thus reflecting the new catalog contents. The
+		 * existing catalog cache will have already been invalidated after
+		 * processing the invalidations in the transaction that modified
+		 * catalogs, ensuring that a fresh cache is constructed during
+		 * decoding.
+		 *
 		 * NB: This works correctly even for subtransactions because
 		 * ReorderBufferAssignChild() takes care to transfer the base snapshot
 		 * to the top-level transaction, and while iterating the changequeue
-- 
2.30.0.windows.2

#80Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Zhijie Hou (Fujitsu) (#79)
Re: long-standing data loss bug in initial sync of logical replication

On Thu, Feb 27, 2025 at 12:14 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Monday, February 24, 2025 5:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Dec 11, 2024 at 12:37 PM Masahiko Sawada
<sawada.mshk@gmail.com> wrote:

I confirmed that the proposed patch fixes these issues. I have one
question about the patch:

In the main loop in SnapBuildDistributeSnapshotAndInval(), we have the
following code:

/*
* If we don't have a base snapshot yet, there are no changes in this
* transaction which in turn implies we don't yet need a snapshot at
* all. We'll add a snapshot when the first change gets queued.
*
* NB: This works correctly even for subtransactions because
* ReorderBufferAssignChild() takes care to transfer the base

snapshot

* to the top-level transaction, and while iterating the changequeue
* we'll get the change from the subtxn.
*/
if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, txn->xid))
continue;

Is there any case where we need to distribute inval messages to
transactions that don't have the base snapshot yet but eventually need
the inval messages?

Good point. It is mentioned that for snapshots: "We'll add a snapshot
when the first change gets queued.". I think we achieve this via
builder->committed.xip array such that when we set a base snapshot for
a transaction, we use that array to form a snapshot. However, I don't
see any such consideration for invalidations. Now, we could either
always add invalidations to xacts that don't have base_snapshot yet or
have a mechanism similar committed.xid array. But it is better to
first reproduce the problem.

I think distributing invalidations to a transaction that has not yet built a
base snapshot is un-necessary. This is because, during the process of building
its base snapshot, such a transaction will have already recorded the XID of the
transaction that altered the publication information into its array of
committed XIDs. Consequently, it will reflect the latest changes in the catalog
from the beginning. In the context of logical decoding, this scenario is
analogous to decoding a new transaction initiated after the catalog-change
transaction has been committed.

The original issue arises because the catalog cache was constructed using an
outdated snapshot that could not reflect the latest catalog changes. However,
this is not a problem in cases without a base snapshot. Since the existing
catalog cache should have been invalidated upon decoding the committed
catalog-change transaction, the subsequent transactions will construct a new
cache with the latest snapshot.

I've also concluded it's not necessary but the reason and analysis
might be somewhat different. IIUC in the original issue (looking at
Andres's reproducer[1]), the fact that when replaying a
non-catalog-change transaction, the walsender constructed the snapshot
that doesn't reflect the catalog change is fine because the first
change of that transaction was made before the catalog change. The
problem is that the walsender process absorbed the invalidation
message when replaying the change that happened before the catalog
change, and ended up continuing to replay the subsequent changes with
that snapshot. That is why we concluded that we need to distribute the
invalidation messages to concurrently decoded transactions so that we
can invalidate the cache again at that point. As the comment
mentioned, the base snapshot is set before queuing any changes, so if
the transaction doesn't have the base snapshot yet, there must be no
queued change that happened before the catalog change. The
transactions that initiated after the catalog change don't have this
issue.

Regards,

[1]: /messages/by-id/20231118025445.crhaeeuvoe2g5dv6@awork3.anarazel.de

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#81Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#80)
Re: long-standing data loss bug in initial sync of logical replication

On Fri, Feb 28, 2025 at 6:15 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Feb 27, 2025 at 12:14 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Monday, February 24, 2025 5:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Dec 11, 2024 at 12:37 PM Masahiko Sawada
<sawada.mshk@gmail.com> wrote:

I confirmed that the proposed patch fixes these issues. I have one
question about the patch:

In the main loop in SnapBuildDistributeSnapshotAndInval(), we have the
following code:

/*
* If we don't have a base snapshot yet, there are no changes in this
* transaction which in turn implies we don't yet need a snapshot at
* all. We'll add a snapshot when the first change gets queued.
*
* NB: This works correctly even for subtransactions because
* ReorderBufferAssignChild() takes care to transfer the base

snapshot

* to the top-level transaction, and while iterating the changequeue
* we'll get the change from the subtxn.
*/
if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, txn->xid))
continue;

Is there any case where we need to distribute inval messages to
transactions that don't have the base snapshot yet but eventually need
the inval messages?

Good point. It is mentioned that for snapshots: "We'll add a snapshot
when the first change gets queued.". I think we achieve this via
builder->committed.xip array such that when we set a base snapshot for
a transaction, we use that array to form a snapshot. However, I don't
see any such consideration for invalidations. Now, we could either
always add invalidations to xacts that don't have base_snapshot yet or
have a mechanism similar committed.xid array. But it is better to
first reproduce the problem.

I think distributing invalidations to a transaction that has not yet built a
base snapshot is un-necessary. This is because, during the process of building
its base snapshot, such a transaction will have already recorded the XID of the
transaction that altered the publication information into its array of
committed XIDs. Consequently, it will reflect the latest changes in the catalog
from the beginning. In the context of logical decoding, this scenario is
analogous to decoding a new transaction initiated after the catalog-change
transaction has been committed.

The original issue arises because the catalog cache was constructed using an
outdated snapshot that could not reflect the latest catalog changes. However,
this is not a problem in cases without a base snapshot. Since the existing
catalog cache should have been invalidated upon decoding the committed
catalog-change transaction, the subsequent transactions will construct a new
cache with the latest snapshot.

I've also concluded it's not necessary but the reason and analysis
might be somewhat different. IIUC in the original issue (looking at
Andres's reproducer[1]), the fact that when replaying a
non-catalog-change transaction, the walsender constructed the snapshot
that doesn't reflect the catalog change is fine because the first
change of that transaction was made before the catalog change. The
problem is that the walsender process absorbed the invalidation
message when replaying the change that happened before the catalog
change, and ended up keeping replaying the subsequent changes with
that snapshot. That is why we concluded that we need to distribute the
invalidation messages to concurrently decoded transactions so that we
can invalidate the cache again at that point. As the comment
mentioned, the base snapshot is set before queuing any changes, so if
the transaction doesn't have the base snapshot yet, there must be no
queued change that happened before the catalog change. The
transactions that initiated after the catalog change don't have this
issue.

I think both of you are saying the same thing in slightly different
words: Hou-San's explanation goes into more detail at the code level,
while yours takes a slightly higher-level view.
Additionally, for streaming transactions for which we have already
sent one or more streams, we don't need anything special, since they
behave similarly to a transaction having a base snapshot because we
save the snapshot after sending each stream.

--
With Regards,
Amit Kapila.

#82Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#81)
1 attachment(s)
Re: long-standing data loss bug in initial sync of logical replication

On Fri, Feb 28, 2025 at 9:45 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Feb 28, 2025 at 6:15 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Feb 27, 2025 at 12:14 AM Zhijie Hou (Fujitsu)

I think distributing invalidations to a transaction that has not yet built a
base snapshot is un-necessary. This is because, during the process of building
its base snapshot, such a transaction will have already recorded the XID of the
transaction that altered the publication information into its array of
committed XIDs. Consequently, it will reflect the latest changes in the catalog
from the beginning. In the context of logical decoding, this scenario is
analogous to decoding a new transaction initiated after the catalog-change
transaction has been committed.

The original issue arises because the catalog cache was constructed using an
outdated snapshot that could not reflect the latest catalog changes. However,
this is not a problem in cases without a base snapshot. Since the existing
catalog cache should have been invalidated upon decoding the committed
catalog-change transaction, the subsequent transactions will construct a new
cache with the latest snapshot.

I've also concluded it's not necessary but the reason and analysis
might be somewhat different.

Based on the discussion on this point and Hou-San's proposed comment,
I have tried to add/edit a few comments in the 0001 patch. See if those
make sense to you; it is important to capture the reasoning and theory we
discussed here in the form of comments so that it will be easy to
recall the reason in the future.

--
With Regards,
Amit Kapila.

Attachments:

v17-0001-Distribute-invalidatons-if-change-in-catalog-tab.patch (application/octet-stream)
From 80be7b27ca4f4b0435c3b0b0de41536baabbc82b Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Fri, 23 Aug 2024 14:02:20 +0530
Subject: [PATCH v17] Distribute invalidatons if change in catalog tables

Distribute invalidations to in-progress transactions if the transaction
being committed has changed any catalog table.
---
 .../replication/logical/reorderbuffer.c       |   5 +-
 src/backend/replication/logical/snapbuild.c   |  54 +++-
 src/include/replication/reorderbuffer.h       |   4 +
 src/test/subscription/t/100_bugs.pl           | 267 ++++++++++++++++++
 4 files changed, 313 insertions(+), 17 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 5186ad2a397..72fc74544c7 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -222,9 +222,6 @@ int			debug_logical_replication_streaming = DEBUG_LOGICAL_REP_STREAMING_BUFFERED
  */
 static ReorderBufferTXN *ReorderBufferGetTXN(ReorderBuffer *rb);
 static void ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static ReorderBufferTXN *ReorderBufferTXNByXid(ReorderBuffer *rb,
-											   TransactionId xid, bool create, bool *is_new,
-											   XLogRecPtr lsn, bool create_as_top);
 static void ReorderBufferTransferSnapToParent(ReorderBufferTXN *txn,
 											  ReorderBufferTXN *subtxn);
 
@@ -630,7 +627,7 @@ ReorderBufferReturnRelids(ReorderBuffer *rb, Oid *relids)
  * (with the given LSN, and as top transaction if that's specified);
  * when this happens, is_new is set to true.
  */
-static ReorderBufferTXN *
+ReorderBufferTXN *
 ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 					  bool *is_new, XLogRecPtr lsn, bool create_as_top)
 {
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index bd0680dcbe5..6f6039f4ef2 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -161,7 +161,7 @@ static void SnapBuildFreeSnapshot(Snapshot snap);
 
 static void SnapBuildSnapIncRefcount(Snapshot snap);
 
-static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
+static void SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid);
 
 static inline bool SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid,
 												 uint32 xinfo);
@@ -720,18 +720,21 @@ SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
 }
 
 /*
- * Add a new Snapshot to all transactions we're decoding that currently are
- * in-progress so they can see new catalog contents made by the transaction
- * that just committed. This is necessary because those in-progress
- * transactions will use the new catalog's contents from here on (at the very
- * least everything they do needs to be compatible with newer catalog
- * contents).
+ * Add a new Snapshot and invalidation messages to all transactions we're
+ * decoding that currently are in-progress so they can see new catalog contents
+ * made by the transaction that just committed. This is necessary because those
+ * in-progress transactions will use the new catalog's contents from here on
+ * (at the very least everything they do needs to be compatible with newer
+ * catalog contents).
  */
 static void
-SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
+SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid)
 {
 	dlist_iter	txn_i;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *curr_txn;
+
+	curr_txn = ReorderBufferTXNByXid(builder->reorder, xid, false, NULL, InvalidXLogRecPtr, false);
 
 	/*
 	 * Iterate through all toplevel transactions. This can include
@@ -749,6 +752,14 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		 * transaction which in turn implies we don't yet need a snapshot at
 		 * all. We'll add a snapshot when the first change gets queued.
 		 *
+		 * Similarly, we don't need to add invalidations to a transaction whose
+		 * base snapshot is not yet set. Once a base snapshot is built, it will
+		 * include the xids of committed transactions that have modified the
+		 * catalog, thus reflecting the new catalog contents. The existing
+		 * catalog cache will have already been invalidated after processing
+		 * the invalidations in the transaction that modified catalogs,
+		 * ensuring that a fresh cache is constructed during decoding.
+		 *
 		 * NB: This works correctly even for subtransactions because
 		 * ReorderBufferAssignChild() takes care to transfer the base snapshot
 		 * to the top-level transaction, and while iterating the changequeue
@@ -758,13 +769,13 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 			continue;
 
 		/*
-		 * We don't need to add snapshot to prepared transactions as they
-		 * should not see the new catalog contents.
+		 * We don't need to add snapshot or invalidations to prepared
+		 * transactions as they should not see the new catalog contents.
 		 */
 		if (rbtxn_is_prepared(txn))
 			continue;
 
-		elog(DEBUG2, "adding a new snapshot to %u at %X/%X",
+		elog(DEBUG2, "adding a new snapshot and invalidations to %u at %X/%X",
 			 txn->xid, LSN_FORMAT_ARGS(lsn));
 
 		/*
@@ -774,6 +785,20 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		SnapBuildSnapIncRefcount(builder->snapshot);
 		ReorderBufferAddSnapshot(builder->reorder, txn->xid, lsn,
 								 builder->snapshot);
+
+		/*
+		 * Add invalidation messages to the reorder buffer of inprogress
+		 * transactions except the current committed transaction, for which we
+		 * will execute invalidations at the end.
+		 *
+		 * It is required, otherwise, we will end up using the stale catcache
+		 * contents built by the current transaction even after its decoding
+		 * which should have been invalidated due to concurrent catalog
+		 * changing transaction.
+		 */
+		if (txn->xid != xid && curr_txn->ninvalidations > 0)
+			ReorderBufferAddInvalidations(builder->reorder, txn->xid, lsn,
+										  curr_txn->ninvalidations, curr_txn->invalidations);
 	}
 }
 
@@ -1045,8 +1070,11 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/* refcount of the snapshot builder for the new snapshot */
 		SnapBuildSnapIncRefcount(builder->snapshot);
 
-		/* add a new catalog snapshot to all currently running transactions */
-		SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
+		/*
+		 * add a new catalog snapshot and invalidations messages to all
+		 * currently running transactions
+		 */
+		SnapBuildDistributeSnapshotAndInval(builder, lsn, xid);
 	}
 }
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 517a8e3634f..4f3ceef0092 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -759,6 +759,10 @@ extern TransactionId *ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb);
 
 extern void ReorderBufferSetRestartPoint(ReorderBuffer *rb, XLogRecPtr ptr);
 
+extern ReorderBufferTXN *ReorderBufferTXNByXid(ReorderBuffer *rb,
+											   TransactionId xid, bool create, bool *is_new,
+											   XLogRecPtr lsn, bool create_as_top);
+
 extern void StartupReorderBuffer(void);
 
 #endif
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 83120f1cb6f..d6dbeebed54 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -477,6 +477,273 @@ $result =
 is( $result, qq(2|f
 3|t), 'check replicated update on subscriber');
 
+# Clean up
+$node_publisher->safe_psql('postgres', "DROP PUBLICATION pub1");
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION sub1");
+
+# The bug was that the incremental data synchronization was being skipped when
+# a new table is added to the publication in presence of a concurrent active
+# transaction performing the DML on the same table.
+
+# Initial setup.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE SCHEMA sch3;
+	CREATE TABLE sch3.tab_conc(a int);
+	CREATE PUBLICATION regress_pub1;
+));
+
+$node_subscriber->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE SCHEMA sch3;
+	CREATE TABLE sch3.tab_conc(a int);
+	CREATE SUBSCRIPTION regress_sub1 CONNECTION '$publisher_connstr' PUBLICATION regress_pub1;
+));
+
+# Bump the query timeout to avoid false negatives on slow test systems.
+my $psql_timeout_secs = 4 * $PostgreSQL::Test::Utils::timeout_default;
+
+# Initiate 3 background sessions.
+my $background_psql1 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+$background_psql1->set_query_timer_restart();
+
+my $background_psql2 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+
+$background_psql2->set_query_timer_restart();
+
+my $background_psql3 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+$background_psql3->set_query_timer_restart();
+
+# Maintain an active transaction with the table that will be added to the
+# publication.
+$background_psql1->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO tab_conc VALUES (1);
+));
+
+# Maintain an active transaction with a schema table that will be added to the
+# publication.
+$background_psql2->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (1);
+));
+
+# Add the table to the publication using background_psql, as the alter
+# publication operation will distribute the invalidations to inprogress txns.
+$background_psql3->query_safe(
+	"ALTER PUBLICATION regress_pub1 ADD TABLE tab_conc, TABLES IN SCHEMA sch3"
+);
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe("COMMIT");
+$background_psql2->query_safe("COMMIT");
+
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (2);
+	INSERT INTO sch3.tab_conc VALUES (2);
+));
+
+# Refresh the publication.
+$node_subscriber->safe_psql('postgres',
+	"ALTER SUBSCRIPTION regress_sub1 REFRESH PUBLICATION");
+
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'regress_sub1');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2),
+	'Ensure that the data from the tab_conc table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2),
+	'Ensure that the data from the sch3.tab_conc table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+# Perform an insert.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (3);
+	INSERT INTO sch3.tab_conc VALUES (3);
+));
+$node_publisher->wait_for_catchup('regress_sub1');
+
+# Verify that the insert is replicated to the subscriber.
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table sch3.tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+# The bug was that the incremental data synchronization was happening even when
+# tables are dropped from the publication in presence of a concurrent active
+# transaction performing the DML on the same table.
+
+# Maintain an active transaction with the table that will be dropped from the
+# publication.
+$background_psql1->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO tab_conc VALUES (4);
+));
+
+# Maintain an active transaction with a schema table that will be dropped from the
+# publication.
+$background_psql2->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (4);
+));
+
+# Drop the table from the publication using background_psql, as the alter
+# publication operation will distribute the invalidations to inprogress txns.
+$background_psql3->query_safe(
+	"ALTER PUBLICATION regress_pub1 DROP TABLE tab_conc, TABLES IN SCHEMA sch3"
+);
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe("COMMIT");
+$background_psql2->query_safe("COMMIT");
+
+# Perform an insert.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (5);
+	INSERT INTO sch3.tab_conc VALUES (5);
+));
+
+$node_publisher->wait_for_catchup('regress_sub1');
+
+# Verify that the insert is not replicated to the subscriber.
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3
+4),
+	'Verify that data for table tab_conc are not replicated to subscriber');
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2
+3
+4),
+	'Verify that the incremental data for table sch3.tab_conc are not replicated to subscriber'
+);
+
+# The bug was that the incremental data synchronization was happening even after
+# publication is dropped in a concurrent active transaction.
+
+# Add tables to the publication.
+$background_psql3->query_safe(
+	"ALTER PUBLICATION regress_pub1 ADD TABLE tab_conc, TABLES IN SCHEMA sch3"
+);
+
+# Maintain an active transaction with the table.
+$background_psql1->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO tab_conc VALUES (6);
+));
+
+# Maintain an active transaction with a schema table.
+$background_psql2->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (6);
+));
+
+# Drop publication.
+$background_psql3->query_safe("DROP PUBLICATION regress_pub1");
+
+# Perform an insert.
+$background_psql1->query_safe("INSERT INTO tab_conc VALUES (7)");
+$background_psql2->query_safe("INSERT INTO sch3.tab_conc VALUES (7)");
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe("COMMIT");
+$background_psql2->query_safe("COMMIT");
+
+# ERROR should appear on subscriber.
+my $offset = -s $node_subscriber->logfile;
+$node_subscriber->wait_for_log(
+	qr/ERROR:  publication "regress_pub1" does not exist/, $offset);
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION regress_sub1");
+
+# The bug was that the incremental data synchronization was happening even after
+# publication is renamed in a concurrent active transaction.
+
+# Create publication.
+$background_psql3->query_safe(
+	"CREATE PUBLICATION regress_pub1 FOR TABLE tab_conc, TABLES IN SCHEMA sch3"
+);
+
+# Create subscription.
+$node_subscriber->safe_psql('postgres',
+	"CREATE SUBSCRIPTION regress_sub1 CONNECTION '$publisher_connstr' PUBLICATION regress_pub1"
+);
+
+# Maintain an active transaction with the table.
+$background_psql1->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO tab_conc VALUES (8);
+));
+
+# Maintain an active transaction with a schema table.
+$background_psql2->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (8);
+));
+
+# Rename publication.
+$background_psql3->query_safe(
+	"ALTER PUBLICATION regress_pub1 RENAME TO regress_pub1_rename");
+
+# Perform an insert.
+$background_psql1->query_safe("INSERT INTO tab_conc VALUES (9)");
+$background_psql2->query_safe("INSERT INTO sch3.tab_conc VALUES (9)");
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe("COMMIT");
+$background_psql2->query_safe("COMMIT");
+
+# ERROR should appear on subscriber.
+$offset = -s $node_subscriber->logfile;
+$node_subscriber->wait_for_log(
+	qr/ERROR:  publication "regress_pub1" does not exist/, $offset);
+
+$background_psql1->quit;
+$background_psql2->quit;
+$background_psql3->quit;
+
 $node_publisher->stop('fast');
 $node_subscriber->stop('fast');
 
-- 
2.28.0.windows.1

#83Benoit Lobréau
benoit.lobreau@dalibo.com
In reply to: Amit Kapila (#82)
1 attachment(s)
Re: long-standing data loss bug in initial sync of logical replication

Hi,

It took me a while, but I ran the tests on my laptop with 20 runs per
test. I have asked for a dedicated server and will re-run the tests if/when
I get it.

count of partitions | Head (sec) | Fix (sec) | Degradation (%)
----------------------------------------------------------------
               1000 | 0.0265     | 0.028     | 5.66037735849054
               5000 | 0.091      | 0.0945    | 3.84615384615385
              10000 | 0.1795     | 0.1815    | 1.11420612813371

Concurrent Txn | Head (sec) | Patch (sec) | Degradation in %
---------------------------------------------------------------
            50 | 0.1797647  | 0.1920949   |  6.85907744957
           100 | 0.3693029  | 0.3823425   |  3.53086856344
           500 | 1.62265755 | 1.91427485  | 17.97158617972
          1000 | 3.01388635 | 3.57678295  | 18.67676928162
          2000 | 7.0171877  | 6.4713304   |  8.43500897435

I'll try to run test2.pl later (right now it fails).

hope this helps.

--
Benoit Lobréau
Consultant
http://dalibo.com

Attachments:

cache_inval_bug_v1.odsapplication/vnd.oasis.opendocument.spreadsheet; name=cache_inval_bug_v1.odsDownload
#84Amit Kapila
amit.kapila16@gmail.com
In reply to: Shlok Kyal (#74)
Re: long-standing data loss bug in initial sync of logical replication

On Mon, Feb 24, 2025 at 4:49 PM Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:

Patches need a rebase. Attached the rebased patch.

I would like to discuss the 0002 patch:
publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
{
publications_valid = false;
-
- /*
- * Also invalidate per-relation cache so that next time the filtering info
- * is checked it will be updated with the new publication settings.
- */
- rel_sync_cache_publication_cb(arg, cacheid, hashvalue);
}

/*
@@ -1970,18 +1964,6 @@ init_rel_sync_cache(MemoryContext cachectx)
rel_sync_cache_publication_cb,
(Datum) 0);

- /*
- * Flush all cache entries after any publication changes. (We need no
- * callback entry for pg_publication, because publication_invalidation_cb
- * will take care of it.)
- */
- CacheRegisterSyscacheCallback(PUBLICATIONRELMAP,
- rel_sync_cache_publication_cb,
- (Datum) 0);
- CacheRegisterSyscacheCallback(PUBLICATIONNAMESPACEMAP,
- rel_sync_cache_publication_cb,
- (Datum) 0);

In the 0002 patch, we are improving performance by avoiding invalidation
processing in a number of cases. Basically, the claim is that we are
unnecessarily invalidating all the RelSyncCache entries when only a
particular relation's entry needs to be invalidated. I have not verified
it, but IIUC this should be an independent improvement atop HEAD; if so,
we should start a separate thread to discuss it.

Thoughts?

--
With Regards,
Amit Kapila.

#85Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Benoit Lobréau (#83)
RE: long-standing data loss bug in initial sync of logical replication

On Friday, February 28, 2025 4:28 PM Benoit Lobréau <benoit.lobreau@dalibo.com> wrote:

It took me a while but I ran the test on my laptop with 20 runs per test. I asked
for a dedicated server and will re-run the tests if/when I have it.

count of partitions | Head (sec) | Fix (sec) | Degradation (%)
----------------------------------------------------------------------
1000 | 0,0265 | 0,028 | 5,66037735849054
5000 | 0,091 | 0,0945 | 3,84615384615385
10000 | 0,1795 | 0,1815 | 1,11420612813371

Concurrent Txn | Head (sec) | Patch (sec) | Degradation in %
---------------------------------------------------------------------
50 | 0,1797647 | 0,1920949 | 6,85907744957
100 | 0,3693029 | 0,3823425 | 3,53086856344
500 | 1,62265755 | 1,91427485 | 17,97158617972
1000 | 3,01388635 | 3,57678295 | 18,67676928162
2000 | 7,0171877 | 6,4713304 | 8,43500897435

I'll try to run test2.pl later (right now it fails).

hope this helps.

Thank you for testing and sharing the data!

A nitpick about the data for the Concurrent Transaction (2000) case: the
results show HEAD performing worse than the patch, which seems unusual.
However, I confirmed that the details in the attachment are as expected,
so this appears to be a typo. (I assume you intended to use a decimal
point instead of a comma in values like 8,43500...)

The data suggests some regression, slightly more than in Shlok's findings,
but it is still within an acceptable range for me. Since the test script
builds a real subscription, the results might be affected by network and
replication factors. As Amit pointed out, we will share a new test script
soon that uses the SQL API xxx_get_changes() instead. It would be great if
you could verify the performance with the updated script as well.
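For reference, the kind of SQL-level measurement meant here could look
roughly like the sketch below. This is only an illustration of the
approach (slot and object names are placeholders), not the actual script
we are preparing:

-- Create the slot before running the concurrent workload so that its
-- changes are captured by the slot.
SELECT pg_create_logical_replication_slot('regress_slot', 'pgoutput');
CREATE PUBLICATION regress_pub1 FOR TABLE tab_conc1;

-- ... run the concurrent transactions here ...

-- Time only the decoding; no network or apply overhead is involved.
\timing on
SELECT count(*)
  FROM pg_logical_slot_get_binary_changes('regress_slot', NULL, NULL,
         'proto_version', '1', 'publication_names', 'regress_pub1');
\timing off

SELECT pg_drop_replication_slot('regress_slot');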

Best Regards,
Hou zj

#86Benoit Lobréau
benoit.lobreau@dalibo.com
In reply to: Zhijie Hou (Fujitsu) (#85)
Re: long-standing data loss bug in initial sync of logical replication

On 3/3/25 8:41 AM, Zhijie Hou (Fujitsu) wrote:

A nitpick about the data for the Concurrent Transaction (2000) case: the
results show HEAD performing worse than the patch, which seems unusual.
However, I confirmed that the details in the attachment are as expected,
so this appears to be a typo. (I assume you intended to use a decimal
point instead of a comma in values like 8,43500...)

Hi,

Argh, yes, sorry! I didn't pay enough attention and accidentally
inverted the Patch and Head numbers in the last line when copying them
from the ODS to the email to match the previous report layout.

The comma is due to how decimals are written in my language (comma
instead of dot). I forgot to "translate" it.

Concurrent Txn | Head (sec) | Patch (sec) | Degradation in %
---------------------------------------------------------------------
50 | 0.1797647 | 0.1920949 | 6.85907744957
100 | 0.3693029 | 0.3823425 | 3.53086856344
500 | 1.62265755 | 1.91427485 | 17.97158617972
1000 | 3.01388635 | 3.57678295 | 18.67676928162
2000 | 6.4713304 | 7.0171877 | 8.43500897435

As Amit pointed out, we will share a new test script soon that uses the
SQL API xxx_get_changes() instead. It would be great if you could verify
the performance with the updated script as well.

Will do.

--
Benoit Lobréau
Consultant
http://dalibo.com

#87Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Amit Kapila (#82)
1 attachment(s)
RE: long-standing data loss bug in initial sync of logical replication

Dear hackers,

I found that the patch needed a rebase due to ac4494; PSA the new version.
It applies atop HEAD and the tests pass.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

v18-0001-Distribute-invalidatons-if-change-in-catalog-tab.patchapplication/octet-stream; name=v18-0001-Distribute-invalidatons-if-change-in-catalog-tab.patchDownload
From e34c0ceebdd5ec2339450ba2c03d3818f2cf419b Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Fri, 23 Aug 2024 14:02:20 +0530
Subject: [PATCH v18] Distribute invalidatons if change in catalog tables

Distribute invalidations to in-progress transactions if the transaction
being committed has changed any catalog table.
---
 .../replication/logical/reorderbuffer.c       |   5 +-
 src/backend/replication/logical/snapbuild.c   |  54 +++-
 src/include/replication/reorderbuffer.h       |   5 +
 src/test/subscription/t/100_bugs.pl           | 267 ++++++++++++++++++
 4 files changed, 314 insertions(+), 17 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 977fbcd247..1172afb3c5 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -222,9 +222,6 @@ int			debug_logical_replication_streaming = DEBUG_LOGICAL_REP_STREAMING_BUFFERED
  */
 static ReorderBufferTXN *ReorderBufferAllocTXN(ReorderBuffer *rb);
 static void ReorderBufferFreeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static ReorderBufferTXN *ReorderBufferTXNByXid(ReorderBuffer *rb,
-											   TransactionId xid, bool create, bool *is_new,
-											   XLogRecPtr lsn, bool create_as_top);
 static void ReorderBufferTransferSnapToParent(ReorderBufferTXN *txn,
 											  ReorderBufferTXN *subtxn);
 
@@ -630,7 +627,7 @@ ReorderBufferFreeRelids(ReorderBuffer *rb, Oid *relids)
  * (with the given LSN, and as top transaction if that's specified);
  * when this happens, is_new is set to true.
  */
-static ReorderBufferTXN *
+ReorderBufferTXN *
 ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 					  bool *is_new, XLogRecPtr lsn, bool create_as_top)
 {
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index b64e53de01..16acb50614 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -161,7 +161,7 @@ static void SnapBuildFreeSnapshot(Snapshot snap);
 
 static void SnapBuildSnapIncRefcount(Snapshot snap);
 
-static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
+static void SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid);
 
 static inline bool SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid,
 												 uint32 xinfo);
@@ -720,18 +720,21 @@ SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
 }
 
 /*
- * Add a new Snapshot to all transactions we're decoding that currently are
- * in-progress so they can see new catalog contents made by the transaction
- * that just committed. This is necessary because those in-progress
- * transactions will use the new catalog's contents from here on (at the very
- * least everything they do needs to be compatible with newer catalog
- * contents).
+ * Add a new Snapshot and invalidation messages to all transactions we're
+ * decoding that currently are in-progress so they can see new catalog contents
+ * made by the transaction that just committed. This is necessary because those
+ * in-progress transactions will use the new catalog's contents from here on
+ * (at the very least everything they do needs to be compatible with newer
+ * catalog contents).
  */
 static void
-SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
+SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid)
 {
 	dlist_iter	txn_i;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *curr_txn;
+
+	curr_txn = ReorderBufferTXNByXid(builder->reorder, xid, false, NULL, InvalidXLogRecPtr, false);
 
 	/*
 	 * Iterate through all toplevel transactions. This can include
@@ -749,6 +752,14 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		 * transaction which in turn implies we don't yet need a snapshot at
 		 * all. We'll add a snapshot when the first change gets queued.
 		 *
+		 * Similarly, we don't need to add invalidations to a transaction whose
+		 * base snapshot is not yet set. Once a base snapshot is built, it will
+		 * include the xids of committed transactions that have modified the
+		 * catalog, thus reflecting the new catalog contents. The existing
+		 * catalog cache will have already been invalidated after processing
+		 * the invalidations in the transaction that modified catalogs,
+		 * ensuring that a fresh cache is constructed during decoding.
+		 *
 		 * NB: This works correctly even for subtransactions because
 		 * ReorderBufferAssignChild() takes care to transfer the base snapshot
 		 * to the top-level transaction, and while iterating the changequeue
@@ -758,13 +769,13 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 			continue;
 
 		/*
-		 * We don't need to add snapshot to prepared transactions as they
-		 * should not see the new catalog contents.
+		 * We don't need to add snapshot or invalidations to prepared
+		 * transactions as they should not see the new catalog contents.
 		 */
 		if (rbtxn_is_prepared(txn))
 			continue;
 
-		elog(DEBUG2, "adding a new snapshot to %u at %X/%X",
+		elog(DEBUG2, "adding a new snapshot and invalidations to %u at %X/%X",
 			 txn->xid, LSN_FORMAT_ARGS(lsn));
 
 		/*
@@ -774,6 +785,20 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		SnapBuildSnapIncRefcount(builder->snapshot);
 		ReorderBufferAddSnapshot(builder->reorder, txn->xid, lsn,
 								 builder->snapshot);
+
+		/*
+		 * Add invalidation messages to the reorder buffer of inprogress
+		 * transactions except the current committed transaction, for which we
+		 * will execute invalidations at the end.
+		 *
+		 * It is required, otherwise, we will end up using the stale catcache
+		 * contents built by the current transaction even after its decoding
+		 * which should have been invalidated due to concurrent catalog
+		 * changing transaction.
+		 */
+		if (txn->xid != xid && curr_txn->ninvalidations > 0)
+			ReorderBufferAddInvalidations(builder->reorder, txn->xid, lsn,
+										  curr_txn->ninvalidations, curr_txn->invalidations);
 	}
 }
 
@@ -1045,8 +1070,11 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/* refcount of the snapshot builder for the new snapshot */
 		SnapBuildSnapIncRefcount(builder->snapshot);
 
-		/* add a new catalog snapshot to all currently running transactions */
-		SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
+		/*
+		 * add a new catalog snapshot and invalidations messages to all
+		 * currently running transactions
+		 */
+		SnapBuildDistributeSnapshotAndInval(builder, lsn, xid);
 	}
 }
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 3be0cbd7eb..481d547407 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -758,6 +758,11 @@ extern TransactionId *ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb);
 
 extern void ReorderBufferSetRestartPoint(ReorderBuffer *rb, XLogRecPtr ptr);
 
+extern ReorderBufferTXN *ReorderBufferTXNByXid(ReorderBuffer *rb,
+											   TransactionId xid, bool create,
+											   bool *is_new, XLogRecPtr lsn,
+											   bool create_as_top);
+
 extern void StartupReorderBuffer(void);
 
 #endif
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 83120f1cb6..d6dbeebed5 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -477,6 +477,273 @@ $result =
 is( $result, qq(2|f
 3|t), 'check replicated update on subscriber');
 
+# Clean up
+$node_publisher->safe_psql('postgres', "DROP PUBLICATION pub1");
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION sub1");
+
+# The bug was that the incremental data synchronization was being skipped when
+# a new table is added to the publication in presence of a concurrent active
+# transaction performing the DML on the same table.
+
+# Initial setup.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE SCHEMA sch3;
+	CREATE TABLE sch3.tab_conc(a int);
+	CREATE PUBLICATION regress_pub1;
+));
+
+$node_subscriber->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE SCHEMA sch3;
+	CREATE TABLE sch3.tab_conc(a int);
+	CREATE SUBSCRIPTION regress_sub1 CONNECTION '$publisher_connstr' PUBLICATION regress_pub1;
+));
+
+# Bump the query timeout to avoid false negatives on slow test systems.
+my $psql_timeout_secs = 4 * $PostgreSQL::Test::Utils::timeout_default;
+
+# Initiate 3 background sessions.
+my $background_psql1 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+$background_psql1->set_query_timer_restart();
+
+my $background_psql2 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+
+$background_psql2->set_query_timer_restart();
+
+my $background_psql3 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+$background_psql3->set_query_timer_restart();
+
+# Maintain an active transaction with the table that will be added to the
+# publication.
+$background_psql1->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO tab_conc VALUES (1);
+));
+
+# Maintain an active transaction with a schema table that will be added to the
+# publication.
+$background_psql2->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (1);
+));
+
+# Add the table to the publication using background_psql, as the alter
+# publication operation will distribute the invalidations to inprogress txns.
+$background_psql3->query_safe(
+	"ALTER PUBLICATION regress_pub1 ADD TABLE tab_conc, TABLES IN SCHEMA sch3"
+);
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe("COMMIT");
+$background_psql2->query_safe("COMMIT");
+
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (2);
+	INSERT INTO sch3.tab_conc VALUES (2);
+));
+
+# Refresh the publication.
+$node_subscriber->safe_psql('postgres',
+	"ALTER SUBSCRIPTION regress_sub1 REFRESH PUBLICATION");
+
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'regress_sub1');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2),
+	'Ensure that the data from the tab_conc table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2),
+	'Ensure that the data from the sch3.tab_conc table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+# Perform an insert.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (3);
+	INSERT INTO sch3.tab_conc VALUES (3);
+));
+$node_publisher->wait_for_catchup('regress_sub1');
+
+# Verify that the insert is replicated to the subscriber.
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table sch3.tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+# The bug was that the incremental data synchronization was happening even when
+# tables are dropped from the publication in presence of a concurrent active
+# transaction performing the DML on the same table.
+
+# Maintain an active transaction with the table that will be dropped from the
+# publication.
+$background_psql1->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO tab_conc VALUES (4);
+));
+
+# Maintain an active transaction with a schema table that will be dropped from the
+# publication.
+$background_psql2->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (4);
+));
+
+# Drop the table from the publication using background_psql, as the alter
+# publication operation will distribute the invalidations to inprogress txns.
+$background_psql3->query_safe(
+	"ALTER PUBLICATION regress_pub1 DROP TABLE tab_conc, TABLES IN SCHEMA sch3"
+);
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe("COMMIT");
+$background_psql2->query_safe("COMMIT");
+
+# Perform an insert.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (5);
+	INSERT INTO sch3.tab_conc VALUES (5);
+));
+
+$node_publisher->wait_for_catchup('regress_sub1');
+
+# Verify that the insert is not replicated to the subscriber.
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3
+4),
+	'Verify that data for table tab_conc are not replicated to subscriber');
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2
+3
+4),
+	'Verify that the incremental data for table sch3.tab_conc are not replicated to subscriber'
+);
+
+# The bug was that the incremental data synchronization was happening even after
+# publication is dropped in a concurrent active transaction.
+
+# Add tables to the publication.
+$background_psql3->query_safe(
+	"ALTER PUBLICATION regress_pub1 ADD TABLE tab_conc, TABLES IN SCHEMA sch3"
+);
+
+# Maintain an active transaction with the table.
+$background_psql1->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO tab_conc VALUES (6);
+));
+
+# Maintain an active transaction with a schema table.
+$background_psql2->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (6);
+));
+
+# Drop publication.
+$background_psql3->query_safe("DROP PUBLICATION regress_pub1");
+
+# Perform an insert.
+$background_psql1->query_safe("INSERT INTO tab_conc VALUES (7)");
+$background_psql2->query_safe("INSERT INTO sch3.tab_conc VALUES (7)");
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe("COMMIT");
+$background_psql2->query_safe("COMMIT");
+
+# ERROR should appear on subscriber.
+my $offset = -s $node_subscriber->logfile;
+$node_subscriber->wait_for_log(
+	qr/ERROR:  publication "regress_pub1" does not exist/, $offset);
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION regress_sub1");
+
+# The bug was that the incremental data synchronization was happening even after
+# publication is renamed in a concurrent active transaction.
+
+# Create publication.
+$background_psql3->query_safe(
+	"CREATE PUBLICATION regress_pub1 FOR TABLE tab_conc, TABLES IN SCHEMA sch3"
+);
+
+# Create subscription.
+$node_subscriber->safe_psql('postgres',
+	"CREATE SUBSCRIPTION regress_sub1 CONNECTION '$publisher_connstr' PUBLICATION regress_pub1"
+);
+
+# Maintain an active transaction with the table.
+$background_psql1->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO tab_conc VALUES (8);
+));
+
+# Maintain an active transaction with a schema table.
+$background_psql2->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (8);
+));
+
+# Rename publication.
+$background_psql3->query_safe(
+	"ALTER PUBLICATION regress_pub1 RENAME TO regress_pub1_rename");
+
+# Perform an insert.
+$background_psql1->query_safe("INSERT INTO tab_conc VALUES (9)");
+$background_psql2->query_safe("INSERT INTO sch3.tab_conc VALUES (9)");
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe("COMMIT");
+$background_psql2->query_safe("COMMIT");
+
+# ERROR should appear on subscriber.
+$offset = -s $node_subscriber->logfile;
+$node_subscriber->wait_for_log(
+	qr/ERROR:  publication "regress_pub1" does not exist/, $offset);
+
+$background_psql1->quit;
+$background_psql2->quit;
+$background_psql3->quit;
+
 $node_publisher->stop('fast');
 $node_subscriber->stop('fast');
 
-- 
2.43.5

#88Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Hayato Kuroda (Fujitsu) (#87)
5 attachment(s)
RE: long-standing data loss bug in initial sync of logical replication

Hi Hackers,

Our team (mainly Shlok) did performance testing with several workloads.
Let me share the results on -hackers. We ran the tests on both the master
and REL_17 branches; this post covers the results for master.

The observed trend is that the performance regression exists primarily
during frequent execution of publication DDL statements that modify
published tables. This is expected, given the cache rebuild and
invalidation-distribution overhead involved. The regression is small or
nearly nonexistent when the DDLs do not affect published tables or when
such DDL statements are infrequent.

Used source
========
The base code was HEAD plus some modifications that selectively invalidate
relsync cache entries; these have since been pushed as commit 3abe9d. The
compared patch was v16.

We ran five benchmarks; let me go through them one by one.

-----

Workload A: No DDL operation done in concurrent session
======================================
In this workload, the number of concurrent transactions was varied, but
none of them contained DDL commands. The decoding time of all transactions
was measured and compared. We expected no performance change, because none
of the caches would be invalidated. The actual workload is noted in [1]
and the runner script is attached.

The table below contains the results. We could not find any notable
degradation.

Concurrent txn | Head (sec) | Patch (sec) | Degradation (%)
------------------ | ------------ | ------------ | ----------------
50 | 0.013196 | 0.013314 | 0.8968
100 | 0.014531 | 0.014728 | 1.3558
500 | 0.018079 | 0.017969 | -0.6066
1000 | 0.023087 | 0.023772 | 2.9670
2000 | 0.031311 | 0.031750 | 1.4010

-----

Workload B: DDL is happening but is unrelated to publication
=======================================
In this workload, one of the concurrent transactions contained a DDL, but
it was unrelated to the publication and the published tables. We again
expected no performance change. The actual workload is noted in [2] and
the runner script is attached.

The table below contains the results. The proposed patch distributes
invalidation messages to the in-progress transactions being decoded, so
the overhead should be roughly proportional to the concurrency, and that
is what we observed. Since these invalidation messages do not invalidate
relsync cache entries, the difference is not large.

Concurrent txn | Head (sec) | Patch (sec) | Degradation (%)
------------------ | ------------ | ------------ | ----------------
50 | 0.013410 | 0.013217 | -1.4417
100 | 0.014694 | 0.015260 | 3.8496
1000 | 0.023211 | 0.025376 | 9.3289
2000 | 0.032954 | 0.036322 | 10.2213

-----

Workload C. DDL is happening on publication but on unrelated table
===========================================
In this workload, one of the concurrent transactions contained a DDL that
altered the publication in use, but it only added/dropped a table that was
not being decoded. The actual workload is noted in [3] and the runner
script is attached.

The table below contains the results. Since commit 3abe9dc, there is no
longer any need to rebuild the whole relsync cache for unrelated publish
actions, so the degradation was mostly the same as in B.

Concurrent txn | Head (sec) | Patch (sec) | Degradation (%)
------------------ | ------------ | ------------ | ----------------
50 | 0.013546 | 0.013409 | -1.0089
100 | 0.015225 | 0.015357 | 0.8648
500 | 0.017848 | 0.019300 | 8.1372
1000 | 0.023430 | 0.025152 | 7.3497
2000 | 0.032041 | 0.035877 | 11.9723

-----

Workload D. DDL is happening on the related published table,
and one insert is done per invalidation
=========================================
In this workload, one of the concurrent transactions contained a DDL that
altered the publication in use, and it dropped/re-added the very table
that was being decoded. The actual workload is noted in [4] and the runner
script is attached.

The table below contains the results. Unlike B and C, we expected this
workload to show a large degradation, because each distributed
invalidation message forces a rebuild of the relsync cache entries, i.e.
the caches are discarded and rebuilt for every transaction. The results
show a regression of around 300% at 2000 concurrent transactions.

IIUC, it is difficult to avoid this regression with the current design.

Concurrent txn | Head (sec) | Patch (sec) | Degradation (%)
------------------ | ------------ | ------------ | ----------------
50 | 0.013944 | 0.016460 | 18.0384
100 | 0.014952 | 0.020160 | 34.8322
500 | 0.018535 | 0.043122 | 132.6577
1000 | 0.023426 | 0.072215 | 208.2628
2000 | 0.032055 | 0.131884 | 311.4314
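To make the cache-thrashing pattern concrete, the per-iteration DDL in
this workload is essentially the pair of commands below (names as in [4]).
Each pair distributes invalidations to every in-progress transaction being
decoded, which then has to rebuild its relsync cache entry for the table:

BEGIN;
ALTER PUBLICATION regress_pub1 DROP TABLE tab_conc1;
COMMIT;

BEGIN;
ALTER PUBLICATION regress_pub1 ADD TABLE tab_conc1;
COMMIT;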

-----

Workload E. DDL is happening on the related published table,
and 1000 inserts are done per invalidation
===========================================
This workload was mostly the same as D, but the number of inserted tuples
was 1000x larger. We expected cache rebuilding to be much less dominant in
this workload, so the regression should be small. The actual workload is
noted in [5] and the runner script is attached.

The table below contains the results. Unlike D, there was no huge
regression. This is a reasonable result, because decoding 1000 insertions
per transaction takes up most of the CPU time.

Concurrent txn | Head (sec) | Patch (sec) | Degradation (%)
------------------ | ------------ | ------------ | ----------------
50 | 0.093019 | 0.108820 | 16.9869
100 | 0.188367 | 0.199621 | 5.9741
500 | 0.967896 | 0.970674 | 0.2870
1000 | 1.658552 | 1.803991 | 8.7691
2000 | 3.482935 | 3.682771 | 5.7376
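For scale, the 1000-row insert that each session performs per transaction
can be issued as a single statement, e.g. via generate_series (an
assumption about how the attached runner does it, not a quote from it):

BEGIN;
INSERT INTO tab_conc1 SELECT generate_series(1, 1000);
COMMIT;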

Thanks again to Shlok for measuring the data.

[1]:
1. Created a publisher on a single table, say 'tab_conc1';
2. 'n +1' sessions are running in parallel
3. Now:
All 'n' sessions :
BEGIN;
Insert a row in table 'tab_conc1';
In a session :
Insert a row in table 'tab_conc1';
Insert a row in table 'tab_conc1'
All 'n' sessions :
Insert a row in table 'tab_conc1';
COMMIT;
4. run 'pg_logical_slot_get_binary_changes' to get the decoding changes.

[2]:
1. Created a publisher on a single table, say 'tab_conc1';
2. 'n +1' sessions are running in parallel
3. Now:
All 'n' sessions :
BEGIN;
Insert a row in table 'tab_conc1'
In a session :
BEGIN; ALTER TABLE t1 ADD COLUMN b int; COMMIT;
BEGIN; ALTER TABLE t1 DROP COLUMN b; COMMIT;
All 'n' sessions :
Insert a row in table 'tab_conc1';
COMMIT;
4. run 'pg_logical_slot_get_binary_changes' to get the decoding changes.

[3]:
Steps:
1. Created a publisher on a table, say 'tab_conc1', 't1';
2. 'n +1' sessions are running in parallel
3. Now:
All 'n' sessions :
BEGIN;
Insert a row in table 'tab_conc1'
In a session :
BEGIN; ALTER PUBLICATION regress_pub1 DROP TABLE t1; COMMIT;
BEGIN; ALTER PUBLICATION regress_pub1 ADD TABLE t1; COMMIT;
All 'n' sessions :
Insert a row in table 'tab_conc1';
COMMIT;
4. run 'pg_logical_slot_get_binary_changes' to get the decoding changes.

[4]:
1. Created a publisher on a single table, say 'tab_conc1';
2. 'n + 1' sessions are running in parallel
3. Now:
All 'n' sessions :
BEGIN;
Insert a row in table 'tab_conc1'
In a session :
BEGIN; Alter publication DROP 'tab_conc1'; COMMIT;
BEGIN; Alter publication ADD 'tab_conc1'; COMMIT;
All 'n' sessions :
Insert a row in table 'tab_conc1';
COMMIT;
4. run 'pg_logical_slot_get_binary_changes' to get the decoding changes.

[5]:
1. Created a publisher on a single table, say 'tab_conc1';
2. 'n +1' sessions are running in parallel
3. Now:
All 'n' sessions :
BEGIN;
Insert 1000 rows in table 'tab_conc1'
In a session :
BEGIN; ALTER PUBLICATION regress_pub1 DROP 'tab_conc1'; COMMIT;
BEGIN; ALTER PUBLICATION regress_pub1 ADD 'tab_conc1'; COMMIT;
All 'n' sessions :
Insert 1000 rows in table 'tab_conc1';
COMMIT;
4. run 'pg_logical_slot_get_binary_changes' to get the decoding changes.
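
For reference, here is a minimal psql sketch of how step 4 of the workloads
above can be timed; the slot name and the pgoutput options are assumptions, and
the actual runner scripts (A.pl - E.pl) are attached:

    -- create a logical slot using the built-in pgoutput plugin (assumed name)
    SELECT pg_create_logical_replication_slot('regress_slot', 'pgoutput');

    -- after the concurrent sessions have committed, time the decoding step
    \timing on
    SELECT count(*)
      FROM pg_logical_slot_get_binary_changes('regress_slot', NULL, NULL,
           'proto_version', '1', 'publication_names', 'regress_pub1');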

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

A.pl.txt
B.pl.txt
C.pl.txt
D.pl.txt
E.pl.txt
#89Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Hayato Kuroda (Fujitsu) (#88)
RE: long-standing data loss bug in initial sync of logical replication

Hi hackers,

Our team (mainly Shlok) did performance testing with several workloads; let me
share the results on -hackers. We ran the tests on both the master and
REL_17_STABLE branches.

I posted the benchmark results for master in [1]. This post contains the results
for the back branch, specifically REL_17_STABLE.

The observed trend is the same as on master:
frequent DDL on published tables can cause a huge regression, but this is
expected. For the other cases the regression is small or nonexistent.

Used source
===========
The base code was the HEAD of REL_17_STABLE, and the compared patch was v16.
The major difference is that master tries to preserve relsync cache entries as
much as possible, whereas REL_17_STABLE discards them more aggressively; please
refer to the recent commits 3abe9d and 588acf6.

The executed workloads were mostly the same as in the master case.

-----

Workload A: No DDL operation done in concurrent session
======================================
No regression was observed for this workload.

Concurrent txn | Head (sec) | Patch (sec) | Degradation (%)
------------------ | ------------ | ------------ | ----------------
50 | 0.013706 | 0.013398 | -2.2496
100 | 0.014811 | 0.014821 | 0.0698
500 | 0.018288 | 0.018318 | 0.1640
1000 | 0.022613 | 0.022622 | 0.0413
2000 | 0.031812 | 0.031891 | 0.2504

-----

Workload B: DDL is happening but is unrelated to publication
========================================
A small regression was observed at high concurrency, because the DDL transaction
sends invalidation messages to all the concurrent transactions.

Concurrent txn | Head (sec) | Patch (sec) | Degradation (%)
------------------ | ------------ | ------------ | ----------------
50 | 0.013159 | 0.013305 | 1.1120
100 | 0.014718 | 0.014725 | 0.0476
500 | 0.018134 | 0.019578 | 7.9628
1000 | 0.022762 | 0.025228 | 10.8324
2000 | 0.032326 | 0.035638 | 10.2467

-----

Workload C. DDL is happening on publication but on unrelated table
============================================
We did not run this workload because we expected the same results as D; commit
588acf6 would be needed to optimize it.

-----

Workload D. DDL is happening on the related published table,
and one insert is done per invalidation
=========================================
This workload showed a huge regression, the same as on the master branch. This
is expected because the distributed invalidation messages require all the
concurrent transactions to rebuild their relsync caches.

Concurrent txn | Head (sec) | Patch (sec) | Degradation (%)
------------------ | ------------ | ------------ | ----------------
50 | 0.013496 | 0.015588 | 15.5034
100 | 0.015112 | 0.018868 | 24.8517
500 | 0.018483 | 0.038714 | 109.4536
1000 | 0.023402 | 0.063735 | 172.3524
2000 | 0.031596 | 0.110860 | 250.8720
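
At 2000 concurrent transactions this works out to roughly a 251% regression,
compared to about 311% for the same workload on master, which is consistent
with the observation above that the trend matches master's.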

-----

Workload E. DDL is happening on the related published table,
and 1000 inserts are done per invalidation
============================================
The regression seen in D is not observed here. This matches the master case and
is expected because decoding 1000 tuples dominates the runtime.

Concurrent txn | Head (sec) | Patch (sec) | Degradation (%)
------------------ | ------------ | ------------ | ----------------
50 | 0.093019 | 0.108820 | 16.9869
100 | 0.188367 | 0.199621 | 5.9741
500 | 0.967896 | 0.970674 | 0.2870
1000 | 1.658552 | 1.803991 | 8.7691
2000 | 3.482935 | 3.682771 | 5.7376

[1]: /messages/by-id/OSCPR01MB149661EA973D65EBEC2B60D98F5D32@OSCPR01MB14966.jpnprd01.prod.outlook.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED

#90Amit Kapila
amit.kapila16@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#89)
Re: long-standing data loss bug in initial sync of logical replication

On Thu, Mar 13, 2025 at 2:12 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Workload C. DDL is happening on publication but on unrelated table
============================================
We did not run this workload because we expected the same results as D; commit
588acf6 would be needed to optimize it.

-----

Workload D. DDL is happening on the related published table,
and one insert is done per invalidation
=========================================
This workload showed a huge regression, the same as on the master branch. This
is expected because the distributed invalidation messages require all the
concurrent transactions to rebuild their relsync caches.

Concurrent txn | Head (sec) | Patch (sec) | Degradation (%)
------------------ | ------------ | ------------ | ----------------
50 | 0.013496 | 0.015588 | 15.5034
100 | 0.015112 | 0.018868 | 24.8517
500 | 0.018483 | 0.038714 | 109.4536
1000 | 0.023402 | 0.063735 | 172.3524
2000 | 0.031596 | 0.110860 | 250.8720

IIUC, workloads C and D will have regression in back branches, and
HEAD will have regression only for workload D. We have avoided
workload C regression in HEAD via commits 7c99dc587a and 3abe9dc188.
We can backpatch those commits if required, but I think it is better
not to do those as scenarios C and D won't be that common, and we
should go ahead with the fix as it is. In the future, if we get any
way to avoid regression due to scenario-D, then we can do that for the
HEAD branch.

Thoughts?

--
With Regards,
Amit Kapila.

#91Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Amit Kapila (#90)
5 attachment(s)
RE: long-standing data loss bug in initial sync of logical replication

Dear Amit,

IIUC, workloads C and D will have regression in back branches, and
HEAD will have regression only for workload D. We have avoided
workload C regression in HEAD via commits 7c99dc587a and 3abe9dc188.

Right.

We can backpatch those commits if required, but I think it is better
not to do those as scenarios C and D won't be that common, and we
should go ahead with the fix as it is. In the future, if we get any
way to avoid regression due to scenario-D, then we can do that for the
HEAD branch.

OK, let me share patches for the back branches. Mostly the same fix patch as on
master can be used for PG14-PG17, as attached. Regarding PG13, it cannot be
applied as-is, so some adjustments are needed; I will share it in an upcoming
post.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

v18_REL_14-0001-Distribute-invalidatons-if-change-in-cata.patch
From 5659f9007ffdfc4daca5f9659be6c546ef4d7c5c Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Fri, 23 Aug 2024 14:02:20 +0530
Subject: [PATCH v18_REL_14] Distribute invalidatons if change in catalog
 tables

Distribute invalidations to inprogress transactions if the current
committed transaction change any catalog table.
---
 .../replication/logical/reorderbuffer.c       |   5 +-
 src/backend/replication/logical/snapbuild.c   |  54 +++--
 src/include/replication/reorderbuffer.h       |   4 +
 src/test/subscription/t/100_bugs.pl           | 201 +++++++++++++++++-
 4 files changed, 246 insertions(+), 18 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 64d9baa7982..2456f5bd98c 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -215,9 +215,6 @@ static const Size max_changes_in_memory = 4096; /* XXX for restore only */
  */
 static ReorderBufferTXN *ReorderBufferGetTXN(ReorderBuffer *rb);
 static void ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static ReorderBufferTXN *ReorderBufferTXNByXid(ReorderBuffer *rb,
-											   TransactionId xid, bool create, bool *is_new,
-											   XLogRecPtr lsn, bool create_as_top);
 static void ReorderBufferTransferSnapToParent(ReorderBufferTXN *txn,
 											  ReorderBufferTXN *subtxn);
 
@@ -611,7 +608,7 @@ ReorderBufferReturnRelids(ReorderBuffer *rb, Oid *relids)
  * (with the given LSN, and as top transaction if that's specified);
  * when this happens, is_new is set to true.
  */
-static ReorderBufferTXN *
+ReorderBufferTXN *
 ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 					  bool *is_new, XLogRecPtr lsn, bool create_as_top)
 {
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 9d8700147ca..d72b67777b0 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -290,7 +290,7 @@ static void SnapBuildFreeSnapshot(Snapshot snap);
 
 static void SnapBuildSnapIncRefcount(Snapshot snap);
 
-static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
+static void SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid);
 
 /* xlog reading helper functions for SnapBuildProcessRunningXacts */
 static bool SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *running);
@@ -843,18 +843,21 @@ SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
 }
 
 /*
- * Add a new Snapshot to all transactions we're decoding that currently are
- * in-progress so they can see new catalog contents made by the transaction
- * that just committed. This is necessary because those in-progress
- * transactions will use the new catalog's contents from here on (at the very
- * least everything they do needs to be compatible with newer catalog
- * contents).
+ * Add a new Snapshot and invalidation messages to all transactions we're
+ * decoding that currently are in-progress so they can see new catalog contents
+ * made by the transaction that just committed. This is necessary because those
+ * in-progress transactions will use the new catalog's contents from here on
+ * (at the very least everything they do needs to be compatible with newer
+ * catalog contents).
  */
 static void
-SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
+SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid)
 {
 	dlist_iter	txn_i;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *curr_txn;
+
+	curr_txn = ReorderBufferTXNByXid(builder->reorder, xid, false, NULL, InvalidXLogRecPtr, false);
 
 	/*
 	 * Iterate through all toplevel transactions. This can include
@@ -872,6 +875,14 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		 * transaction which in turn implies we don't yet need a snapshot at
 		 * all. We'll add a snapshot when the first change gets queued.
 		 *
+		 * Similarly, we don't need to add invalidations to a transaction whose
+		 * base snapshot is not yet set. Once a base snapshot is built, it will
+		 * include the xids of committed transactions that have modified the
+		 * catalog, thus reflecting the new catalog contents. The existing
+		 * catalog cache will have already been invalidated after processing
+		 * the invalidations in the transaction that modified catalogs,
+		 * ensuring that a fresh cache is constructed during decoding.
+		 *
 		 * NB: This works correctly even for subtransactions because
 		 * ReorderBufferAssignChild() takes care to transfer the base snapshot
 		 * to the top-level transaction, and while iterating the changequeue
@@ -881,13 +892,13 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 			continue;
 
 		/*
-		 * We don't need to add snapshot to prepared transactions as they
-		 * should not see the new catalog contents.
+		 * We don't need to add snapshot or invalidations to prepared
+		 * transactions as they should not see the new catalog contents.
 		 */
 		if (rbtxn_prepared(txn) || rbtxn_skip_prepared(txn))
 			continue;
 
-		elog(DEBUG2, "adding a new snapshot to %u at %X/%X",
+		elog(DEBUG2, "adding a new snapshot and invalidations to %u at %X/%X",
 			 txn->xid, LSN_FORMAT_ARGS(lsn));
 
 		/*
@@ -897,6 +908,20 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		SnapBuildSnapIncRefcount(builder->snapshot);
 		ReorderBufferAddSnapshot(builder->reorder, txn->xid, lsn,
 								 builder->snapshot);
+
+		/*
+		 * Add invalidation messages to the reorder buffer of inprogress
+		 * transactions except the current committed transaction, for which we
+		 * will execute invalidations at the end.
+		 *
+		 * It is required, otherwise, we will end up using the stale catcache
+		 * contents built by the current transaction even after its decoding
+		 * which should have been invalidated due to concurrent catalog
+		 * changing transaction.
+		 */
+		if (txn->xid != xid && curr_txn->ninvalidations > 0)
+			ReorderBufferAddInvalidations(builder->reorder, txn->xid, lsn,
+										  curr_txn->ninvalidations, curr_txn->invalidations);
 	}
 }
 
@@ -1175,8 +1200,11 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/* refcount of the snapshot builder for the new snapshot */
 		SnapBuildSnapIncRefcount(builder->snapshot);
 
-		/* add a new catalog snapshot to all currently running transactions */
-		SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
+		/*
+		 * add a new catalog snapshot and invalidations messages to all
+		 * currently running transactions
+		 */
+		SnapBuildDistributeSnapshotAndInval(builder, lsn, xid);
 	}
 }
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index ba257d81b51..5c8833fe099 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -676,6 +676,10 @@ TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
 void		ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
 
+ReorderBufferTXN *ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid,
+										bool create, bool *is_new,
+										XLogRecPtr lsn, bool create_as_top);
+
 void		StartupReorderBuffer(void);
 
 #endif
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index cce91891ab9..3d99d937c9c 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -6,7 +6,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 9;
+use Test::More tests => 12;
 
 # Bug #15114
 
@@ -299,6 +299,205 @@ is( $node_subscriber->safe_psql(
 $node_publisher->stop('fast');
 $node_subscriber->stop('fast');
 
+# The bug was that the incremental data synchronization was being skipped when
+# a new table is added to the publication in presence of a concurrent active
+# transaction performing the DML on the same table.
+$node_publisher = get_new_node('node_publisher_invalidation');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->start;
+
+$node_subscriber = get_new_node('node_subscriber_invalidation');
+$node_subscriber->init();
+$node_subscriber->start;
+
+# Initial setup.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE PUBLICATION regress_pub1;
+));
+
+$publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_subscriber->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE SUBSCRIPTION regress_sub1 CONNECTION '$publisher_connstr' PUBLICATION regress_pub1;
+));
+
+# Bump the query timeout to avoid false negatives on slow test systems.
+my $psql_timeout_secs = 4 * $PostgreSQL::Test::Utils::timeout_default;
+
+# Initiate 2 background sessions.
+my $background_psql1 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+$background_psql1->set_query_timer_restart();
+
+my $background_psql2 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+$background_psql2->set_query_timer_restart();
+
+# Maintain an active transaction with the table that will be added to the
+# publication.
+$background_psql1->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO tab_conc VALUES (1);
+));
+
+# Add the table to the publication using background_psql, as the alter
+# publication operation will distribute the invalidations to inprogress txns.
+$background_psql2->query_safe(
+	"ALTER PUBLICATION regress_pub1 ADD TABLE tab_conc"
+);
+
+# Complete the transaction on the table.
+$background_psql1->query_safe("COMMIT");
+
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (2);
+));
+
+# Refresh the publication.
+$node_subscriber->safe_psql('postgres',
+	"ALTER SUBSCRIPTION regress_sub1 REFRESH PUBLICATION");
+
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'regress_sub1');
+
+my $result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2),
+	'Ensure that the data from the tab_conc table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+# Perform an insert.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (3);
+));
+$node_publisher->wait_for_catchup('regress_sub1');
+
+# Verify that the insert is replicated to the subscriber.
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+# The bug was that the incremental data synchronization was happening even when
+# tables are dropped from the publication in presence of a concurrent active
+# transaction performing the DML on the same table.
+
+# Maintain an active transaction with the table that will be dropped from the
+# publication.
+$background_psql1->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO tab_conc VALUES (4);
+));
+
+# Drop the table from the publication using background_psql, as the alter
+# publication operation will distribute the invalidations to inprogress txns.
+$background_psql2->query_safe(
+	"ALTER PUBLICATION regress_pub1 DROP TABLE tab_conc"
+);
+
+# Complete the transaction on the table.
+$background_psql1->query_safe("COMMIT");
+
+# Perform an insert.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (5);
+));
+
+$node_publisher->wait_for_catchup('regress_sub1');
+
+# Verify that the insert is not replicated to the subscriber.
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3
+4),
+	'Verify that data for table tab_conc are not replicated to subscriber');
+
+# The bug was that the incremental data synchronization was happening even after
+# publication is dropped in a concurrent active transaction.
+
+# Add tables to the publication.
+$background_psql2->query_safe(
+	"ALTER PUBLICATION regress_pub1 ADD TABLE tab_conc"
+);
+
+# Maintain an active transaction with the table.
+$background_psql1->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO tab_conc VALUES (6);
+));
+
+# Drop publication.
+$background_psql2->query_safe("DROP PUBLICATION regress_pub1");
+
+# Perform an insert.
+$background_psql1->query_safe("INSERT INTO tab_conc VALUES (7)");
+
+# Complete the transaction on the table.
+$background_psql1->query_safe("COMMIT");
+
+# ERROR should appear on subscriber.
+my $offset = -s $node_subscriber->logfile;
+$node_subscriber->wait_for_log(
+	qr/ERROR:  publication "regress_pub1" does not exist/, $offset);
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION regress_sub1");
+
+# The bug was that the incremental data synchronization was happening even after
+# publication is renamed in a concurrent active transaction.
+
+# Create publication.
+$background_psql2->query_safe(
+	"CREATE PUBLICATION regress_pub1 FOR TABLE tab_conc"
+);
+
+# Create subscription.
+$node_subscriber->safe_psql('postgres',
+	"CREATE SUBSCRIPTION regress_sub1 CONNECTION '$publisher_connstr' PUBLICATION regress_pub1"
+);
+
+# Maintain an active transaction with the table.
+$background_psql1->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO tab_conc VALUES (8);
+));
+
+# Rename publication.
+$background_psql2->query_safe(
+	"ALTER PUBLICATION regress_pub1 RENAME TO regress_pub1_rename");
+
+# Perform an insert.
+$background_psql1->query_safe("INSERT INTO tab_conc VALUES (9)");
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe("COMMIT");
+
+# ERROR should appear on subscriber.
+$offset = -s $node_subscriber->logfile;
+$node_subscriber->wait_for_log(
+	qr/ERROR:  publication "regress_pub1" does not exist/, $offset);
+
+$background_psql1->quit;
+$background_psql2->quit;
+
+$node_publisher->stop('fast');
+$node_subscriber->stop('fast');
+
 # The bug was that when the REPLICA IDENTITY FULL is used with dropped or
 # generated columns, we fail to apply updates and deletes
 my $node_publisher_d_cols = get_new_node('node_publisher_d_cols');
-- 
2.43.5

v18_REL_15-0001-Distribute-invalidatons-if-change-in-cata.patch
From 66ba968def9ac774ea18287347437ce8f2796310 Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Fri, 23 Aug 2024 14:02:20 +0530
Subject: [PATCH v18_REL_15] Distribute invalidatons if change in catalog
 tables

Distribute invalidations to inprogress transactions if the current
committed transaction change any catalog table.
---
 .../replication/logical/reorderbuffer.c       |   5 +-
 src/backend/replication/logical/snapbuild.c   |  54 +++-
 src/include/replication/reorderbuffer.h       |   5 +
 src/test/subscription/t/100_bugs.pl           | 267 ++++++++++++++++++
 4 files changed, 314 insertions(+), 17 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index d5cde20a1c9..e5c67c61ebd 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -215,9 +215,6 @@ static const Size max_changes_in_memory = 4096; /* XXX for restore only */
  */
 static ReorderBufferTXN *ReorderBufferGetTXN(ReorderBuffer *rb);
 static void ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static ReorderBufferTXN *ReorderBufferTXNByXid(ReorderBuffer *rb,
-											   TransactionId xid, bool create, bool *is_new,
-											   XLogRecPtr lsn, bool create_as_top);
 static void ReorderBufferTransferSnapToParent(ReorderBufferTXN *txn,
 											  ReorderBufferTXN *subtxn);
 
@@ -615,7 +612,7 @@ ReorderBufferReturnRelids(ReorderBuffer *rb, Oid *relids)
  * (with the given LSN, and as top transaction if that's specified);
  * when this happens, is_new is set to true.
  */
-static ReorderBufferTXN *
+ReorderBufferTXN *
 ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 					  bool *is_new, XLogRecPtr lsn, bool create_as_top)
 {
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index cc1f2a9f154..cd4d712f655 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -290,7 +290,7 @@ static void SnapBuildFreeSnapshot(Snapshot snap);
 
 static void SnapBuildSnapIncRefcount(Snapshot snap);
 
-static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
+static void SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid);
 
 /* xlog reading helper functions for SnapBuildProcessRunningXacts */
 static bool SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *running);
@@ -852,18 +852,21 @@ SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
 }
 
 /*
- * Add a new Snapshot to all transactions we're decoding that currently are
- * in-progress so they can see new catalog contents made by the transaction
- * that just committed. This is necessary because those in-progress
- * transactions will use the new catalog's contents from here on (at the very
- * least everything they do needs to be compatible with newer catalog
- * contents).
+ * Add a new Snapshot and invalidation messages to all transactions we're
+ * decoding that currently are in-progress so they can see new catalog contents
+ * made by the transaction that just committed. This is necessary because those
+ * in-progress transactions will use the new catalog's contents from here on
+ * (at the very least everything they do needs to be compatible with newer
+ * catalog contents).
  */
 static void
-SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
+SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid)
 {
 	dlist_iter	txn_i;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *curr_txn;
+
+	curr_txn = ReorderBufferTXNByXid(builder->reorder, xid, false, NULL, InvalidXLogRecPtr, false);
 
 	/*
 	 * Iterate through all toplevel transactions. This can include
@@ -881,6 +884,14 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		 * transaction which in turn implies we don't yet need a snapshot at
 		 * all. We'll add a snapshot when the first change gets queued.
 		 *
+		 * Similarly, we don't need to add invalidations to a transaction whose
+		 * base snapshot is not yet set. Once a base snapshot is built, it will
+		 * include the xids of committed transactions that have modified the
+		 * catalog, thus reflecting the new catalog contents. The existing
+		 * catalog cache will have already been invalidated after processing
+		 * the invalidations in the transaction that modified catalogs,
+		 * ensuring that a fresh cache is constructed during decoding.
+		 *
 		 * NB: This works correctly even for subtransactions because
 		 * ReorderBufferAssignChild() takes care to transfer the base snapshot
 		 * to the top-level transaction, and while iterating the changequeue
@@ -890,13 +901,13 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 			continue;
 
 		/*
-		 * We don't need to add snapshot to prepared transactions as they
-		 * should not see the new catalog contents.
+		 * We don't need to add snapshot or invalidations to prepared
+		 * transactions as they should not see the new catalog contents.
 		 */
 		if (rbtxn_prepared(txn) || rbtxn_skip_prepared(txn))
 			continue;
 
-		elog(DEBUG2, "adding a new snapshot to %u at %X/%X",
+		elog(DEBUG2, "adding a new snapshot and invalidations to %u at %X/%X",
 			 txn->xid, LSN_FORMAT_ARGS(lsn));
 
 		/*
@@ -906,6 +917,20 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		SnapBuildSnapIncRefcount(builder->snapshot);
 		ReorderBufferAddSnapshot(builder->reorder, txn->xid, lsn,
 								 builder->snapshot);
+
+		/*
+		 * Add invalidation messages to the reorder buffer of inprogress
+		 * transactions except the current committed transaction, for which we
+		 * will execute invalidations at the end.
+		 *
+		 * It is required, otherwise, we will end up using the stale catcache
+		 * contents built by the current transaction even after its decoding
+		 * which should have been invalidated due to concurrent catalog
+		 * changing transaction.
+		 */
+		if (txn->xid != xid && curr_txn->ninvalidations > 0)
+			ReorderBufferAddInvalidations(builder->reorder, txn->xid, lsn,
+										  curr_txn->ninvalidations, curr_txn->invalidations);
 	}
 }
 
@@ -1184,8 +1209,11 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/* refcount of the snapshot builder for the new snapshot */
 		SnapBuildSnapIncRefcount(builder->snapshot);
 
-		/* add a new catalog snapshot to all currently running transactions */
-		SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
+		/*
+		 * add a new catalog snapshot and invalidations messages to all
+		 * currently running transactions
+		 */
+		SnapBuildDistributeSnapshotAndInval(builder, lsn, xid);
 	}
 }
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 4a01f877e5d..0b20f5c0549 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -680,6 +680,11 @@ extern TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
 extern void ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
 
+extern ReorderBufferTXN *ReorderBufferTXNByXid(ReorderBuffer *rb,
+											   TransactionId xid, bool create,
+											   bool *is_new, XLogRecPtr lsn,
+											   bool create_as_top);
+
 extern void StartupReorderBuffer(void);
 
 #endif
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 28ca4affbb9..fe55369c80c 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -415,6 +415,273 @@ $result = $node_subscriber->safe_psql('postgres',
 is($result, qq(2|f
 3|t), 'check replicated update on subscriber');
 
+# Clean up
+$node_publisher->safe_psql('postgres', "DROP PUBLICATION pub1");
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION sub1");
+
+# The bug was that the incremental data synchronization was being skipped when
+# a new table is added to the publication in presence of a concurrent active
+# transaction performing the DML on the same table.
+
+# Initial setup.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE SCHEMA sch3;
+	CREATE TABLE sch3.tab_conc(a int);
+	CREATE PUBLICATION regress_pub1;
+));
+
+$node_subscriber->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE SCHEMA sch3;
+	CREATE TABLE sch3.tab_conc(a int);
+	CREATE SUBSCRIPTION regress_sub1 CONNECTION '$publisher_connstr' PUBLICATION regress_pub1;
+));
+
+# Bump the query timeout to avoid false negatives on slow test systems.
+my $psql_timeout_secs = 4 * $PostgreSQL::Test::Utils::timeout_default;
+
+# Initiate 3 background sessions.
+my $background_psql1 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+$background_psql1->set_query_timer_restart();
+
+my $background_psql2 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+
+$background_psql2->set_query_timer_restart();
+
+my $background_psql3 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+$background_psql3->set_query_timer_restart();
+
+# Maintain an active transaction with the table that will be added to the
+# publication.
+$background_psql1->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO tab_conc VALUES (1);
+));
+
+# Maintain an active transaction with a schema table that will be added to the
+# publication.
+$background_psql2->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (1);
+));
+
+# Add the table to the publication using background_psql, as the alter
+# publication operation will distribute the invalidations to inprogress txns.
+$background_psql3->query_safe(
+	"ALTER PUBLICATION regress_pub1 ADD TABLE tab_conc, TABLES IN SCHEMA sch3"
+);
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe("COMMIT");
+$background_psql2->query_safe("COMMIT");
+
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (2);
+	INSERT INTO sch3.tab_conc VALUES (2);
+));
+
+# Refresh the publication.
+$node_subscriber->safe_psql('postgres',
+	"ALTER SUBSCRIPTION regress_sub1 REFRESH PUBLICATION");
+
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'regress_sub1');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2),
+	'Ensure that the data from the tab_conc table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2),
+	'Ensure that the data from the sch3.tab_conc table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+# Perform an insert.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (3);
+	INSERT INTO sch3.tab_conc VALUES (3);
+));
+$node_publisher->wait_for_catchup('regress_sub1');
+
+# Verify that the insert is replicated to the subscriber.
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table sch3.tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+# The bug was that the incremental data synchronization was happening even when
+# tables are dropped from the publication in presence of a concurrent active
+# transaction performing the DML on the same table.
+
+# Maintain an active transaction with the table that will be dropped from the
+# publication.
+$background_psql1->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO tab_conc VALUES (4);
+));
+
+# Maintain an active transaction with a schema table that will be dropped from the
+# publication.
+$background_psql2->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (4);
+));
+
+# Drop the table from the publication using background_psql, as the alter
+# publication operation will distribute the invalidations to inprogress txns.
+$background_psql3->query_safe(
+	"ALTER PUBLICATION regress_pub1 DROP TABLE tab_conc, TABLES IN SCHEMA sch3"
+);
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe("COMMIT");
+$background_psql2->query_safe("COMMIT");
+
+# Perform an insert.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (5);
+	INSERT INTO sch3.tab_conc VALUES (5);
+));
+
+$node_publisher->wait_for_catchup('regress_sub1');
+
+# Verify that the insert is not replicated to the subscriber.
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3
+4),
+	'Verify that data for table tab_conc are not replicated to subscriber');
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2
+3
+4),
+	'Verify that the incremental data for table sch3.tab_conc are not replicated to subscriber'
+);
+
+# The bug was that the incremental data synchronization was happening even after
+# publication is dropped in a concurrent active transaction.
+
+# Add tables to the publication.
+$background_psql3->query_safe(
+	"ALTER PUBLICATION regress_pub1 ADD TABLE tab_conc, TABLES IN SCHEMA sch3"
+);
+
+# Maintain an active transaction with the table.
+$background_psql1->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO tab_conc VALUES (6);
+));
+
+# Maintain an active transaction with a schema table.
+$background_psql2->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (6);
+));
+
+# Drop publication.
+$background_psql3->query_safe("DROP PUBLICATION regress_pub1");
+
+# Perform an insert.
+$background_psql1->query_safe("INSERT INTO tab_conc VALUES (7)");
+$background_psql2->query_safe("INSERT INTO sch3.tab_conc VALUES (7)");
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe("COMMIT");
+$background_psql2->query_safe("COMMIT");
+
+# ERROR should appear on subscriber.
+my $offset = -s $node_subscriber->logfile;
+$node_subscriber->wait_for_log(
+	qr/ERROR:  publication "regress_pub1" does not exist/, $offset);
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION regress_sub1");
+
+# The bug was that the incremental data synchronization was happening even after
+# publication is renamed in a concurrent active transaction.
+
+# Create publication.
+$background_psql3->query_safe(
+	"CREATE PUBLICATION regress_pub1 FOR TABLE tab_conc, TABLES IN SCHEMA sch3"
+);
+
+# Create subscription.
+$node_subscriber->safe_psql('postgres',
+	"CREATE SUBSCRIPTION regress_sub1 CONNECTION '$publisher_connstr' PUBLICATION regress_pub1"
+);
+
+# Maintain an active transaction with the table.
+$background_psql1->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO tab_conc VALUES (8);
+));
+
+# Maintain an active transaction with a schema table.
+$background_psql2->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (8);
+));
+
+# Rename publication.
+$background_psql3->query_safe(
+	"ALTER PUBLICATION regress_pub1 RENAME TO regress_pub1_rename");
+
+# Perform an insert.
+$background_psql1->query_safe("INSERT INTO tab_conc VALUES (9)");
+$background_psql2->query_safe("INSERT INTO sch3.tab_conc VALUES (9)");
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe("COMMIT");
+$background_psql2->query_safe("COMMIT");
+
+# ERROR should appear on subscriber.
+$offset = -s $node_subscriber->logfile;
+$node_subscriber->wait_for_log(
+	qr/ERROR:  publication "regress_pub1" does not exist/, $offset);
+
+$background_psql1->quit;
+$background_psql2->quit;
+$background_psql3->quit;
+
 $node_publisher->stop('fast');
 $node_subscriber->stop('fast');
 
-- 
2.43.5

v18_REL_16-0001-Distribute-invalidatons-if-change-in-cata.patch
From 93611517dfd60e6cd2869d17ea9762bbb2707250 Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Fri, 23 Aug 2024 14:02:20 +0530
Subject: [PATCH v18_REL_16] Distribute invalidatons if change in catalog
 tables

Distribute invalidations to inprogress transactions if the current
committed transaction change any catalog table.
---
 .../replication/logical/reorderbuffer.c       |   5 +-
 src/backend/replication/logical/snapbuild.c   |  54 +++-
 src/include/replication/reorderbuffer.h       |   5 +
 src/test/subscription/t/100_bugs.pl           | 267 ++++++++++++++++++
 4 files changed, 314 insertions(+), 17 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 930549948a..2539e84f3d 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -218,9 +218,6 @@ int			debug_logical_replication_streaming = DEBUG_LOGICAL_REP_STREAMING_BUFFERED
  */
 static ReorderBufferTXN *ReorderBufferGetTXN(ReorderBuffer *rb);
 static void ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static ReorderBufferTXN *ReorderBufferTXNByXid(ReorderBuffer *rb,
-											   TransactionId xid, bool create, bool *is_new,
-											   XLogRecPtr lsn, bool create_as_top);
 static void ReorderBufferTransferSnapToParent(ReorderBufferTXN *txn,
 											  ReorderBufferTXN *subtxn);
 
@@ -619,7 +616,7 @@ ReorderBufferReturnRelids(ReorderBuffer *rb, Oid *relids)
  * (with the given LSN, and as top transaction if that's specified);
  * when this happens, is_new is set to true.
  */
-static ReorderBufferTXN *
+ReorderBufferTXN *
 ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 					  bool *is_new, XLogRecPtr lsn, bool create_as_top)
 {
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 3ed2f79dd0..bf3cfdee48 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -288,7 +288,7 @@ static void SnapBuildFreeSnapshot(Snapshot snap);
 
 static void SnapBuildSnapIncRefcount(Snapshot snap);
 
-static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
+static void SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid);
 
 static inline bool SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid,
 												 uint32 xinfo);
@@ -845,18 +845,21 @@ SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
 }
 
 /*
- * Add a new Snapshot to all transactions we're decoding that currently are
- * in-progress so they can see new catalog contents made by the transaction
- * that just committed. This is necessary because those in-progress
- * transactions will use the new catalog's contents from here on (at the very
- * least everything they do needs to be compatible with newer catalog
- * contents).
+ * Add a new Snapshot and invalidation messages to all transactions we're
+ * decoding that currently are in-progress so they can see new catalog contents
+ * made by the transaction that just committed. This is necessary because those
+ * in-progress transactions will use the new catalog's contents from here on
+ * (at the very least everything they do needs to be compatible with newer
+ * catalog contents).
  */
 static void
-SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
+SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid)
 {
 	dlist_iter	txn_i;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *curr_txn;
+
+	curr_txn = ReorderBufferTXNByXid(builder->reorder, xid, false, NULL, InvalidXLogRecPtr, false);
 
 	/*
 	 * Iterate through all toplevel transactions. This can include
@@ -874,6 +877,14 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		 * transaction which in turn implies we don't yet need a snapshot at
 		 * all. We'll add a snapshot when the first change gets queued.
 		 *
+		 * Similarly, we don't need to add invalidations to a transaction whose
+		 * base snapshot is not yet set. Once a base snapshot is built, it will
+		 * include the xids of committed transactions that have modified the
+		 * catalog, thus reflecting the new catalog contents. The existing
+		 * catalog cache will have already been invalidated after processing
+		 * the invalidations in the transaction that modified catalogs,
+		 * ensuring that a fresh cache is constructed during decoding.
+		 *
 		 * NB: This works correctly even for subtransactions because
 		 * ReorderBufferAssignChild() takes care to transfer the base snapshot
 		 * to the top-level transaction, and while iterating the changequeue
@@ -883,13 +894,13 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 			continue;
 
 		/*
-		 * We don't need to add snapshot to prepared transactions as they
-		 * should not see the new catalog contents.
+		 * We don't need to add snapshot or invalidations to prepared
+		 * transactions as they should not see the new catalog contents.
 		 */
 		if (rbtxn_prepared(txn) || rbtxn_skip_prepared(txn))
 			continue;
 
-		elog(DEBUG2, "adding a new snapshot to %u at %X/%X",
+		elog(DEBUG2, "adding a new snapshot and invalidations to %u at %X/%X",
 			 txn->xid, LSN_FORMAT_ARGS(lsn));
 
 		/*
@@ -899,6 +910,20 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		SnapBuildSnapIncRefcount(builder->snapshot);
 		ReorderBufferAddSnapshot(builder->reorder, txn->xid, lsn,
 								 builder->snapshot);
+
+		/*
+		 * Add invalidation messages to the reorder buffer of inprogress
+		 * transactions except the current committed transaction, for which we
+		 * will execute invalidations at the end.
+		 *
+		 * It is required, otherwise, we will end up using the stale catcache
+		 * contents built by the current transaction even after its decoding
+		 * which should have been invalidated due to concurrent catalog
+		 * changing transaction.
+		 */
+		if (txn->xid != xid && curr_txn->ninvalidations > 0)
+			ReorderBufferAddInvalidations(builder->reorder, txn->xid, lsn,
+										  curr_txn->ninvalidations, curr_txn->invalidations);
 	}
 }
 
@@ -1170,8 +1195,11 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/* refcount of the snapshot builder for the new snapshot */
 		SnapBuildSnapIncRefcount(builder->snapshot);
 
-		/* add a new catalog snapshot to all currently running transactions */
-		SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
+		/*
+		 * add a new catalog snapshot and invalidations messages to all
+		 * currently running transactions
+		 */
+		SnapBuildDistributeSnapshotAndInval(builder, lsn, xid);
 	}
 }
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 3cb03168de..6403fb2689 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -748,6 +748,11 @@ extern TransactionId *ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb);
 
 extern void ReorderBufferSetRestartPoint(ReorderBuffer *rb, XLogRecPtr ptr);
 
+extern ReorderBufferTXN *ReorderBufferTXNByXid(ReorderBuffer *rb,
+											   TransactionId xid, bool create,
+											   bool *is_new, XLogRecPtr lsn,
+											   bool create_as_top);
+
 extern void StartupReorderBuffer(void);
 
 #endif
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 091da5a506..3fead828d0 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -488,6 +488,273 @@ $result = $node_subscriber->safe_psql('postgres',
 is($result, qq(2|f
 3|t), 'check replicated update on subscriber');
 
+# Clean up
+$node_publisher->safe_psql('postgres', "DROP PUBLICATION pub1");
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION sub1");
+
+# The bug was that the incremental data synchronization was being skipped when
+# a new table is added to the publication in presence of a concurrent active
+# transaction performing the DML on the same table.
+
+# Initial setup.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE SCHEMA sch3;
+	CREATE TABLE sch3.tab_conc(a int);
+	CREATE PUBLICATION regress_pub1;
+));
+
+$node_subscriber->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE SCHEMA sch3;
+	CREATE TABLE sch3.tab_conc(a int);
+	CREATE SUBSCRIPTION regress_sub1 CONNECTION '$publisher_connstr' PUBLICATION regress_pub1;
+));
+
+# Bump the query timeout to avoid false negatives on slow test systems.
+my $psql_timeout_secs = 4 * $PostgreSQL::Test::Utils::timeout_default;
+
+# Initiate 3 background sessions.
+my $background_psql1 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+$background_psql1->set_query_timer_restart();
+
+my $background_psql2 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+
+$background_psql2->set_query_timer_restart();
+
+my $background_psql3 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+$background_psql3->set_query_timer_restart();
+
+# Maintain an active transaction with the table that will be added to the
+# publication.
+$background_psql1->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO tab_conc VALUES (1);
+));
+
+# Maintain an active transaction with a schema table that will be added to the
+# publication.
+$background_psql2->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (1);
+));
+
+# Add the table to the publication using background_psql, as the alter
+# publication operation will distribute the invalidations to inprogress txns.
+$background_psql3->query_safe(
+	"ALTER PUBLICATION regress_pub1 ADD TABLE tab_conc, TABLES IN SCHEMA sch3"
+);
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe("COMMIT");
+$background_psql2->query_safe("COMMIT");
+
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (2);
+	INSERT INTO sch3.tab_conc VALUES (2);
+));
+
+# Refresh the publication.
+$node_subscriber->safe_psql('postgres',
+	"ALTER SUBSCRIPTION regress_sub1 REFRESH PUBLICATION");
+
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'regress_sub1');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2),
+	'Ensure that the data from the tab_conc table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2),
+	'Ensure that the data from the sch3.tab_conc table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+# Perform an insert.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (3);
+	INSERT INTO sch3.tab_conc VALUES (3);
+));
+$node_publisher->wait_for_catchup('regress_sub1');
+
+# Verify that the insert is replicated to the subscriber.
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table sch3.tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+# The bug was that the incremental data synchronization was happening even when
+# tables are dropped from the publication in presence of a concurrent active
+# transaction performing the DML on the same table.
+
+# Maintain an active transaction with the table that will be dropped from the
+# publication.
+$background_psql1->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO tab_conc VALUES (4);
+));
+
+# Maintain an active transaction with a schema table that will be dropped from the
+# publication.
+$background_psql2->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (4);
+));
+
+# Drop the table from the publication using background_psql, as the alter
+# publication operation will distribute the invalidations to inprogress txns.
+$background_psql3->query_safe(
+	"ALTER PUBLICATION regress_pub1 DROP TABLE tab_conc, TABLES IN SCHEMA sch3"
+);
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe("COMMIT");
+$background_psql2->query_safe("COMMIT");
+
+# Perform an insert.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (5);
+	INSERT INTO sch3.tab_conc VALUES (5);
+));
+
+$node_publisher->wait_for_catchup('regress_sub1');
+
+# Verify that the insert is not replicated to the subscriber.
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3
+4),
+	'Verify that data for table tab_conc are not replicated to subscriber');
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2
+3
+4),
+	'Verify that the incremental data for table sch3.tab_conc are not replicated to subscriber'
+);
+
+# The bug was that the incremental data synchronization was happening even after
+# publication is dropped in a concurrent active transaction.
+
+# Add tables to the publication.
+$background_psql3->query_safe(
+	"ALTER PUBLICATION regress_pub1 ADD TABLE tab_conc, TABLES IN SCHEMA sch3"
+);
+
+# Maintain an active transaction with the table.
+$background_psql1->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO tab_conc VALUES (6);
+));
+
+# Maintain an active transaction with a schema table.
+$background_psql2->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (6);
+));
+
+# Drop publication.
+$background_psql3->query_safe("DROP PUBLICATION regress_pub1");
+
+# Perform an insert.
+$background_psql1->query_safe("INSERT INTO tab_conc VALUES (7)");
+$background_psql2->query_safe("INSERT INTO sch3.tab_conc VALUES (7)");
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe("COMMIT");
+$background_psql2->query_safe("COMMIT");
+
+# ERROR should appear on subscriber.
+my $offset = -s $node_subscriber->logfile;
+$node_subscriber->wait_for_log(
+	qr/ERROR:  publication "regress_pub1" does not exist/, $offset);
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION regress_sub1");
+
+# The bug was that the incremental data synchronization was happening even after
+# publication is renamed in a concurrent active transaction.
+
+# Create publication.
+$background_psql3->query_safe(
+	"CREATE PUBLICATION regress_pub1 FOR TABLE tab_conc, TABLES IN SCHEMA sch3"
+);
+
+# Create subscription.
+$node_subscriber->safe_psql('postgres',
+	"CREATE SUBSCRIPTION regress_sub1 CONNECTION '$publisher_connstr' PUBLICATION regress_pub1"
+);
+
+# Maintain an active transaction with the table.
+$background_psql1->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO tab_conc VALUES (8);
+));
+
+# Maintain an active transaction with a schema table.
+$background_psql2->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (8);
+));
+
+# Rename publication.
+$background_psql3->query_safe(
+	"ALTER PUBLICATION regress_pub1 RENAME TO regress_pub1_rename");
+
+# Perform an insert.
+$background_psql1->query_safe("INSERT INTO tab_conc VALUES (9)");
+$background_psql2->query_safe("INSERT INTO sch3.tab_conc VALUES (9)");
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe("COMMIT");
+$background_psql2->query_safe("COMMIT");
+
+# ERROR should appear on subscriber.
+$offset = -s $node_subscriber->logfile;
+$node_subscriber->wait_for_log(
+	qr/ERROR:  publication "regress_pub1" does not exist/, $offset);
+
+$background_psql1->quit;
+$background_psql2->quit;
+$background_psql3->quit;
+
 $node_publisher->stop('fast');
 $node_subscriber->stop('fast');
 
-- 
2.43.5

v18_REL_17-0001-Distribute-invalidatons-if-change-in-cata.patch
From 0d681d19832992088460f54b9af82970498daf39 Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Fri, 23 Aug 2024 14:02:20 +0530
Subject: [PATCH v18_REL_17] Distribute invalidatons if change in catalog
 tables

Distribute invalidations to inprogress transactions if the current
committed transaction change any catalog table.
---
 .../replication/logical/reorderbuffer.c       |   5 +-
 src/backend/replication/logical/snapbuild.c   |  54 +++-
 src/include/replication/reorderbuffer.h       |   5 +
 src/test/subscription/t/100_bugs.pl           | 267 ++++++++++++++++++
 4 files changed, 314 insertions(+), 17 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 9c742e96eb..f810132039 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -221,9 +221,6 @@ int			debug_logical_replication_streaming = DEBUG_LOGICAL_REP_STREAMING_BUFFERED
  */
 static ReorderBufferTXN *ReorderBufferGetTXN(ReorderBuffer *rb);
 static void ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static ReorderBufferTXN *ReorderBufferTXNByXid(ReorderBuffer *rb,
-											   TransactionId xid, bool create, bool *is_new,
-											   XLogRecPtr lsn, bool create_as_top);
 static void ReorderBufferTransferSnapToParent(ReorderBufferTXN *txn,
 											  ReorderBufferTXN *subtxn);
 
@@ -627,7 +624,7 @@ ReorderBufferReturnRelids(ReorderBuffer *rb, Oid *relids)
  * (with the given LSN, and as top transaction if that's specified);
  * when this happens, is_new is set to true.
  */
-static ReorderBufferTXN *
+ReorderBufferTXN *
 ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 					  bool *is_new, XLogRecPtr lsn, bool create_as_top)
 {
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index ae676145e6..c603daf1e9 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -300,7 +300,7 @@ static void SnapBuildFreeSnapshot(Snapshot snap);
 
 static void SnapBuildSnapIncRefcount(Snapshot snap);
 
-static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
+static void SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid);
 
 static inline bool SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid,
 												 uint32 xinfo);
@@ -859,18 +859,21 @@ SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
 }
 
 /*
- * Add a new Snapshot to all transactions we're decoding that currently are
- * in-progress so they can see new catalog contents made by the transaction
- * that just committed. This is necessary because those in-progress
- * transactions will use the new catalog's contents from here on (at the very
- * least everything they do needs to be compatible with newer catalog
- * contents).
+ * Add a new Snapshot and invalidation messages to all transactions we're
+ * decoding that currently are in-progress so they can see new catalog contents
+ * made by the transaction that just committed. This is necessary because those
+ * in-progress transactions will use the new catalog's contents from here on
+ * (at the very least everything they do needs to be compatible with newer
+ * catalog contents).
  */
 static void
-SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
+SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid)
 {
 	dlist_iter	txn_i;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *curr_txn;
+
+	curr_txn = ReorderBufferTXNByXid(builder->reorder, xid, false, NULL, InvalidXLogRecPtr, false);
 
 	/*
 	 * Iterate through all toplevel transactions. This can include
@@ -888,6 +891,14 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		 * transaction which in turn implies we don't yet need a snapshot at
 		 * all. We'll add a snapshot when the first change gets queued.
 		 *
+		 * Similarly, we don't need to add invalidations to a transaction whose
+		 * base snapshot is not yet set. Once a base snapshot is built, it will
+		 * include the xids of committed transactions that have modified the
+		 * catalog, thus reflecting the new catalog contents. The existing
+		 * catalog cache will have already been invalidated after processing
+		 * the invalidations in the transaction that modified catalogs,
+		 * ensuring that a fresh cache is constructed during decoding.
+		 *
 		 * NB: This works correctly even for subtransactions because
 		 * ReorderBufferAssignChild() takes care to transfer the base snapshot
 		 * to the top-level transaction, and while iterating the changequeue
@@ -897,13 +908,13 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 			continue;
 
 		/*
-		 * We don't need to add snapshot to prepared transactions as they
-		 * should not see the new catalog contents.
+		 * We don't need to add snapshot or invalidations to prepared
+		 * transactions as they should not see the new catalog contents.
 		 */
 		if (rbtxn_prepared(txn) || rbtxn_skip_prepared(txn))
 			continue;
 
-		elog(DEBUG2, "adding a new snapshot to %u at %X/%X",
+		elog(DEBUG2, "adding a new snapshot and invalidations to %u at %X/%X",
 			 txn->xid, LSN_FORMAT_ARGS(lsn));
 
 		/*
@@ -913,6 +924,20 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		SnapBuildSnapIncRefcount(builder->snapshot);
 		ReorderBufferAddSnapshot(builder->reorder, txn->xid, lsn,
 								 builder->snapshot);
+
+		/*
+		 * Add invalidation messages to the reorder buffer of inprogress
+		 * transactions except the current committed transaction, for which we
+		 * will execute invalidations at the end.
+		 *
+		 * It is required, otherwise, we will end up using the stale catcache
+		 * contents built by the current transaction even after its decoding
+		 * which should have been invalidated due to concurrent catalog
+		 * changing transaction.
+		 */
+		if (txn->xid != xid && curr_txn->ninvalidations > 0)
+			ReorderBufferAddInvalidations(builder->reorder, txn->xid, lsn,
+										  curr_txn->ninvalidations, curr_txn->invalidations);
 	}
 }
 
@@ -1184,8 +1209,11 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/* refcount of the snapshot builder for the new snapshot */
 		SnapBuildSnapIncRefcount(builder->snapshot);
 
-		/* add a new catalog snapshot to all currently running transactions */
-		SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
+		/*
+		 * add a new catalog snapshot and invalidations messages to all
+		 * currently running transactions
+		 */
+		SnapBuildDistributeSnapshotAndInval(builder, lsn, xid);
 	}
 }
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 7de50462dc..4fe0878c71 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -729,6 +729,11 @@ extern TransactionId *ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb);
 
 extern void ReorderBufferSetRestartPoint(ReorderBuffer *rb, XLogRecPtr ptr);
 
+extern ReorderBufferTXN *ReorderBufferTXNByXid(ReorderBuffer *rb,
+											   TransactionId xid, bool create,
+											   bool *is_new, XLogRecPtr lsn,
+											   bool create_as_top);
+
 extern void StartupReorderBuffer(void);
 
 #endif
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index cb36ca7b16..72aaaae272 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -487,6 +487,273 @@ $result =
 is( $result, qq(2|f
 3|t), 'check replicated update on subscriber');
 
+# Clean up
+$node_publisher->safe_psql('postgres', "DROP PUBLICATION pub1");
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION sub1");
+
+# The bug was that the incremental data synchronization was being skipped when
+# a new table is added to the publication in presence of a concurrent active
+# transaction performing the DML on the same table.
+
+# Initial setup.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE SCHEMA sch3;
+	CREATE TABLE sch3.tab_conc(a int);
+	CREATE PUBLICATION regress_pub1;
+));
+
+$node_subscriber->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE SCHEMA sch3;
+	CREATE TABLE sch3.tab_conc(a int);
+	CREATE SUBSCRIPTION regress_sub1 CONNECTION '$publisher_connstr' PUBLICATION regress_pub1;
+));
+
+# Bump the query timeout to avoid false negatives on slow test systems.
+my $psql_timeout_secs = 4 * $PostgreSQL::Test::Utils::timeout_default;
+
+# Initiate 3 background sessions.
+my $background_psql1 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+$background_psql1->set_query_timer_restart();
+
+my $background_psql2 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+
+$background_psql2->set_query_timer_restart();
+
+my $background_psql3 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+$background_psql3->set_query_timer_restart();
+
+# Maintain an active transaction with the table that will be added to the
+# publication.
+$background_psql1->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO tab_conc VALUES (1);
+));
+
+# Maintain an active transaction with a schema table that will be added to the
+# publication.
+$background_psql2->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (1);
+));
+
+# Add the table to the publication using background_psql, as the alter
+# publication operation will distribute the invalidations to inprogress txns.
+$background_psql3->query_safe(
+	"ALTER PUBLICATION regress_pub1 ADD TABLE tab_conc, TABLES IN SCHEMA sch3"
+);
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe("COMMIT");
+$background_psql2->query_safe("COMMIT");
+
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (2);
+	INSERT INTO sch3.tab_conc VALUES (2);
+));
+
+# Refresh the publication.
+$node_subscriber->safe_psql('postgres',
+	"ALTER SUBSCRIPTION regress_sub1 REFRESH PUBLICATION");
+
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'regress_sub1');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2),
+	'Ensure that the data from the tab_conc table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2),
+	'Ensure that the data from the sch3.tab_conc table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+# Perform an insert.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (3);
+	INSERT INTO sch3.tab_conc VALUES (3);
+));
+$node_publisher->wait_for_catchup('regress_sub1');
+
+# Verify that the insert is replicated to the subscriber.
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table sch3.tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+# The bug was that the incremental data synchronization was happening even when
+# tables are dropped from the publication in presence of a concurrent active
+# transaction performing the DML on the same table.
+
+# Maintain an active transaction with the table that will be dropped from the
+# publication.
+$background_psql1->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO tab_conc VALUES (4);
+));
+
+# Maintain an active transaction with a schema table that will be dropped from the
+# publication.
+$background_psql2->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (4);
+));
+
+# Drop the table from the publication using background_psql, as the alter
+# publication operation will distribute the invalidations to inprogress txns.
+$background_psql3->query_safe(
+	"ALTER PUBLICATION regress_pub1 DROP TABLE tab_conc, TABLES IN SCHEMA sch3"
+);
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe("COMMIT");
+$background_psql2->query_safe("COMMIT");
+
+# Perform an insert.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (5);
+	INSERT INTO sch3.tab_conc VALUES (5);
+));
+
+$node_publisher->wait_for_catchup('regress_sub1');
+
+# Verify that the insert is not replicated to the subscriber.
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3
+4),
+	'Verify that data for table tab_conc are not replicated to subscriber');
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2
+3
+4),
+	'Verify that the incremental data for table sch3.tab_conc are not replicated to subscriber'
+);
+
+# The bug was that the incremental data synchronization was happening even after
+# publication is dropped in a concurrent active transaction.
+
+# Add tables to the publication.
+$background_psql3->query_safe(
+	"ALTER PUBLICATION regress_pub1 ADD TABLE tab_conc, TABLES IN SCHEMA sch3"
+);
+
+# Maintain an active transaction with the table.
+$background_psql1->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO tab_conc VALUES (6);
+));
+
+# Maintain an active transaction with a schema table.
+$background_psql2->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (6);
+));
+
+# Drop publication.
+$background_psql3->query_safe("DROP PUBLICATION regress_pub1");
+
+# Perform an insert.
+$background_psql1->query_safe("INSERT INTO tab_conc VALUES (7)");
+$background_psql2->query_safe("INSERT INTO sch3.tab_conc VALUES (7)");
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe("COMMIT");
+$background_psql2->query_safe("COMMIT");
+
+# ERROR should appear on subscriber.
+my $offset = -s $node_subscriber->logfile;
+$node_subscriber->wait_for_log(
+	qr/ERROR:  publication "regress_pub1" does not exist/, $offset);
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION regress_sub1");
+
+# The bug was that the incremental data synchronization was happening even after
+# publication is renamed in a concurrent active transaction.
+
+# Create publication.
+$background_psql3->query_safe(
+	"CREATE PUBLICATION regress_pub1 FOR TABLE tab_conc, TABLES IN SCHEMA sch3"
+);
+
+# Create subscription.
+$node_subscriber->safe_psql('postgres',
+	"CREATE SUBSCRIPTION regress_sub1 CONNECTION '$publisher_connstr' PUBLICATION regress_pub1"
+);
+
+# Maintain an active transaction with the table.
+$background_psql1->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO tab_conc VALUES (8);
+));
+
+# Maintain an active transaction with a schema table.
+$background_psql2->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (8);
+));
+
+# Rename publication.
+$background_psql3->query_safe(
+	"ALTER PUBLICATION regress_pub1 RENAME TO regress_pub1_rename");
+
+# Perform an insert.
+$background_psql1->query_safe("INSERT INTO tab_conc VALUES (9)");
+$background_psql2->query_safe("INSERT INTO sch3.tab_conc VALUES (9)");
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe("COMMIT");
+$background_psql2->query_safe("COMMIT");
+
+# ERROR should appear on subscriber.
+$offset = -s $node_subscriber->logfile;
+$node_subscriber->wait_for_log(
+	qr/ERROR:  publication "regress_pub1" does not exist/, $offset);
+
+$background_psql1->quit;
+$background_psql2->quit;
+$background_psql3->quit;
+
 $node_publisher->stop('fast');
 $node_subscriber->stop('fast');
 
-- 
2.43.5

v18-0001-Distribute-invalidatons-if-change-in-catalog-tab.patch
From e34c0ceebdd5ec2339450ba2c03d3818f2cf419b Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Fri, 23 Aug 2024 14:02:20 +0530
Subject: [PATCH v18] Distribute invalidations if change in catalog tables

Distribute invalidations to in-progress transactions if the current
committed transaction changes any catalog table.
---
 .../replication/logical/reorderbuffer.c       |   5 +-
 src/backend/replication/logical/snapbuild.c   |  54 +++-
 src/include/replication/reorderbuffer.h       |   5 +
 src/test/subscription/t/100_bugs.pl           | 267 ++++++++++++++++++
 4 files changed, 314 insertions(+), 17 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 977fbcd247..1172afb3c5 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -222,9 +222,6 @@ int			debug_logical_replication_streaming = DEBUG_LOGICAL_REP_STREAMING_BUFFERED
  */
 static ReorderBufferTXN *ReorderBufferAllocTXN(ReorderBuffer *rb);
 static void ReorderBufferFreeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static ReorderBufferTXN *ReorderBufferTXNByXid(ReorderBuffer *rb,
-											   TransactionId xid, bool create, bool *is_new,
-											   XLogRecPtr lsn, bool create_as_top);
 static void ReorderBufferTransferSnapToParent(ReorderBufferTXN *txn,
 											  ReorderBufferTXN *subtxn);
 
@@ -630,7 +627,7 @@ ReorderBufferFreeRelids(ReorderBuffer *rb, Oid *relids)
  * (with the given LSN, and as top transaction if that's specified);
  * when this happens, is_new is set to true.
  */
-static ReorderBufferTXN *
+ReorderBufferTXN *
 ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 					  bool *is_new, XLogRecPtr lsn, bool create_as_top)
 {
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index b64e53de01..16acb50614 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -161,7 +161,7 @@ static void SnapBuildFreeSnapshot(Snapshot snap);
 
 static void SnapBuildSnapIncRefcount(Snapshot snap);
 
-static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
+static void SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid);
 
 static inline bool SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid,
 												 uint32 xinfo);
@@ -720,18 +720,21 @@ SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
 }
 
 /*
- * Add a new Snapshot to all transactions we're decoding that currently are
- * in-progress so they can see new catalog contents made by the transaction
- * that just committed. This is necessary because those in-progress
- * transactions will use the new catalog's contents from here on (at the very
- * least everything they do needs to be compatible with newer catalog
- * contents).
+ * Add a new Snapshot and invalidation messages to all transactions we're
+ * decoding that currently are in-progress so they can see new catalog contents
+ * made by the transaction that just committed. This is necessary because those
+ * in-progress transactions will use the new catalog's contents from here on
+ * (at the very least everything they do needs to be compatible with newer
+ * catalog contents).
  */
 static void
-SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
+SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid)
 {
 	dlist_iter	txn_i;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *curr_txn;
+
+	curr_txn = ReorderBufferTXNByXid(builder->reorder, xid, false, NULL, InvalidXLogRecPtr, false);
 
 	/*
 	 * Iterate through all toplevel transactions. This can include
@@ -749,6 +752,14 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		 * transaction which in turn implies we don't yet need a snapshot at
 		 * all. We'll add a snapshot when the first change gets queued.
 		 *
+		 * Similarly, we don't need to add invalidations to a transaction whose
+		 * base snapshot is not yet set. Once a base snapshot is built, it will
+		 * include the xids of committed transactions that have modified the
+		 * catalog, thus reflecting the new catalog contents. The existing
+		 * catalog cache will have already been invalidated after processing
+		 * the invalidations in the transaction that modified catalogs,
+		 * ensuring that a fresh cache is constructed during decoding.
+		 *
 		 * NB: This works correctly even for subtransactions because
 		 * ReorderBufferAssignChild() takes care to transfer the base snapshot
 		 * to the top-level transaction, and while iterating the changequeue
@@ -758,13 +769,13 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 			continue;
 
 		/*
-		 * We don't need to add snapshot to prepared transactions as they
-		 * should not see the new catalog contents.
+		 * We don't need to add snapshot or invalidations to prepared
+		 * transactions as they should not see the new catalog contents.
 		 */
 		if (rbtxn_is_prepared(txn))
 			continue;
 
-		elog(DEBUG2, "adding a new snapshot to %u at %X/%X",
+		elog(DEBUG2, "adding a new snapshot and invalidations to %u at %X/%X",
 			 txn->xid, LSN_FORMAT_ARGS(lsn));
 
 		/*
@@ -774,6 +785,20 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		SnapBuildSnapIncRefcount(builder->snapshot);
 		ReorderBufferAddSnapshot(builder->reorder, txn->xid, lsn,
 								 builder->snapshot);
+
+		/*
+		 * Add invalidation messages to the reorder buffer of inprogress
+		 * transactions except the current committed transaction, for which we
+		 * will execute invalidations at the end.
+		 *
+		 * It is required, otherwise, we will end up using the stale catcache
+		 * contents built by the current transaction even after its decoding
+		 * which should have been invalidated due to concurrent catalog
+		 * changing transaction.
+		 */
+		if (txn->xid != xid && curr_txn->ninvalidations > 0)
+			ReorderBufferAddInvalidations(builder->reorder, txn->xid, lsn,
+										  curr_txn->ninvalidations, curr_txn->invalidations);
 	}
 }
 
@@ -1045,8 +1070,11 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/* refcount of the snapshot builder for the new snapshot */
 		SnapBuildSnapIncRefcount(builder->snapshot);
 
-		/* add a new catalog snapshot to all currently running transactions */
-		SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
+		/*
+		 * add a new catalog snapshot and invalidations messages to all
+		 * currently running transactions
+		 */
+		SnapBuildDistributeSnapshotAndInval(builder, lsn, xid);
 	}
 }
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 3be0cbd7eb..481d547407 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -758,6 +758,11 @@ extern TransactionId *ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb);
 
 extern void ReorderBufferSetRestartPoint(ReorderBuffer *rb, XLogRecPtr ptr);
 
+extern ReorderBufferTXN *ReorderBufferTXNByXid(ReorderBuffer *rb,
+											   TransactionId xid, bool create,
+											   bool *is_new, XLogRecPtr lsn,
+											   bool create_as_top);
+
 extern void StartupReorderBuffer(void);
 
 #endif
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 83120f1cb6..d6dbeebed5 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -477,6 +477,273 @@ $result =
 is( $result, qq(2|f
 3|t), 'check replicated update on subscriber');
 
+# Clean up
+$node_publisher->safe_psql('postgres', "DROP PUBLICATION pub1");
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION sub1");
+
+# The bug was that the incremental data synchronization was being skipped when
+# a new table is added to the publication in presence of a concurrent active
+# transaction performing the DML on the same table.
+
+# Initial setup.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE SCHEMA sch3;
+	CREATE TABLE sch3.tab_conc(a int);
+	CREATE PUBLICATION regress_pub1;
+));
+
+$node_subscriber->safe_psql(
+	'postgres', qq(
+	CREATE TABLE tab_conc(a int);
+	CREATE SCHEMA sch3;
+	CREATE TABLE sch3.tab_conc(a int);
+	CREATE SUBSCRIPTION regress_sub1 CONNECTION '$publisher_connstr' PUBLICATION regress_pub1;
+));
+
+# Bump the query timeout to avoid false negatives on slow test systems.
+my $psql_timeout_secs = 4 * $PostgreSQL::Test::Utils::timeout_default;
+
+# Initiate 3 background sessions.
+my $background_psql1 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+$background_psql1->set_query_timer_restart();
+
+my $background_psql2 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+
+$background_psql2->set_query_timer_restart();
+
+my $background_psql3 = $node_publisher->background_psql(
+	'postgres',
+	on_error_stop => 0,
+	timeout => $psql_timeout_secs);
+$background_psql3->set_query_timer_restart();
+
+# Maintain an active transaction with the table that will be added to the
+# publication.
+$background_psql1->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO tab_conc VALUES (1);
+));
+
+# Maintain an active transaction with a schema table that will be added to the
+# publication.
+$background_psql2->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (1);
+));
+
+# Add the table to the publication using background_psql, as the alter
+# publication operation will distribute the invalidations to inprogress txns.
+$background_psql3->query_safe(
+	"ALTER PUBLICATION regress_pub1 ADD TABLE tab_conc, TABLES IN SCHEMA sch3"
+);
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe("COMMIT");
+$background_psql2->query_safe("COMMIT");
+
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (2);
+	INSERT INTO sch3.tab_conc VALUES (2);
+));
+
+# Refresh the publication.
+$node_subscriber->safe_psql('postgres',
+	"ALTER SUBSCRIPTION regress_sub1 REFRESH PUBLICATION");
+
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'regress_sub1');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2),
+	'Ensure that the data from the tab_conc table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2),
+	'Ensure that the data from the sch3.tab_conc table is synchronized to the subscriber after the subscription is refreshed'
+);
+
+# Perform an insert.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (3);
+	INSERT INTO sch3.tab_conc VALUES (3);
+));
+$node_publisher->wait_for_catchup('regress_sub1');
+
+# Verify that the insert is replicated to the subscriber.
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2
+3),
+	'Verify that the incremental data for table sch3.tab_conc added after table synchronization is replicated to the subscriber'
+);
+
+# The bug was that the incremental data synchronization was happening even when
+# tables are dropped from the publication in presence of a concurrent active
+# transaction performing the DML on the same table.
+
+# Maintain an active transaction with the table that will be dropped from the
+# publication.
+$background_psql1->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO tab_conc VALUES (4);
+));
+
+# Maintain an active transaction with a schema table that will be dropped from the
+# publication.
+$background_psql2->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (4);
+));
+
+# Drop the table from the publication using background_psql, as the alter
+# publication operation will distribute the invalidations to inprogress txns.
+$background_psql3->query_safe(
+	"ALTER PUBLICATION regress_pub1 DROP TABLE tab_conc, TABLES IN SCHEMA sch3"
+);
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe("COMMIT");
+$background_psql2->query_safe("COMMIT");
+
+# Perform an insert.
+$node_publisher->safe_psql(
+	'postgres', qq(
+	INSERT INTO tab_conc VALUES (5);
+	INSERT INTO sch3.tab_conc VALUES (5);
+));
+
+$node_publisher->wait_for_catchup('regress_sub1');
+
+# Verify that the insert is not replicated to the subscriber.
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_conc");
+is( $result, qq(1
+2
+3
+4),
+	'Verify that data for table tab_conc are not replicated to subscriber');
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT * FROM sch3.tab_conc");
+is( $result, qq(1
+2
+3
+4),
+	'Verify that the incremental data for table sch3.tab_conc are not replicated to subscriber'
+);
+
+# The bug was that the incremental data synchronization was happening even after
+# publication is dropped in a concurrent active transaction.
+
+# Add tables to the publication.
+$background_psql3->query_safe(
+	"ALTER PUBLICATION regress_pub1 ADD TABLE tab_conc, TABLES IN SCHEMA sch3"
+);
+
+# Maintain an active transaction with the table.
+$background_psql1->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO tab_conc VALUES (6);
+));
+
+# Maintain an active transaction with a schema table.
+$background_psql2->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (6);
+));
+
+# Drop publication.
+$background_psql3->query_safe("DROP PUBLICATION regress_pub1");
+
+# Perform an insert.
+$background_psql1->query_safe("INSERT INTO tab_conc VALUES (7)");
+$background_psql2->query_safe("INSERT INTO sch3.tab_conc VALUES (7)");
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe("COMMIT");
+$background_psql2->query_safe("COMMIT");
+
+# ERROR should appear on subscriber.
+my $offset = -s $node_subscriber->logfile;
+$node_subscriber->wait_for_log(
+	qr/ERROR:  publication "regress_pub1" does not exist/, $offset);
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION regress_sub1");
+
+# The bug was that the incremental data synchronization was happening even after
+# publication is renamed in a concurrent active transaction.
+
+# Create publication.
+$background_psql3->query_safe(
+	"CREATE PUBLICATION regress_pub1 FOR TABLE tab_conc, TABLES IN SCHEMA sch3"
+);
+
+# Create subscription.
+$node_subscriber->safe_psql('postgres',
+	"CREATE SUBSCRIPTION regress_sub1 CONNECTION '$publisher_connstr' PUBLICATION regress_pub1"
+);
+
+# Maintain an active transaction with the table.
+$background_psql1->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO tab_conc VALUES (8);
+));
+
+# Maintain an active transaction with a schema table.
+$background_psql2->query_safe(
+	qq(
+	BEGIN;
+	INSERT INTO sch3.tab_conc VALUES (8);
+));
+
+# Rename publication.
+$background_psql3->query_safe(
+	"ALTER PUBLICATION regress_pub1 RENAME TO regress_pub1_rename");
+
+# Perform an insert.
+$background_psql1->query_safe("INSERT INTO tab_conc VALUES (9)");
+$background_psql2->query_safe("INSERT INTO sch3.tab_conc VALUES (9)");
+
+# Complete the transaction on the tables.
+$background_psql1->query_safe("COMMIT");
+$background_psql2->query_safe("COMMIT");
+
+# ERROR should appear on subscriber.
+$offset = -s $node_subscriber->logfile;
+$node_subscriber->wait_for_log(
+	qr/ERROR:  publication "regress_pub1" does not exist/, $offset);
+
+$background_psql1->quit;
+$background_psql2->quit;
+$background_psql3->quit;
+
 $node_publisher->stop('fast');
 $node_subscriber->stop('fast');
 
-- 
2.43.5

#92Amit Kapila
amit.kapila16@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#91)
1 attachment(s)
Re: long-standing data loss bug in initial sync of logical replication

On Mon, Mar 17, 2025 at 6:53 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

> OK, let me share patches for the back branches. Mostly the same fix patch as
> for master can be used for PG14-PG17, as attached.

A few comments:
==============
1.
+SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr
lsn, TransactionId xid)
{
dlist_iter txn_i;
ReorderBufferTXN *txn;
+ ReorderBufferTXN *curr_txn;
+
+ curr_txn = ReorderBufferTXNByXid(builder->reorder, xid, false, NULL,
InvalidXLogRecPtr, false);

The above is used to access invalidations from curr_txn. I am thinking
about whether it would be better to expose a new function to get
invalidations for a txn based on xid instead of getting
ReorderBufferTXN. It would avoid any layering violation and misuse of
ReorderBufferTXN by other modules.
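
For illustration, the kind of accessor being suggested could look roughly like
the sketch below (only a sketch of the idea; the v19 patches later in this
thread add essentially this function to reorderbuffer.c, with a prototype in
reorderbuffer.h):

/* Sketch of the suggested accessor; the v19 patches add basically this. */
uint32
ReorderBufferGetInvalidations(ReorderBuffer *rb, TransactionId xid,
			      SharedInvalidationMessage **msgs)
{
	ReorderBufferTXN *txn;

	/* Look up the transaction; do not create it if it is unknown. */
	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
				    false);

	if (txn == NULL)
		return 0;

	/* Hand out only the invalidation messages, not the TXN itself. */
	*msgs = txn->invalidations;

	return txn->ninvalidations;
}

With such a wrapper, snapbuild.c only needs the xid it already has, and
ReorderBufferTXNByXid() can stay private to reorderbuffer.c.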

2. The patch has a lot of tests to verify the same points. Can't we
have one simple test using SQL API based on what Andres presented in
an email [1]?
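
The kind of single-scenario test that suggests could be as small as the
following sketch using the SQL-level decoding functions (names here are
arbitrary, and this mirrors the isolation spec that the v19 patches later in
this thread add):

-- Rough sketch of a minimal SQL-API reproduction; setup, once
SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'pgoutput');
CREATE TABLE tbl1(val1 integer, val2 integer);
CREATE PUBLICATION pub;

-- session 1: commit one insert, then leave a second one in an open transaction
INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
BEGIN;
INSERT INTO tbl1 (val1, val2) VALUES (1, 1);

-- session 2: add the table to the publication while session 1 is still open
ALTER PUBLICATION pub ADD TABLE tbl1;

-- session 1: commit, then insert once more
COMMIT;
INSERT INTO tbl1 (val1, val2) VALUES (1, 1);

-- either session: count decoded insert ('I' = 73) messages; with the fix the
-- expectation is exactly one, i.e. the insert made after the ALTER is decoded
-- (the back-branch specs use proto_version '2' on PG14 and '3' on PG15)
SELECT count(data)
FROM pg_logical_slot_get_binary_changes('isolation_slot', NULL, NULL,
     'proto_version', '2', 'publication_names', 'pub')
WHERE get_byte(data, 0) = 73;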

3. I have made minor changes in the comments in the attached.

[1]: /messages/by-id/20231119021830.d6t6aaxtrkpn743y@awork3.anarazel.de

--
With Regards,
Amit Kapila.

Attachments:

v18-0001-amit.diff.txt
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 16acb506141..fd1a3e75b29 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -739,7 +739,8 @@ SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, Transact
 	/*
 	 * Iterate through all toplevel transactions. This can include
 	 * subtransactions which we just don't yet know to be that, but that's
-	 * fine, they will just get an unnecessary snapshot queued.
+	 * fine, they will just get an unnecessary snapshot and invalidations
+	 * queued.
 	 */
 	dlist_foreach(txn_i, &builder->reorder->toplevel_by_lsn)
 	{
@@ -787,12 +788,12 @@ SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, Transact
 								 builder->snapshot);
 
 		/*
-		 * Add invalidation messages to the reorder buffer of inprogress
+		 * Add invalidation messages to the reorder buffer of in-progress
 		 * transactions except the current committed transaction, for which we
 		 * will execute invalidations at the end.
 		 *
 		 * It is required, otherwise, we will end up using the stale catcache
-		 * contents built by the current transaction even after its decoding
+		 * contents built by the current transaction even after its decoding,
 		 * which should have been invalidated due to concurrent catalog
 		 * changing transaction.
 		 */
@@ -1071,8 +1072,8 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		SnapBuildSnapIncRefcount(builder->snapshot);
 
 		/*
-		 * add a new catalog snapshot and invalidations messages to all
-		 * currently running transactions
+		 * Add a new catalog snapshot and invalidations messages to all
+		 * currently running transactions.
 		 */
 		SnapBuildDistributeSnapshotAndInval(builder, lsn, xid);
 	}
#93Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Amit Kapila (#92)
5 attachment(s)
RE: long-standing data loss bug in initial sync of logical replication

Dear Amit,

> A few comments:
> ==============
> 1.
> +SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr
> lsn, TransactionId xid)
> {
> dlist_iter txn_i;
> ReorderBufferTXN *txn;
> + ReorderBufferTXN *curr_txn;
> +
> + curr_txn = ReorderBufferTXNByXid(builder->reorder, xid, false, NULL,
> InvalidXLogRecPtr, false);
>
> The above is used to access invalidations from curr_txn. I am thinking
> about whether it would be better to expose a new function to get
> invalidations for a txn based on xid instead of getting
> ReorderBufferTXN. It would avoid any layering violation and misuse of
> ReorderBufferTXN by other modules.

Sounds reasonable. I introduced a new function, ReorderBufferGetInvalidations(),
which returns the number of invalidation messages and a pointer to their list.
ReorderBufferTXNByXid() is no longer exported.

> 2. The patch has a lot of tests to verify the same points. Can't we
> have one simple test using SQL API based on what Andres presented in
> an email [1]?

You mean that we need to test only the case reported by Andres, right? The new
version does that. To make the test run faster, it was migrated to the
isolation tester instead of the TAP test.

> 3. I have made minor changes in the comments in the attached.

Thanks, I have included them.

PSA the new versions for PG14 through master. Special thanks to Hou for
minimizing the test code.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

v19_REL_14-0001-Distribute-invalidatons-if-change-in-cata.patch
From 5e963a027076cfd139a43401d7262821523db0fb Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Fri, 23 Aug 2024 14:02:20 +0530
Subject: [PATCH v19_REL_14] Distribute invalidations if change in catalog
 tables

Distribute invalidations to in-progress transactions if the current
committed transaction changes any catalog table.
---
 contrib/test_decoding/Makefile                |  3 +-
 .../expected/invalidation_distrubution.out    | 20 ++++++
 .../specs/invalidation_distrubution.spec      | 32 +++++++++
 .../replication/logical/reorderbuffer.c       | 23 +++++++
 src/backend/replication/logical/snapbuild.c   | 67 +++++++++++++++----
 src/include/replication/reorderbuffer.h       |  4 ++
 6 files changed, 134 insertions(+), 15 deletions(-)
 create mode 100644 contrib/test_decoding/expected/invalidation_distrubution.out
 create mode 100644 contrib/test_decoding/specs/invalidation_distrubution.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 82ba3f7df11..8b0b8cc3acf 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -8,7 +8,8 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
-	twophase_snapshot catalog_change_snapshot skip_snapshot_restore
+	twophase_snapshot catalog_change_snapshot skip_snapshot_restore \
+	invalidation_distrubution
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/invalidation_distrubution.out b/contrib/test_decoding/expected/invalidation_distrubution.out
new file mode 100644
index 00000000000..c701e290bb9
--- /dev/null
+++ b/contrib/test_decoding/expected/invalidation_distrubution.out
@@ -0,0 +1,20 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1_insert_tbl1 s1_begin s1_insert_tbl1 s2_alter_pub_add_tbl s1_commit s1_insert_tbl1 s2_get_binary_changes
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s1_begin: BEGIN;
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s2_alter_pub_add_tbl: ALTER PUBLICATION pub ADD TABLE tbl1;
+step s1_commit: COMMIT;
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s2_get_binary_changes: SELECT count(data) FROM pg_logical_slot_get_binary_changes('isolation_slot', NULL, NULL, 'proto_version', '2', 'publication_names', 'pub') WHERE get_byte(data, 0) = 73;
+count
+-----
+    1
+(1 row)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/invalidation_distrubution.spec b/contrib/test_decoding/specs/invalidation_distrubution.spec
new file mode 100644
index 00000000000..b8b14e333a1
--- /dev/null
+++ b/contrib/test_decoding/specs/invalidation_distrubution.spec
@@ -0,0 +1,32 @@
+# Test that catalog cache invalidation messages are distributed to ongoing
+# transactions, ensuring they can access the updated catalog content after
+# processing these messages.
+setup
+{
+    SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'pgoutput');
+    CREATE TABLE tbl1(val1 integer, val2 integer);
+    CREATE PUBLICATION pub;
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    DROP PUBLICATION pub;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s1"
+setup { SET synchronous_commit=on; }
+
+step "s1_begin" { BEGIN; }
+step "s1_insert_tbl1" { INSERT INTO tbl1 (val1, val2) VALUES (1, 1); }
+step "s1_commit" { COMMIT; }
+
+session "s2"
+setup { SET synchronous_commit=on; }
+
+step "s2_alter_pub_add_tbl" { ALTER PUBLICATION pub ADD TABLE tbl1; }
+step "s2_get_binary_changes" { SELECT count(data) FROM pg_logical_slot_get_binary_changes('isolation_slot', NULL, NULL, 'proto_version', '2', 'publication_names', 'pub') WHERE get_byte(data, 0) = 73; }
+
+# Expect to get one insert change. LOGICAL_REP_MSG_INSERT = 'I'
+permutation "s1_insert_tbl1" "s1_begin" "s1_insert_tbl1" "s2_alter_pub_add_tbl" "s1_commit" "s1_insert_tbl1" "s2_get_binary_changes"
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 64d9baa7982..a52b51a5e7d 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -5196,3 +5196,26 @@ restart:
 		*cmax = ent->cmax;
 	return true;
 }
+
+/*
+ * Count invalidation messages of specified transaction.
+ *
+ * Returns number of messages, and msgs is set to the pointer of the linked
+ * list for the messages.
+ */
+uint32
+ReorderBufferGetInvalidations(ReorderBuffer *rb, TransactionId xid,
+							  SharedInvalidationMessage **msgs)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	if (txn == NULL)
+		return 0;
+
+	*msgs = txn->invalidations;
+
+	return txn->ninvalidations;
+}
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 9d8700147ca..8abd669c51e 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -290,7 +290,7 @@ static void SnapBuildFreeSnapshot(Snapshot snap);
 
 static void SnapBuildSnapIncRefcount(Snapshot snap);
 
-static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
+static void SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid);
 
 /* xlog reading helper functions for SnapBuildProcessRunningXacts */
 static bool SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *running);
@@ -843,15 +843,15 @@ SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
 }
 
 /*
- * Add a new Snapshot to all transactions we're decoding that currently are
- * in-progress so they can see new catalog contents made by the transaction
- * that just committed. This is necessary because those in-progress
- * transactions will use the new catalog's contents from here on (at the very
- * least everything they do needs to be compatible with newer catalog
- * contents).
+ * Add a new Snapshot and invalidation messages to all transactions we're
+ * decoding that currently are in-progress so they can see new catalog contents
+ * made by the transaction that just committed. This is necessary because those
+ * in-progress transactions will use the new catalog's contents from here on
+ * (at the very least everything they do needs to be compatible with newer
+ * catalog contents).
  */
 static void
-SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
+SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid)
 {
 	dlist_iter	txn_i;
 	ReorderBufferTXN *txn;
@@ -859,7 +859,8 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 	/*
 	 * Iterate through all toplevel transactions. This can include
 	 * subtransactions which we just don't yet know to be that, but that's
-	 * fine, they will just get an unnecessary snapshot queued.
+	 * fine, they will just get an unnecessary snapshot and invalidations
+	 * queued.
 	 */
 	dlist_foreach(txn_i, &builder->reorder->toplevel_by_lsn)
 	{
@@ -872,6 +873,14 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		 * transaction which in turn implies we don't yet need a snapshot at
 		 * all. We'll add a snapshot when the first change gets queued.
 		 *
+		 * Similarly, we don't need to add invalidations to a transaction whose
+		 * base snapshot is not yet set. Once a base snapshot is built, it will
+		 * include the xids of committed transactions that have modified the
+		 * catalog, thus reflecting the new catalog contents. The existing
+		 * catalog cache will have already been invalidated after processing
+		 * the invalidations in the transaction that modified catalogs,
+		 * ensuring that a fresh cache is constructed during decoding.
+		 *
 		 * NB: This works correctly even for subtransactions because
 		 * ReorderBufferAssignChild() takes care to transfer the base snapshot
 		 * to the top-level transaction, and while iterating the changequeue
@@ -881,13 +890,13 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 			continue;
 
 		/*
-		 * We don't need to add snapshot to prepared transactions as they
-		 * should not see the new catalog contents.
+		 * We don't need to add snapshot or invalidations to prepared
+		 * transactions as they should not see the new catalog contents.
 		 */
 		if (rbtxn_prepared(txn) || rbtxn_skip_prepared(txn))
 			continue;
 
-		elog(DEBUG2, "adding a new snapshot to %u at %X/%X",
+		elog(DEBUG2, "adding a new snapshot and invalidations to %u at %X/%X",
 			 txn->xid, LSN_FORMAT_ARGS(lsn));
 
 		/*
@@ -897,6 +906,33 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		SnapBuildSnapIncRefcount(builder->snapshot);
 		ReorderBufferAddSnapshot(builder->reorder, txn->xid, lsn,
 								 builder->snapshot);
+
+		/*
+		 * Add invalidation messages to the reorder buffer of in-progress
+		 * transactions except the current committed transaction, for which we
+		 * will execute invalidations at the end.
+		 *
+		 * It is required, otherwise, we will end up using the stale catcache
+		 * contents built by the current transaction even after its decoding,
+		 * which should have been invalidated due to concurrent catalog
+		 * changing transaction.
+		 */
+		if (txn->xid != xid)
+		{
+			uint32 ninvalidations;
+			SharedInvalidationMessage *msgs = NULL;
+
+			ninvalidations = ReorderBufferGetInvalidations(builder->reorder,
+														   xid, &msgs);
+
+			if (ninvalidations > 0)
+			{
+				Assert(msgs != NULL);
+
+				ReorderBufferAddInvalidations(builder->reorder, txn->xid, lsn,
+											  ninvalidations, msgs);
+			}
+		}
 	}
 }
 
@@ -1175,8 +1211,11 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/* refcount of the snapshot builder for the new snapshot */
 		SnapBuildSnapIncRefcount(builder->snapshot);
 
-		/* add a new catalog snapshot to all currently running transactions */
-		SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
+		/*
+		 * Add a new catalog snapshot and invalidations messages to all
+		 * currently running transactions.
+		 */
+		SnapBuildDistributeSnapshotAndInval(builder, lsn, xid);
 	}
 }
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index ba257d81b51..415405e7cca 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -676,6 +676,10 @@ TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
 void		ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
 
+uint32		ReorderBufferGetInvalidations(ReorderBuffer *rb,
+										  TransactionId xid,
+										  SharedInvalidationMessage **msgs);
+	
 void		StartupReorderBuffer(void);
 
 #endif
-- 
2.43.5

v19_REL_15-0001-Distribute-invalidatons-if-change-in-cata.patch
From efd0ca3ad80243a0d704d9eb079483ecf9f707ea Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Fri, 23 Aug 2024 14:02:20 +0530
Subject: [PATCH v19_REL_15] Distribute invalidations if change in catalog
 tables

Distribute invalidations to in-progress transactions if the current
committed transaction changes any catalog table.
---
 contrib/test_decoding/Makefile                |  2 +-
 .../expected/invalidation_distrubution.out    | 20 ++++++
 .../specs/invalidation_distrubution.spec      | 32 +++++++++
 .../replication/logical/reorderbuffer.c       | 23 +++++++
 src/backend/replication/logical/snapbuild.c   | 67 +++++++++++++++----
 src/include/replication/reorderbuffer.h       |  4 ++
 6 files changed, 133 insertions(+), 15 deletions(-)
 create mode 100644 contrib/test_decoding/expected/invalidation_distrubution.out
 create mode 100644 contrib/test_decoding/specs/invalidation_distrubution.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index a4ba1a509ae..eef70770674 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -9,7 +9,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
 	twophase_snapshot slot_creation_error catalog_change_snapshot \
-	skip_snapshot_restore
+	skip_snapshot_restore invalidation_distrubution
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/invalidation_distrubution.out b/contrib/test_decoding/expected/invalidation_distrubution.out
new file mode 100644
index 00000000000..24190ebe570
--- /dev/null
+++ b/contrib/test_decoding/expected/invalidation_distrubution.out
@@ -0,0 +1,20 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1_insert_tbl1 s1_begin s1_insert_tbl1 s2_alter_pub_add_tbl s1_commit s1_insert_tbl1 s2_get_binary_changes
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s1_begin: BEGIN;
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s2_alter_pub_add_tbl: ALTER PUBLICATION pub ADD TABLE tbl1;
+step s1_commit: COMMIT;
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s2_get_binary_changes: SELECT count(data) FROM pg_logical_slot_get_binary_changes('isolation_slot', NULL, NULL, 'proto_version', '3', 'publication_names', 'pub') WHERE get_byte(data, 0) = 73;
+count
+-----
+    1
+(1 row)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/invalidation_distrubution.spec b/contrib/test_decoding/specs/invalidation_distrubution.spec
new file mode 100644
index 00000000000..f63aba3ce96
--- /dev/null
+++ b/contrib/test_decoding/specs/invalidation_distrubution.spec
@@ -0,0 +1,32 @@
+# Test that catalog cache invalidation messages are distributed to ongoing
+# transactions, ensuring they can access the updated catalog content after
+# processing these messages.
+setup
+{
+    SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'pgoutput');
+    CREATE TABLE tbl1(val1 integer, val2 integer);
+    CREATE PUBLICATION pub;
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    DROP PUBLICATION pub;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s1"
+setup { SET synchronous_commit=on; }
+
+step "s1_begin" { BEGIN; }
+step "s1_insert_tbl1" { INSERT INTO tbl1 (val1, val2) VALUES (1, 1); }
+step "s1_commit" { COMMIT; }
+
+session "s2"
+setup { SET synchronous_commit=on; }
+
+step "s2_alter_pub_add_tbl" { ALTER PUBLICATION pub ADD TABLE tbl1; }
+step "s2_get_binary_changes" { SELECT count(data) FROM pg_logical_slot_get_binary_changes('isolation_slot', NULL, NULL, 'proto_version', '3', 'publication_names', 'pub') WHERE get_byte(data, 0) = 73; }
+
+# Expect to get one insert change. LOGICAL_REP_MSG_INSERT = 'I'
+permutation "s1_insert_tbl1" "s1_begin" "s1_insert_tbl1" "s2_alter_pub_add_tbl" "s1_commit" "s1_insert_tbl1" "s2_get_binary_changes"
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index d5cde20a1c9..835180e12f6 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -5200,3 +5200,26 @@ restart:
 		*cmax = ent->cmax;
 	return true;
 }
+
+/*
+ * Count invalidation messages of specified transaction.
+ *
+ * Returns number of messages, and msgs is set to the pointer of the linked
+ * list for the messages.
+ */
+uint32
+ReorderBufferGetInvalidations(ReorderBuffer *rb, TransactionId xid,
+							  SharedInvalidationMessage **msgs)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	if (txn == NULL)
+		return 0;
+
+	*msgs = txn->invalidations;
+
+	return txn->ninvalidations;
+}
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index cc1f2a9f154..0b303f9a235 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -290,7 +290,7 @@ static void SnapBuildFreeSnapshot(Snapshot snap);
 
 static void SnapBuildSnapIncRefcount(Snapshot snap);
 
-static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
+static void SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid);
 
 /* xlog reading helper functions for SnapBuildProcessRunningXacts */
 static bool SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *running);
@@ -852,15 +852,15 @@ SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
 }
 
 /*
- * Add a new Snapshot to all transactions we're decoding that currently are
- * in-progress so they can see new catalog contents made by the transaction
- * that just committed. This is necessary because those in-progress
- * transactions will use the new catalog's contents from here on (at the very
- * least everything they do needs to be compatible with newer catalog
- * contents).
+ * Add a new Snapshot and invalidation messages to all transactions we're
+ * decoding that currently are in-progress so they can see new catalog contents
+ * made by the transaction that just committed. This is necessary because those
+ * in-progress transactions will use the new catalog's contents from here on
+ * (at the very least everything they do needs to be compatible with newer
+ * catalog contents).
  */
 static void
-SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
+SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid)
 {
 	dlist_iter	txn_i;
 	ReorderBufferTXN *txn;
@@ -868,7 +868,8 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 	/*
 	 * Iterate through all toplevel transactions. This can include
 	 * subtransactions which we just don't yet know to be that, but that's
-	 * fine, they will just get an unnecessary snapshot queued.
+	 * fine, they will just get an unnecessary snapshot and invalidations
+	 * queued.
 	 */
 	dlist_foreach(txn_i, &builder->reorder->toplevel_by_lsn)
 	{
@@ -881,6 +882,14 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		 * transaction which in turn implies we don't yet need a snapshot at
 		 * all. We'll add a snapshot when the first change gets queued.
 		 *
+		 * Similarly, we don't need to add invalidations to a transaction whose
+		 * base snapshot is not yet set. Once a base snapshot is built, it will
+		 * include the xids of committed transactions that have modified the
+		 * catalog, thus reflecting the new catalog contents. The existing
+		 * catalog cache will have already been invalidated after processing
+		 * the invalidations in the transaction that modified catalogs,
+		 * ensuring that a fresh cache is constructed during decoding.
+		 *
 		 * NB: This works correctly even for subtransactions because
 		 * ReorderBufferAssignChild() takes care to transfer the base snapshot
 		 * to the top-level transaction, and while iterating the changequeue
@@ -890,13 +899,13 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 			continue;
 
 		/*
-		 * We don't need to add snapshot to prepared transactions as they
-		 * should not see the new catalog contents.
+		 * We don't need to add snapshot or invalidations to prepared
+		 * transactions as they should not see the new catalog contents.
 		 */
 		if (rbtxn_prepared(txn) || rbtxn_skip_prepared(txn))
 			continue;
 
-		elog(DEBUG2, "adding a new snapshot to %u at %X/%X",
+		elog(DEBUG2, "adding a new snapshot and invalidations to %u at %X/%X",
 			 txn->xid, LSN_FORMAT_ARGS(lsn));
 
 		/*
@@ -906,6 +915,33 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		SnapBuildSnapIncRefcount(builder->snapshot);
 		ReorderBufferAddSnapshot(builder->reorder, txn->xid, lsn,
 								 builder->snapshot);
+
+		/*
+		 * Add invalidation messages to the reorder buffer of in-progress
+		 * transactions except the current committed transaction, for which we
+		 * will execute invalidations at the end.
+		 *
+		 * It is required, otherwise, we will end up using the stale catcache
+		 * contents built by the current transaction even after its decoding,
+		 * which should have been invalidated due to concurrent catalog
+		 * changing transaction.
+		 */
+		if (txn->xid != xid)
+		{
+			uint32 ninvalidations;
+			SharedInvalidationMessage *msgs = NULL;
+
+			ninvalidations = ReorderBufferGetInvalidations(builder->reorder,
+														   xid, &msgs);
+
+			if (ninvalidations > 0)
+			{
+				Assert(msgs != NULL);
+
+				ReorderBufferAddInvalidations(builder->reorder, txn->xid, lsn,
+											  ninvalidations, msgs);
+			}
+		}
 	}
 }
 
@@ -1184,8 +1220,11 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/* refcount of the snapshot builder for the new snapshot */
 		SnapBuildSnapIncRefcount(builder->snapshot);
 
-		/* add a new catalog snapshot to all currently running transactions */
-		SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
+		/*
+		 * Add a new catalog snapshot and invalidations messages to all
+		 * currently running transactions.
+		 */
+		SnapBuildDistributeSnapshotAndInval(builder, lsn, xid);
 	}
 }
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 4a01f877e5d..402bb7a2728 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -680,6 +680,10 @@ extern TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
 extern void ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
 
+extern uint32 ReorderBufferGetInvalidations(ReorderBuffer *rb,
+											TransactionId xid,
+											SharedInvalidationMessage **msgs);
+
 extern void StartupReorderBuffer(void);
 
 #endif
-- 
2.43.5

v19-0001-Distribute-invalidatons-if-change-in-catalog-tab.patch (application/octet-stream)
From 2f36f7fed1a632b7a8de03cd6683724374b524ed Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Fri, 23 Aug 2024 14:02:20 +0530
Subject: [PATCH v19] Distribute invalidatons if change in catalog tables

Distribute invalidations to inprogress transactions if the current
committed transaction change any catalog table.
---
 contrib/test_decoding/Makefile                |  2 +-
 .../expected/invalidation_distrubution.out    | 20 ++++++
 contrib/test_decoding/meson.build             |  1 +
 .../specs/invalidation_distrubution.spec      | 32 +++++++++
 .../replication/logical/reorderbuffer.c       | 23 +++++++
 src/backend/replication/logical/snapbuild.c   | 67 +++++++++++++++----
 src/include/replication/reorderbuffer.h       |  4 ++
 7 files changed, 134 insertions(+), 15 deletions(-)
 create mode 100644 contrib/test_decoding/expected/invalidation_distrubution.out
 create mode 100644 contrib/test_decoding/specs/invalidation_distrubution.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index a4ba1a509ae..eef70770674 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -9,7 +9,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
 	twophase_snapshot slot_creation_error catalog_change_snapshot \
-	skip_snapshot_restore
+	skip_snapshot_restore invalidation_distrubution
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/invalidation_distrubution.out b/contrib/test_decoding/expected/invalidation_distrubution.out
new file mode 100644
index 00000000000..ad0a944cbf3
--- /dev/null
+++ b/contrib/test_decoding/expected/invalidation_distrubution.out
@@ -0,0 +1,20 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1_insert_tbl1 s1_begin s1_insert_tbl1 s2_alter_pub_add_tbl s1_commit s1_insert_tbl1 s2_get_binary_changes
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s1_begin: BEGIN;
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s2_alter_pub_add_tbl: ALTER PUBLICATION pub ADD TABLE tbl1;
+step s1_commit: COMMIT;
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s2_get_binary_changes: SELECT count(data) FROM pg_logical_slot_get_binary_changes('isolation_slot', NULL, NULL, 'proto_version', '4', 'publication_names', 'pub') WHERE get_byte(data, 0) = 73;
+count
+-----
+    1
+(1 row)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/meson.build b/contrib/test_decoding/meson.build
index 54d65d3f30f..d5f03fd5e9b 100644
--- a/contrib/test_decoding/meson.build
+++ b/contrib/test_decoding/meson.build
@@ -63,6 +63,7 @@ tests += {
       'twophase_snapshot',
       'slot_creation_error',
       'skip_snapshot_restore',
+      'invalidation_distrubution',
     ],
     'regress_args': [
       '--temp-config', files('logical.conf'),
diff --git a/contrib/test_decoding/specs/invalidation_distrubution.spec b/contrib/test_decoding/specs/invalidation_distrubution.spec
new file mode 100644
index 00000000000..decbed627e3
--- /dev/null
+++ b/contrib/test_decoding/specs/invalidation_distrubution.spec
@@ -0,0 +1,32 @@
+# Test that catalog cache invalidation messages are distributed to ongoing
+# transactions, ensuring they can access the updated catalog content after
+# processing these messages.
+setup
+{
+    SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'pgoutput');
+    CREATE TABLE tbl1(val1 integer, val2 integer);
+    CREATE PUBLICATION pub;
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    DROP PUBLICATION pub;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s1"
+setup { SET synchronous_commit=on; }
+
+step "s1_begin" { BEGIN; }
+step "s1_insert_tbl1" { INSERT INTO tbl1 (val1, val2) VALUES (1, 1); }
+step "s1_commit" { COMMIT; }
+
+session "s2"
+setup { SET synchronous_commit=on; }
+
+step "s2_alter_pub_add_tbl" { ALTER PUBLICATION pub ADD TABLE tbl1; }
+step "s2_get_binary_changes" { SELECT count(data) FROM pg_logical_slot_get_binary_changes('isolation_slot', NULL, NULL, 'proto_version', '4', 'publication_names', 'pub') WHERE get_byte(data, 0) = 73; }
+
+# Expect to get one insert change. LOGICAL_REP_MSG_INSERT = 'I'
+permutation "s1_insert_tbl1" "s1_begin" "s1_insert_tbl1" "s2_alter_pub_add_tbl" "s1_commit" "s1_insert_tbl1" "s2_get_binary_changes"
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 977fbcd2474..67655111875 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -5460,3 +5460,26 @@ restart:
 		*cmax = ent->cmax;
 	return true;
 }
+
+/*
+ * Count invalidation messages of specified transaction.
+ *
+ * Returns number of messages, and msgs is set to the pointer of the linked
+ * list for the messages.
+ */
+uint32
+ReorderBufferGetInvalidations(ReorderBuffer *rb, TransactionId xid,
+							  SharedInvalidationMessage **msgs)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	if (txn == NULL)
+		return 0;
+
+	*msgs = txn->invalidations;
+
+	return txn->ninvalidations;
+}
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index b64e53de017..97188d2e747 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -161,7 +161,7 @@ static void SnapBuildFreeSnapshot(Snapshot snap);
 
 static void SnapBuildSnapIncRefcount(Snapshot snap);
 
-static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
+static void SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid);
 
 static inline bool SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid,
 												 uint32 xinfo);
@@ -720,15 +720,15 @@ SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
 }
 
 /*
- * Add a new Snapshot to all transactions we're decoding that currently are
- * in-progress so they can see new catalog contents made by the transaction
- * that just committed. This is necessary because those in-progress
- * transactions will use the new catalog's contents from here on (at the very
- * least everything they do needs to be compatible with newer catalog
- * contents).
+ * Add a new Snapshot and invalidation messages to all transactions we're
+ * decoding that currently are in-progress so they can see new catalog contents
+ * made by the transaction that just committed. This is necessary because those
+ * in-progress transactions will use the new catalog's contents from here on
+ * (at the very least everything they do needs to be compatible with newer
+ * catalog contents).
  */
 static void
-SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
+SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid)
 {
 	dlist_iter	txn_i;
 	ReorderBufferTXN *txn;
@@ -736,7 +736,8 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 	/*
 	 * Iterate through all toplevel transactions. This can include
 	 * subtransactions which we just don't yet know to be that, but that's
-	 * fine, they will just get an unnecessary snapshot queued.
+	 * fine, they will just get an unnecessary snapshot and invalidations
+	 * queued.
 	 */
 	dlist_foreach(txn_i, &builder->reorder->toplevel_by_lsn)
 	{
@@ -749,6 +750,14 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		 * transaction which in turn implies we don't yet need a snapshot at
 		 * all. We'll add a snapshot when the first change gets queued.
 		 *
+		 * Similarly, we don't need to add invalidations to a transaction whose
+		 * base snapshot is not yet set. Once a base snapshot is built, it will
+		 * include the xids of committed transactions that have modified the
+		 * catalog, thus reflecting the new catalog contents. The existing
+		 * catalog cache will have already been invalidated after processing
+		 * the invalidations in the transaction that modified catalogs,
+		 * ensuring that a fresh cache is constructed during decoding.
+		 *
 		 * NB: This works correctly even for subtransactions because
 		 * ReorderBufferAssignChild() takes care to transfer the base snapshot
 		 * to the top-level transaction, and while iterating the changequeue
@@ -758,13 +767,13 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 			continue;
 
 		/*
-		 * We don't need to add snapshot to prepared transactions as they
-		 * should not see the new catalog contents.
+		 * We don't need to add snapshot or invalidations to prepared
+		 * transactions as they should not see the new catalog contents.
 		 */
 		if (rbtxn_is_prepared(txn))
 			continue;
 
-		elog(DEBUG2, "adding a new snapshot to %u at %X/%X",
+		elog(DEBUG2, "adding a new snapshot and invalidations to %u at %X/%X",
 			 txn->xid, LSN_FORMAT_ARGS(lsn));
 
 		/*
@@ -774,6 +783,33 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		SnapBuildSnapIncRefcount(builder->snapshot);
 		ReorderBufferAddSnapshot(builder->reorder, txn->xid, lsn,
 								 builder->snapshot);
+
+		/*
+		 * Add invalidation messages to the reorder buffer of in-progress
+		 * transactions except the current committed transaction, for which we
+		 * will execute invalidations at the end.
+		 *
+		 * It is required, otherwise, we will end up using the stale catcache
+		 * contents built by the current transaction even after its decoding,
+		 * which should have been invalidated due to concurrent catalog
+		 * changing transaction.
+		 */
+		if (txn->xid != xid)
+		{
+			uint32 ninvalidations;
+			SharedInvalidationMessage *msgs = NULL;
+
+			ninvalidations = ReorderBufferGetInvalidations(builder->reorder,
+														   xid, &msgs);
+
+			if (ninvalidations > 0)
+			{
+				Assert(msgs != NULL);
+
+				ReorderBufferAddInvalidations(builder->reorder, txn->xid, lsn,
+											  ninvalidations, msgs);
+			}
+		}
 	}
 }
 
@@ -1045,8 +1081,11 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/* refcount of the snapshot builder for the new snapshot */
 		SnapBuildSnapIncRefcount(builder->snapshot);
 
-		/* add a new catalog snapshot to all currently running transactions */
-		SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
+		/*
+		 * Add a new catalog snapshot and invalidations messages to all
+		 * currently running transactions.
+		 */
+		SnapBuildDistributeSnapshotAndInval(builder, lsn, xid);
 	}
 }
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 3be0cbd7ebe..24e88c409ba 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -758,6 +758,10 @@ extern TransactionId *ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb);
 
 extern void ReorderBufferSetRestartPoint(ReorderBuffer *rb, XLogRecPtr ptr);
 
+extern uint32 ReorderBufferGetInvalidations(ReorderBuffer *rb,
+											TransactionId xid,
+											SharedInvalidationMessage **msgs);
+
 extern void StartupReorderBuffer(void);
 
 #endif
-- 
2.43.5

v19_REL_17-0001-Distribute-invalidatons-if-change-in-cata.patch (application/octet-stream)
From 49dc33b8782024d6b0c0f2b2fa8bc7741c3aed9a Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Fri, 23 Aug 2024 14:02:20 +0530
Subject: [PATCH v19_REL_17] Distribute invalidatons if change in catalog
 tables

Distribute invalidations to inprogress transactions if the current
committed transaction change any catalog table.
---
 contrib/test_decoding/Makefile                |  2 +-
 .../expected/invalidation_distrubution.out    | 20 ++++++
 contrib/test_decoding/meson.build             |  1 +
 .../specs/invalidation_distrubution.spec      | 32 +++++++++
 .../replication/logical/reorderbuffer.c       | 23 +++++++
 src/backend/replication/logical/snapbuild.c   | 67 +++++++++++++++----
 src/include/replication/reorderbuffer.h       |  4 ++
 7 files changed, 134 insertions(+), 15 deletions(-)
 create mode 100644 contrib/test_decoding/expected/invalidation_distrubution.out
 create mode 100644 contrib/test_decoding/specs/invalidation_distrubution.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index a4ba1a509ae..eef70770674 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -9,7 +9,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
 	twophase_snapshot slot_creation_error catalog_change_snapshot \
-	skip_snapshot_restore
+	skip_snapshot_restore invalidation_distrubution
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/invalidation_distrubution.out b/contrib/test_decoding/expected/invalidation_distrubution.out
new file mode 100644
index 00000000000..ad0a944cbf3
--- /dev/null
+++ b/contrib/test_decoding/expected/invalidation_distrubution.out
@@ -0,0 +1,20 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1_insert_tbl1 s1_begin s1_insert_tbl1 s2_alter_pub_add_tbl s1_commit s1_insert_tbl1 s2_get_binary_changes
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s1_begin: BEGIN;
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s2_alter_pub_add_tbl: ALTER PUBLICATION pub ADD TABLE tbl1;
+step s1_commit: COMMIT;
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s2_get_binary_changes: SELECT count(data) FROM pg_logical_slot_get_binary_changes('isolation_slot', NULL, NULL, 'proto_version', '4', 'publication_names', 'pub') WHERE get_byte(data, 0) = 73;
+count
+-----
+    1
+(1 row)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/meson.build b/contrib/test_decoding/meson.build
index f643dc81a2c..b31c433681d 100644
--- a/contrib/test_decoding/meson.build
+++ b/contrib/test_decoding/meson.build
@@ -63,6 +63,7 @@ tests += {
       'twophase_snapshot',
       'slot_creation_error',
       'skip_snapshot_restore',
+      'invalidation_distrubution',
     ],
     'regress_args': [
       '--temp-config', files('logical.conf'),
diff --git a/contrib/test_decoding/specs/invalidation_distrubution.spec b/contrib/test_decoding/specs/invalidation_distrubution.spec
new file mode 100644
index 00000000000..decbed627e3
--- /dev/null
+++ b/contrib/test_decoding/specs/invalidation_distrubution.spec
@@ -0,0 +1,32 @@
+# Test that catalog cache invalidation messages are distributed to ongoing
+# transactions, ensuring they can access the updated catalog content after
+# processing these messages.
+setup
+{
+    SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'pgoutput');
+    CREATE TABLE tbl1(val1 integer, val2 integer);
+    CREATE PUBLICATION pub;
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    DROP PUBLICATION pub;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s1"
+setup { SET synchronous_commit=on; }
+
+step "s1_begin" { BEGIN; }
+step "s1_insert_tbl1" { INSERT INTO tbl1 (val1, val2) VALUES (1, 1); }
+step "s1_commit" { COMMIT; }
+
+session "s2"
+setup { SET synchronous_commit=on; }
+
+step "s2_alter_pub_add_tbl" { ALTER PUBLICATION pub ADD TABLE tbl1; }
+step "s2_get_binary_changes" { SELECT count(data) FROM pg_logical_slot_get_binary_changes('isolation_slot', NULL, NULL, 'proto_version', '4', 'publication_names', 'pub') WHERE get_byte(data, 0) = 73; }
+
+# Expect to get one insert change. LOGICAL_REP_MSG_INSERT = 'I'
+permutation "s1_insert_tbl1" "s1_begin" "s1_insert_tbl1" "s2_alter_pub_add_tbl" "s1_commit" "s1_insert_tbl1" "s2_get_binary_changes"
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 9c742e96eb3..03eb005c39d 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -5337,3 +5337,26 @@ restart:
 		*cmax = ent->cmax;
 	return true;
 }
+
+/*
+ * Count invalidation messages of specified transaction.
+ *
+ * Returns number of messages, and msgs is set to the pointer of the linked
+ * list for the messages.
+ */
+uint32
+ReorderBufferGetInvalidations(ReorderBuffer *rb, TransactionId xid,
+							  SharedInvalidationMessage **msgs)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	if (txn == NULL)
+		return 0;
+
+	*msgs = txn->invalidations;
+
+	return txn->ninvalidations;
+}
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index ae676145e60..110e0b0a044 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -300,7 +300,7 @@ static void SnapBuildFreeSnapshot(Snapshot snap);
 
 static void SnapBuildSnapIncRefcount(Snapshot snap);
 
-static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
+static void SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid);
 
 static inline bool SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid,
 												 uint32 xinfo);
@@ -859,15 +859,15 @@ SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
 }
 
 /*
- * Add a new Snapshot to all transactions we're decoding that currently are
- * in-progress so they can see new catalog contents made by the transaction
- * that just committed. This is necessary because those in-progress
- * transactions will use the new catalog's contents from here on (at the very
- * least everything they do needs to be compatible with newer catalog
- * contents).
+ * Add a new Snapshot and invalidation messages to all transactions we're
+ * decoding that currently are in-progress so they can see new catalog contents
+ * made by the transaction that just committed. This is necessary because those
+ * in-progress transactions will use the new catalog's contents from here on
+ * (at the very least everything they do needs to be compatible with newer
+ * catalog contents).
  */
 static void
-SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
+SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid)
 {
 	dlist_iter	txn_i;
 	ReorderBufferTXN *txn;
@@ -875,7 +875,8 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 	/*
 	 * Iterate through all toplevel transactions. This can include
 	 * subtransactions which we just don't yet know to be that, but that's
-	 * fine, they will just get an unnecessary snapshot queued.
+	 * fine, they will just get an unnecessary snapshot and invalidations
+	 * queued.
 	 */
 	dlist_foreach(txn_i, &builder->reorder->toplevel_by_lsn)
 	{
@@ -888,6 +889,14 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		 * transaction which in turn implies we don't yet need a snapshot at
 		 * all. We'll add a snapshot when the first change gets queued.
 		 *
+		 * Similarly, we don't need to add invalidations to a transaction whose
+		 * base snapshot is not yet set. Once a base snapshot is built, it will
+		 * include the xids of committed transactions that have modified the
+		 * catalog, thus reflecting the new catalog contents. The existing
+		 * catalog cache will have already been invalidated after processing
+		 * the invalidations in the transaction that modified catalogs,
+		 * ensuring that a fresh cache is constructed during decoding.
+		 *
 		 * NB: This works correctly even for subtransactions because
 		 * ReorderBufferAssignChild() takes care to transfer the base snapshot
 		 * to the top-level transaction, and while iterating the changequeue
@@ -897,13 +906,13 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 			continue;
 
 		/*
-		 * We don't need to add snapshot to prepared transactions as they
-		 * should not see the new catalog contents.
+		 * We don't need to add snapshot or invalidations to prepared
+		 * transactions as they should not see the new catalog contents.
 		 */
 		if (rbtxn_prepared(txn) || rbtxn_skip_prepared(txn))
 			continue;
 
-		elog(DEBUG2, "adding a new snapshot to %u at %X/%X",
+		elog(DEBUG2, "adding a new snapshot and invalidations to %u at %X/%X",
 			 txn->xid, LSN_FORMAT_ARGS(lsn));
 
 		/*
@@ -913,6 +922,33 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		SnapBuildSnapIncRefcount(builder->snapshot);
 		ReorderBufferAddSnapshot(builder->reorder, txn->xid, lsn,
 								 builder->snapshot);
+
+		/*
+		 * Add invalidation messages to the reorder buffer of in-progress
+		 * transactions except the current committed transaction, for which we
+		 * will execute invalidations at the end.
+		 *
+		 * It is required, otherwise, we will end up using the stale catcache
+		 * contents built by the current transaction even after its decoding,
+		 * which should have been invalidated due to concurrent catalog
+		 * changing transaction.
+		 */
+		if (txn->xid != xid)
+		{
+			uint32 ninvalidations;
+			SharedInvalidationMessage *msgs = NULL;
+
+			ninvalidations = ReorderBufferGetInvalidations(builder->reorder,
+														   xid, &msgs);
+
+			if (ninvalidations > 0)
+			{
+				Assert(msgs != NULL);
+
+				ReorderBufferAddInvalidations(builder->reorder, txn->xid, lsn,
+											  ninvalidations, msgs);
+			}
+		}
 	}
 }
 
@@ -1184,8 +1220,11 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/* refcount of the snapshot builder for the new snapshot */
 		SnapBuildSnapIncRefcount(builder->snapshot);
 
-		/* add a new catalog snapshot to all currently running transactions */
-		SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
+		/*
+		 * Add a new catalog snapshot and invalidations messages to all
+		 * currently running transactions.
+		 */
+		SnapBuildDistributeSnapshotAndInval(builder, lsn, xid);
 	}
 }
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 7de50462dcf..4c56f219fd8 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -729,6 +729,10 @@ extern TransactionId *ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb);
 
 extern void ReorderBufferSetRestartPoint(ReorderBuffer *rb, XLogRecPtr ptr);
 
+extern uint32 ReorderBufferGetInvalidations(ReorderBuffer *rb,
+											TransactionId xid,
+											SharedInvalidationMessage **msgs);
+
 extern void StartupReorderBuffer(void);
 
 #endif
-- 
2.43.5

v19_REL_16-0001-Distribute-invalidatons-if-change-in-cata.patch (application/octet-stream)
From c3a49990c6c2605a2a1834c6d520af173b3c6725 Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Fri, 23 Aug 2024 14:02:20 +0530
Subject: [PATCH v19_REL_16] Distribute invalidatons if change in catalog
 tables

Distribute invalidations to inprogress transactions if the current
committed transaction change any catalog table.
---
 contrib/test_decoding/Makefile                |  2 +-
 .../expected/invalidation_distrubution.out    | 20 ++++++
 contrib/test_decoding/meson.build             |  1 +
 .../specs/invalidation_distrubution.spec      | 32 +++++++++
 .../replication/logical/reorderbuffer.c       | 23 +++++++
 src/backend/replication/logical/snapbuild.c   | 67 +++++++++++++++----
 src/include/replication/reorderbuffer.h       |  4 ++
 7 files changed, 134 insertions(+), 15 deletions(-)
 create mode 100644 contrib/test_decoding/expected/invalidation_distrubution.out
 create mode 100644 contrib/test_decoding/specs/invalidation_distrubution.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index a4ba1a509ae..eef70770674 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -9,7 +9,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
 	twophase_snapshot slot_creation_error catalog_change_snapshot \
-	skip_snapshot_restore
+	skip_snapshot_restore invalidation_distrubution
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/invalidation_distrubution.out b/contrib/test_decoding/expected/invalidation_distrubution.out
new file mode 100644
index 00000000000..ad0a944cbf3
--- /dev/null
+++ b/contrib/test_decoding/expected/invalidation_distrubution.out
@@ -0,0 +1,20 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1_insert_tbl1 s1_begin s1_insert_tbl1 s2_alter_pub_add_tbl s1_commit s1_insert_tbl1 s2_get_binary_changes
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s1_begin: BEGIN;
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s2_alter_pub_add_tbl: ALTER PUBLICATION pub ADD TABLE tbl1;
+step s1_commit: COMMIT;
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s2_get_binary_changes: SELECT count(data) FROM pg_logical_slot_get_binary_changes('isolation_slot', NULL, NULL, 'proto_version', '4', 'publication_names', 'pub') WHERE get_byte(data, 0) = 73;
+count
+-----
+    1
+(1 row)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/meson.build b/contrib/test_decoding/meson.build
index 2dd3ede41bf..273d26643c0 100644
--- a/contrib/test_decoding/meson.build
+++ b/contrib/test_decoding/meson.build
@@ -63,6 +63,7 @@ tests += {
       'twophase_snapshot',
       'slot_creation_error',
       'skip_snapshot_restore',
+      'invalidation_distrubution',
     ],
     'regress_args': [
       '--temp-config', files('logical.conf'),
diff --git a/contrib/test_decoding/specs/invalidation_distrubution.spec b/contrib/test_decoding/specs/invalidation_distrubution.spec
new file mode 100644
index 00000000000..decbed627e3
--- /dev/null
+++ b/contrib/test_decoding/specs/invalidation_distrubution.spec
@@ -0,0 +1,32 @@
+# Test that catalog cache invalidation messages are distributed to ongoing
+# transactions, ensuring they can access the updated catalog content after
+# processing these messages.
+setup
+{
+    SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'pgoutput');
+    CREATE TABLE tbl1(val1 integer, val2 integer);
+    CREATE PUBLICATION pub;
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    DROP PUBLICATION pub;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s1"
+setup { SET synchronous_commit=on; }
+
+step "s1_begin" { BEGIN; }
+step "s1_insert_tbl1" { INSERT INTO tbl1 (val1, val2) VALUES (1, 1); }
+step "s1_commit" { COMMIT; }
+
+session "s2"
+setup { SET synchronous_commit=on; }
+
+step "s2_alter_pub_add_tbl" { ALTER PUBLICATION pub ADD TABLE tbl1; }
+step "s2_get_binary_changes" { SELECT count(data) FROM pg_logical_slot_get_binary_changes('isolation_slot', NULL, NULL, 'proto_version', '4', 'publication_names', 'pub') WHERE get_byte(data, 0) = 73; }
+
+# Expect to get one insert change. LOGICAL_REP_MSG_INSERT = 'I'
+permutation "s1_insert_tbl1" "s1_begin" "s1_insert_tbl1" "s2_alter_pub_add_tbl" "s1_commit" "s1_insert_tbl1" "s2_get_binary_changes"
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 930549948af..fa04e829cc9 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -5283,3 +5283,26 @@ restart:
 		*cmax = ent->cmax;
 	return true;
 }
+
+/*
+ * Count invalidation messages of specified transaction.
+ *
+ * Returns number of messages, and msgs is set to the pointer of the linked
+ * list for the messages.
+ */
+uint32
+ReorderBufferGetInvalidations(ReorderBuffer *rb, TransactionId xid,
+							  SharedInvalidationMessage **msgs)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	if (txn == NULL)
+		return 0;
+
+	*msgs = txn->invalidations;
+
+	return txn->ninvalidations;
+}
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 3ed2f79dd06..d48ebb8337b 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -288,7 +288,7 @@ static void SnapBuildFreeSnapshot(Snapshot snap);
 
 static void SnapBuildSnapIncRefcount(Snapshot snap);
 
-static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
+static void SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid);
 
 static inline bool SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid,
 												 uint32 xinfo);
@@ -845,15 +845,15 @@ SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
 }
 
 /*
- * Add a new Snapshot to all transactions we're decoding that currently are
- * in-progress so they can see new catalog contents made by the transaction
- * that just committed. This is necessary because those in-progress
- * transactions will use the new catalog's contents from here on (at the very
- * least everything they do needs to be compatible with newer catalog
- * contents).
+ * Add a new Snapshot and invalidation messages to all transactions we're
+ * decoding that currently are in-progress so they can see new catalog contents
+ * made by the transaction that just committed. This is necessary because those
+ * in-progress transactions will use the new catalog's contents from here on
+ * (at the very least everything they do needs to be compatible with newer
+ * catalog contents).
  */
 static void
-SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
+SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid)
 {
 	dlist_iter	txn_i;
 	ReorderBufferTXN *txn;
@@ -861,7 +861,8 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 	/*
 	 * Iterate through all toplevel transactions. This can include
 	 * subtransactions which we just don't yet know to be that, but that's
-	 * fine, they will just get an unnecessary snapshot queued.
+	 * fine, they will just get an unnecessary snapshot and invalidations
+	 * queued.
 	 */
 	dlist_foreach(txn_i, &builder->reorder->toplevel_by_lsn)
 	{
@@ -874,6 +875,14 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		 * transaction which in turn implies we don't yet need a snapshot at
 		 * all. We'll add a snapshot when the first change gets queued.
 		 *
+		 * Similarly, we don't need to add invalidations to a transaction whose
+		 * base snapshot is not yet set. Once a base snapshot is built, it will
+		 * include the xids of committed transactions that have modified the
+		 * catalog, thus reflecting the new catalog contents. The existing
+		 * catalog cache will have already been invalidated after processing
+		 * the invalidations in the transaction that modified catalogs,
+		 * ensuring that a fresh cache is constructed during decoding.
+		 *
 		 * NB: This works correctly even for subtransactions because
 		 * ReorderBufferAssignChild() takes care to transfer the base snapshot
 		 * to the top-level transaction, and while iterating the changequeue
@@ -883,13 +892,13 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 			continue;
 
 		/*
-		 * We don't need to add snapshot to prepared transactions as they
-		 * should not see the new catalog contents.
+		 * We don't need to add snapshot or invalidations to prepared
+		 * transactions as they should not see the new catalog contents.
 		 */
 		if (rbtxn_prepared(txn) || rbtxn_skip_prepared(txn))
 			continue;
 
-		elog(DEBUG2, "adding a new snapshot to %u at %X/%X",
+		elog(DEBUG2, "adding a new snapshot and invalidations to %u at %X/%X",
 			 txn->xid, LSN_FORMAT_ARGS(lsn));
 
 		/*
@@ -899,6 +908,33 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		SnapBuildSnapIncRefcount(builder->snapshot);
 		ReorderBufferAddSnapshot(builder->reorder, txn->xid, lsn,
 								 builder->snapshot);
+
+		/*
+		 * Add invalidation messages to the reorder buffer of in-progress
+		 * transactions except the current committed transaction, for which we
+		 * will execute invalidations at the end.
+		 *
+		 * It is required, otherwise, we will end up using the stale catcache
+		 * contents built by the current transaction even after its decoding,
+		 * which should have been invalidated due to concurrent catalog
+		 * changing transaction.
+		 */
+		if (txn->xid != xid)
+		{
+			uint32 ninvalidations;
+			SharedInvalidationMessage *msgs = NULL;
+
+			ninvalidations = ReorderBufferGetInvalidations(builder->reorder,
+														   xid, &msgs);
+
+			if (ninvalidations > 0)
+			{
+				Assert(msgs != NULL);
+
+				ReorderBufferAddInvalidations(builder->reorder, txn->xid, lsn,
+											  ninvalidations, msgs);
+			}
+		}
 	}
 }
 
@@ -1170,8 +1206,11 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/* refcount of the snapshot builder for the new snapshot */
 		SnapBuildSnapIncRefcount(builder->snapshot);
 
-		/* add a new catalog snapshot to all currently running transactions */
-		SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
+		/*
+		 * Add a new catalog snapshot and invalidations messages to all
+		 * currently running transactions.
+		 */
+		SnapBuildDistributeSnapshotAndInval(builder, lsn, xid);
 	}
 }
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 3cb03168de2..8a32bbe28df 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -748,6 +748,10 @@ extern TransactionId *ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb);
 
 extern void ReorderBufferSetRestartPoint(ReorderBuffer *rb, XLogRecPtr ptr);
 
+extern uint32 ReorderBufferGetInvalidations(ReorderBuffer *rb,
+											TransactionId xid,
+											SharedInvalidationMessage **msgs);
+
 extern void StartupReorderBuffer(void);
 
 #endif
-- 
2.43.5

#94Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Hayato Kuroda (Fujitsu) (#91)
2 attachment(s)
RE: long-standing data loss bug in initial sync of logical replication

Dear hackers,

Regarding PG13, it cannot be applied as-is, so some adjustments are needed.
I will share it in upcoming posts.

Here is a patch set for PG13. Unlike for PG14-17, the patch could not be created
as-is, because:

1. The WAL record for invalidation messages (XLOG_XACT_INVALIDATIONS) does not exist.
2. Thus the ReorderBufferChange for invalidations does not exist either; our patch
   tries to distribute it, but that cannot be done as-is.
3. The code assumes that invalidation messages can be added only once per transaction.
4. The points at which invalidation messages are consumed are limited to:
   a. when a COMMAND_ID change is popped,
   b. the start of decoding a transaction, or
   c. the end of decoding a transaction.

The above means that invalidation messages cannot be executed while a transaction is
being decoded. I created two patches to resolve the data loss issue: 0001 has fewer
code changes but resolves only part of the issue, while 0002 is a much larger change
that provides a complete solution.

0001 - mostly the same as the patches for the other versions. ReorderBufferAddInvalidations()
was adjusted so that it can be called several times (see the sketch below). As said
above, 0001 cannot execute invalidation messages while the transactions are being decoded.
0002 - introduces a new ReorderBufferChange type for invalidation messages,
so that they are handled as on PG14 and later.
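
To make the 0001 adjustment concrete, here is a trimmed-down sketch of the
accumulation logic (a fragment of the function body, using txn/nmsgs/msgs from its
context); the attached v19_REL_13 0001 patch is the authoritative version,
including the redirection of subtransaction invalidations to the top-level
transaction:

```
	/*
	 * Sketch: instead of erroring out with "only ever add one set of
	 * invalidations", append the new messages to any already-collected ones.
	 */
	if (txn->ninvalidations == 0)
	{
		/* first set of invalidations for this transaction */
		txn->invalidations = (SharedInvalidationMessage *)
			palloc(sizeof(SharedInvalidationMessage) * nmsgs);
	}
	else
	{
		/* grow the existing array to hold the additional messages */
		txn->invalidations = (SharedInvalidationMessage *)
			repalloc(txn->invalidations,
					 sizeof(SharedInvalidationMessage) *
					 (txn->ninvalidations + nmsgs));
	}

	memcpy(txn->invalidations + txn->ninvalidations, msgs,
		   sizeof(SharedInvalidationMessage) * nmsgs);
	txn->ninvalidations += nmsgs;
```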

Here is an example. Assume that the table foo exists on both nodes, along with a
publication "pub" which publishes all tables and a subscription "sub" which
subscribes to "pub". What happens if the following workload is executed?

```
S1                                    S2
BEGIN;
INSERT INTO foo VALUES (1)
                                      ALTER PUBLICATION pub RENAME TO pub_renamed;
INSERT INTO foo VALUES (2)
COMMIT;

LR -> ?
```

With 0001 alone, tuples (1) and (2) would be replicated to the subscriber, and an
error "publication "pub" does not exist" would be raised when new changes are made
later.

0001+0002 works more aggressively: the error would be raised already while the S1
transaction is being decoded. The behavior is the same as for the patched PG14-PG17.
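
For the related ADD TABLE case, the attached isolation spec can also be reproduced
by hand; below is a rough SQL transliteration (session markers in comments; the
slot name, 'proto_version' '1' and the get_byte() filter are taken from the PG13
spec, and the expected count on a patched build comes from the attached expected
file):

```
-- setup (once), as in the attached spec
SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'pgoutput');
CREATE TABLE tbl1(val1 integer, val2 integer);
CREATE PUBLICATION pub;

INSERT INTO tbl1 (val1, val2) VALUES (1, 1);     -- [S1]
BEGIN;                                           -- [S1]
INSERT INTO tbl1 (val1, val2) VALUES (1, 1);     -- [S1]
ALTER PUBLICATION pub ADD TABLE tbl1;            -- [S2], concurrent session
COMMIT;                                          -- [S1]
INSERT INTO tbl1 (val1, val2) VALUES (1, 1);     -- [S1]

-- [S2] count decoded INSERT messages ('I' = 73)
SELECT count(data)
  FROM pg_logical_slot_get_binary_changes('isolation_slot', NULL, NULL,
       'proto_version', '1', 'publication_names', 'pub')
 WHERE get_byte(data, 0) = 73;
-- expected: 1 on a patched build (the last insert is published); an
-- unpatched build is expected to miss it because of the stale cache.
```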

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

v19_REL_13-0001-Distribute-invalidatons-if-change-in-cata.patch (application/octet-stream)
From e79dbf562133b4cf2f34a175aa494609f251bb13 Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Fri, 23 Aug 2024 14:02:20 +0530
Subject: [PATCH v19_REL_13 1/2] Distribute invalidatons if change in catalog
 tables

Distribute invalidations to inprogress transactions if the current
committed transaction change any catalog table.
---
 contrib/test_decoding/Makefile                |  2 +-
 .../expected/invalidation_distrubution.out    | 20 ++++++
 .../specs/invalidation_distrubution.spec      | 32 ++++++++++
 .../replication/logical/reorderbuffer.c       | 64 ++++++++++++++++---
 src/backend/replication/logical/snapbuild.c   | 63 ++++++++++++++----
 src/include/replication/reorderbuffer.h       |  4 ++
 6 files changed, 164 insertions(+), 21 deletions(-)
 create mode 100644 contrib/test_decoding/expected/invalidation_distrubution.out
 create mode 100644 contrib/test_decoding/specs/invalidation_distrubution.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 735b7e7653c..f122dc3a82d 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -8,7 +8,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	spill slot truncate
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top catalog_change_snapshot \
-	skip_snapshot_restore
+	skip_snapshot_restore invalidation_distrubution
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/invalidation_distrubution.out b/contrib/test_decoding/expected/invalidation_distrubution.out
new file mode 100644
index 00000000000..eb70eda9042
--- /dev/null
+++ b/contrib/test_decoding/expected/invalidation_distrubution.out
@@ -0,0 +1,20 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1_insert_tbl1 s1_begin s1_insert_tbl1 s2_alter_pub_add_tbl s1_commit s1_insert_tbl1 s2_get_binary_changes
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s1_begin: BEGIN;
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s2_alter_pub_add_tbl: ALTER PUBLICATION pub ADD TABLE tbl1;
+step s1_commit: COMMIT;
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s2_get_binary_changes: SELECT count(data) FROM pg_logical_slot_get_binary_changes('isolation_slot', NULL, NULL, 'proto_version', '1', 'publication_names', 'pub') WHERE get_byte(data, 0) = 73;
+count
+-----
+    1
+(1 row)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/invalidation_distrubution.spec b/contrib/test_decoding/specs/invalidation_distrubution.spec
new file mode 100644
index 00000000000..ca051fc1e85
--- /dev/null
+++ b/contrib/test_decoding/specs/invalidation_distrubution.spec
@@ -0,0 +1,32 @@
+# Test that catalog cache invalidation messages are distributed to ongoing
+# transactions, ensuring they can access the updated catalog content after
+# processing these messages.
+setup
+{
+    SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'pgoutput');
+    CREATE TABLE tbl1(val1 integer, val2 integer);
+    CREATE PUBLICATION pub;
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    DROP PUBLICATION pub;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s1"
+setup { SET synchronous_commit=on; }
+
+step "s1_begin" { BEGIN; }
+step "s1_insert_tbl1" { INSERT INTO tbl1 (val1, val2) VALUES (1, 1); }
+step "s1_commit" { COMMIT; }
+
+session "s2"
+setup { SET synchronous_commit=on; }
+
+step "s2_alter_pub_add_tbl" { ALTER PUBLICATION pub ADD TABLE tbl1; }
+step "s2_get_binary_changes" { SELECT count(data) FROM pg_logical_slot_get_binary_changes('isolation_slot', NULL, NULL, 'proto_version', '1', 'publication_names', 'pub') WHERE get_byte(data, 0) = 73; }
+
+# Expect to get one insert change. LOGICAL_REP_MSG_INSERT = 'I'
+permutation "s1_insert_tbl1" "s1_begin" "s1_insert_tbl1" "s2_alter_pub_add_tbl" "s1_commit" "s1_insert_tbl1" "s2_get_binary_changes"
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 56c25e3a6da..fa9413fa2a0 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2264,20 +2264,45 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
 							  SharedInvalidationMessage *msgs)
 {
 	ReorderBufferTXN *txn;
+	MemoryContext oldcontext;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
-	if (txn->ninvalidations != 0)
-		elog(ERROR, "only ever add one set of invalidations");
+	oldcontext = MemoryContextSwitchTo(rb->context);
+
+	/*
+	 * Collect all the invalidations under the top transaction, if available,
+	 * so that we can execute them all together.
+	 */
+	if (txn->toplevel_xid)
+	{
+		txn = ReorderBufferTXNByXid(rb, txn->toplevel_xid, true, NULL, lsn,
+									true);
+	}
 
 	Assert(nmsgs > 0);
 
-	txn->ninvalidations = nmsgs;
-	txn->invalidations = (SharedInvalidationMessage *)
-		MemoryContextAlloc(rb->context,
-						   sizeof(SharedInvalidationMessage) * nmsgs);
-	memcpy(txn->invalidations, msgs,
-		   sizeof(SharedInvalidationMessage) * nmsgs);
+	/* Accumulate invalidations. */
+	if (txn->ninvalidations == 0)
+	{
+		txn->ninvalidations = nmsgs;
+		txn->invalidations = (SharedInvalidationMessage *)
+			palloc(sizeof(SharedInvalidationMessage) * nmsgs);
+		memcpy(txn->invalidations, msgs,
+			   sizeof(SharedInvalidationMessage) * nmsgs);
+	}
+	else
+	{
+		txn->invalidations = (SharedInvalidationMessage *)
+			repalloc(txn->invalidations, sizeof(SharedInvalidationMessage) *
+					 (txn->ninvalidations + nmsgs));
+
+		memcpy(txn->invalidations + txn->ninvalidations, msgs,
+			   nmsgs * sizeof(SharedInvalidationMessage));
+		txn->ninvalidations += nmsgs;
+	}
+
+	MemoryContextSwitchTo(oldcontext);
 }
 
 /*
@@ -3895,3 +3920,26 @@ restart:
 		*cmax = ent->cmax;
 	return true;
 }
+
+/*
+ * Count invalidation messages of specified transaction.
+ *
+ * Returns number of messages, and msgs is set to the pointer of the linked
+ * list for the messages.
+ */
+uint32
+ReorderBufferGetInvalidations(ReorderBuffer *rb, TransactionId xid,
+							  SharedInvalidationMessage **msgs)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	if (txn == NULL)
+		return 0;
+
+	*msgs = txn->invalidations;
+
+	return txn->ninvalidations;
+}
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 7546de96763..3bda41c5251 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -292,7 +292,7 @@ static void SnapBuildFreeSnapshot(Snapshot snap);
 
 static void SnapBuildSnapIncRefcount(Snapshot snap);
 
-static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
+static void SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid);
 
 /* xlog reading helper functions for SnapBuildProcessRunningXacts */
 static bool SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *running);
@@ -861,15 +861,15 @@ SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
 }
 
 /*
- * Add a new Snapshot to all transactions we're decoding that currently are
- * in-progress so they can see new catalog contents made by the transaction
- * that just committed. This is necessary because those in-progress
- * transactions will use the new catalog's contents from here on (at the very
- * least everything they do needs to be compatible with newer catalog
- * contents).
+ * Add a new Snapshot and invalidation messages to all transactions we're
+ * decoding that currently are in-progress so they can see new catalog contents
+ * made by the transaction that just committed. This is necessary because those
+ * in-progress transactions will use the new catalog's contents from here on
+ * (at the very least everything they do needs to be compatible with newer
+ * catalog contents).
  */
 static void
-SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
+SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid)
 {
 	dlist_iter	txn_i;
 	ReorderBufferTXN *txn;
@@ -877,7 +877,8 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 	/*
 	 * Iterate through all toplevel transactions. This can include
 	 * subtransactions which we just don't yet know to be that, but that's
-	 * fine, they will just get an unnecessary snapshot queued.
+	 * fine, they will just get an unnecessary snapshot and invalidations
+	 * queued.
 	 */
 	dlist_foreach(txn_i, &builder->reorder->toplevel_by_lsn)
 	{
@@ -890,6 +891,14 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		 * transaction which in turn implies we don't yet need a snapshot at
 		 * all. We'll add a snapshot when the first change gets queued.
 		 *
+		 * Similarly, we don't need to add invalidations to a transaction whose
+		 * base snapshot is not yet set. Once a base snapshot is built, it will
+		 * include the xids of committed transactions that have modified the
+		 * catalog, thus reflecting the new catalog contents. The existing
+		 * catalog cache will have already been invalidated after processing
+		 * the invalidations in the transaction that modified catalogs,
+		 * ensuring that a fresh cache is constructed during decoding.
+		 *
 		 * NB: This works correctly even for subtransactions because
 		 * ReorderBufferAssignChild() takes care to transfer the base snapshot
 		 * to the top-level transaction, and while iterating the changequeue
@@ -898,7 +907,7 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, txn->xid))
 			continue;
 
-		elog(DEBUG2, "adding a new snapshot to %u at %X/%X",
+		elog(DEBUG2, "adding a new snapshot and invalidations to %u at %X/%X",
 			 txn->xid, (uint32) (lsn >> 32), (uint32) lsn);
 
 		/*
@@ -908,6 +917,33 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		SnapBuildSnapIncRefcount(builder->snapshot);
 		ReorderBufferAddSnapshot(builder->reorder, txn->xid, lsn,
 								 builder->snapshot);
+
+		/*
+		 * Add invalidation messages to the reorder buffer of in-progress
+		 * transactions except the current committed transaction, for which we
+		 * will execute invalidations at the end.
+		 *
+		 * It is required, otherwise, we will end up using the stale catcache
+		 * contents built by the current transaction even after its decoding,
+		 * which should have been invalidated due to concurrent catalog
+		 * changing transaction.
+		 */
+		if (txn->xid != xid)
+		{
+			uint32 ninvalidations;
+			SharedInvalidationMessage *msgs = NULL;
+
+			ninvalidations = ReorderBufferGetInvalidations(builder->reorder,
+														   xid, &msgs);
+
+			if (ninvalidations > 0)
+			{
+				Assert(msgs != NULL);
+
+				ReorderBufferAddInvalidations(builder->reorder, txn->xid, lsn,
+											  ninvalidations, msgs);
+			}
+		}
 	}
 }
 
@@ -1186,8 +1222,11 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/* refcount of the snapshot builder for the new snapshot */
 		SnapBuildSnapIncRefcount(builder->snapshot);
 
-		/* add a new catalog snapshot to all currently running transactions */
-		SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
+		/*
+		 * Add a new catalog snapshot and invalidations messages to all
+		 * currently running transactions.
+		 */
+		SnapBuildDistributeSnapshotAndInval(builder, lsn, xid);
 	}
 }
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 5347597e92b..545cee891ed 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -463,6 +463,10 @@ TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
 void		ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
 
+uint32		ReorderBufferGetInvalidations(ReorderBuffer *rb,
+										  TransactionId xid,
+										  SharedInvalidationMessage **msgs);
+
 void		StartupReorderBuffer(void);
 
 #endif
-- 
2.43.5

v19_REL_13-0002-Backpatch-introducing-invalidation-messag.patch (application/octet-stream)
From 2e4192717429c9675c675eebef00cbee93dd363f Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Mon, 17 Mar 2025 11:25:49 +0900
Subject: [PATCH v19_REL_13 2/2] Backpatch introducing invalidation messages in
 ReorderBufferChangeType

---
 .../replication/logical/reorderbuffer.c       | 72 +++++++++++++++++--
 src/include/replication/reorderbuffer.h       | 16 ++++-
 2 files changed, 80 insertions(+), 8 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index fa9413fa2a0..697b45675a6 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -220,7 +220,7 @@ static void ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 static ReorderBufferChange *ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state);
 static void ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 									   ReorderBufferIterTXNState *state);
-static void ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMessage *msgs);
 
 /*
  * ---------------------------------------
@@ -484,6 +484,11 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			if (change->data.inval.invalidations)
+				pfree(change->data.inval.invalidations);
+			change->data.inval.invalidations = NULL;
+			break;
 	}
 
 	pfree(change);
@@ -1883,7 +1888,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						 * see new catalog contents, so execute all
 						 * invalidations.
 						 */
-						ReorderBufferExecuteInvalidations(rb, txn);
+						ReorderBufferExecuteInvalidations(txn->ninvalidations,
+														  txn->invalidations);
 					}
 
 					break;
@@ -1891,6 +1897,10 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
+				case REORDER_BUFFER_CHANGE_INVALIDATION:
+					/* Execute the invalidation messages locally */
+					ReorderBufferExecuteInvalidations(change->data.inval.ninvalidations,
+													  change->data.inval.invalidations);
 			}
 		}
 
@@ -1921,7 +1931,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		AbortCurrentTransaction();
 
 		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(rb, txn);
+		ReorderBufferExecuteInvalidations(txn->ninvalidations, txn->invalidations);
 
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
@@ -1947,7 +1957,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		AbortCurrentTransaction();
 
 		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(rb, txn);
+		ReorderBufferExecuteInvalidations(txn->ninvalidations, txn->invalidations);
 
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
@@ -2265,6 +2275,7 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
 {
 	ReorderBufferTXN *txn;
 	MemoryContext oldcontext;
+	ReorderBufferChange *change;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
@@ -2302,6 +2313,16 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
 		txn->ninvalidations += nmsgs;
 	}
 
+	change = ReorderBufferGetChange(rb);
+	change->action = REORDER_BUFFER_CHANGE_INVALIDATION;
+	change->data.inval.ninvalidations = nmsgs;
+	change->data.inval.invalidations = (SharedInvalidationMessage *)
+		palloc(sizeof(SharedInvalidationMessage) * nmsgs);
+	memcpy(change->data.inval.invalidations, msgs,
+		   sizeof(SharedInvalidationMessage) * nmsgs);
+
+	ReorderBufferQueueChange(rb, xid, lsn, change);
+
 	MemoryContextSwitchTo(oldcontext);
 }
 
@@ -2310,12 +2331,12 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
  * in the changestream but we don't know which those are.
  */
 static void
-ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMessage *msgs)
 {
 	int			i;
 
-	for (i = 0; i < txn->ninvalidations; i++)
-		LocalExecuteInvalidationMessage(&txn->invalidations[i]);
+	for (i = 0; i < nmsgs; i++)
+		LocalExecuteInvalidationMessage(&msgs[i]);
 }
 
 /*
@@ -2725,6 +2746,24 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			/* ReorderBufferChange contains everything important */
 			break;
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+		{
+			char	   *data;
+			Size		inval_size = sizeof(SharedInvalidationMessage) *
+				change->data.inval.ninvalidations;
+
+			sz += inval_size;
+
+			ReorderBufferSerializeReserve(rb, sz);
+			data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange);
+
+			/* might have been reallocated above */
+			ondisk = (ReorderBufferDiskChange *) rb->outbuf;
+			memcpy(data, change->data.inval.invalidations, inval_size);
+			data += inval_size;
+
+			break;
+		}
 	}
 
 	ondisk->size = sz;
@@ -2833,6 +2872,12 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			/* ReorderBufferChange contains everything important */
 			break;
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+		{
+			sz += sizeof(SharedInvalidationMessage) *
+				change->data.inval.ninvalidations;
+			break;
+		}
 	}
 
 	return sz;
@@ -3120,6 +3165,19 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+		{
+			Size		inval_size = sizeof(SharedInvalidationMessage) *
+				change->data.inval.ninvalidations;
+
+			change->data.inval.invalidations =
+				MemoryContextAlloc(rb->context, inval_size);
+
+			/* read the message */
+			memcpy(change->data.inval.invalidations, data, inval_size);
+
+			break;
+		}
 	}
 
 	dlist_push_tail(&txn->changes, &change->node);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 545cee891ed..dff58a2fd8f 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -63,7 +63,8 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT,
 	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM,
 	REORDER_BUFFER_CHANGE_TRUNCATE,
-	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT
+	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT,
+	REORDER_BUFFER_CHANGE_INVALIDATION
 };
 
 /* forward declaration */
@@ -150,6 +151,13 @@ typedef struct ReorderBufferChange
 			CommandId	cmax;
 			CommandId	combocid;
 		}			tuplecid;
+
+		/* Invalidation. */
+		struct
+		{
+			uint32		ninvalidations; /* Number of messages */
+			SharedInvalidationMessage *invalidations;	/* invalidation message */
+		}			inval;
 	}			data;
 
 	/*
@@ -467,6 +475,12 @@ uint32		ReorderBufferGetInvalidations(ReorderBuffer *rb,
 										  TransactionId xid,
 										  SharedInvalidationMessage **msgs);
 
+void		ReorderBufferAddInvalidationsForDistribute(ReorderBuffer *,
+													   TransactionId,
+													   XLogRecPtr lsn,
+													   Size nmsgs,
+													   SharedInvalidationMessage *msgs);
+
 void		StartupReorderBuffer(void);
 
 #endif
-- 
2.43.5

#95Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Hayato Kuroda (Fujitsu) (#93)
7 attachment(s)
RE: long-standing data loss bug in initial sync of logical replication

Dear hackers,

The attached patch set contains a proper commit message. It briefly describes the
background and the handling. For PG13, the same commit message is used for 0001;
0002 is still rough.
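
For reference, the attached isolation spec boils the race down to roughly the
sequence below. This is only a minimal sketch: the table, publication, and slot
names are the ones used in the spec, and proto_version has to match the branch
being tested (e.g. '1' on PG13, as in the per-branch specs).

-- setup
SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'pgoutput');
CREATE TABLE tbl1(val1 integer, val2 integer);
CREATE PUBLICATION pub;

-- session 1: force a base snapshot, then keep a transaction open
INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
BEGIN;
INSERT INTO tbl1 (val1, val2) VALUES (1, 1);

-- session 2: add the table while session 1's transaction is still in progress
ALTER PUBLICATION pub ADD TABLE tbl1;

-- session 1: commit, then insert once more
COMMIT;
INSERT INTO tbl1 (val1, val2) VALUES (1, 1);

-- decode: the last INSERT ('I' = 73) should show up exactly once; without the
-- fix it can be filtered out because of the stale catalog cache built while
-- decoding session 1's concurrent transaction
SELECT count(data)
FROM pg_logical_slot_get_binary_changes('isolation_slot', NULL, NULL,
       'proto_version', '1', 'publication_names', 'pub')
WHERE get_byte(data, 0) = 73;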

Renamed the backpatch files to .txt to make cfbot happy.

Thanks to Hou for working on it.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

v20_REL_13-0001-Fix-data-loss-in-logical-replication.txt (text/plain)
From 09c4c9482290a126f716482132813c9b0c09abc8 Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Fri, 23 Aug 2024 14:02:20 +0530
Subject: [PATCH v20_REL_13 1/2] Fix data loss in logical replication

Previously, logical replication could lose data if one user modified a
publication to add a table while another user concurrently modified that table
and committed later than the publication modification transaction. The issue
arose during the decoding of transactions modifying the table: if the initial
catalog cache was built using a snapshot taken before the publication DDL
execution, all subsequent changes to that table were decoded with an outdated
catalog cache, which caused them to be filtered from replication. This happened
because invalidation messages were only present in the publication modification
transaction, which was decoded before these subsequent changes.

This issue is not limited to publication DDLs; similar problems can occur with
ALTER TYPE statements executed concurrently with DMLs, leading to incorrect
decoding under outdated type contexts.

To address this, the commit improves logical decoding by ensuring that
invalidation messages from catalog-modifying transactions are distributed to
all concurrent in-progress transactions. This allows the necessary rebuild of
the catalog cache when decoding new changes, similar to handling historic
catalog snapshots (see SnapBuildDistributeNewCatalogSnapshot()).

Following this change, some performance regression is observed, primarily
during frequent execution of publication DDL statements that modify published
tables. This is an expected trade-off due to cache rebuild and distribution
overhead. The regression is minor or nearly nonexistent when DDLs do not affect
published tables or occur infrequently, making this a worthwhile cost to
resolve a longstanding data loss issue.

An alternative approach considered was to take a strong lock on each affected
table during publication modification. However, this would only address issues
related to publication DDLs and require locking every relation in the database
for publications created as FOR ALL TABLES, which is impractical. Thus, this
commit chooses to distribute invalidation messages as outlined above.
---
 contrib/test_decoding/Makefile                |  2 +-
 .../expected/invalidation_distrubution.out    | 20 ++++++
 .../specs/invalidation_distrubution.spec      | 32 ++++++++++
 .../replication/logical/reorderbuffer.c       | 64 ++++++++++++++++---
 src/backend/replication/logical/snapbuild.c   | 63 ++++++++++++++----
 src/include/replication/reorderbuffer.h       |  4 ++
 6 files changed, 164 insertions(+), 21 deletions(-)
 create mode 100644 contrib/test_decoding/expected/invalidation_distrubution.out
 create mode 100644 contrib/test_decoding/specs/invalidation_distrubution.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 735b7e7653c..f122dc3a82d 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -8,7 +8,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	spill slot truncate
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top catalog_change_snapshot \
-	skip_snapshot_restore
+	skip_snapshot_restore invalidation_distrubution
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/invalidation_distrubution.out b/contrib/test_decoding/expected/invalidation_distrubution.out
new file mode 100644
index 00000000000..eb70eda9042
--- /dev/null
+++ b/contrib/test_decoding/expected/invalidation_distrubution.out
@@ -0,0 +1,20 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1_insert_tbl1 s1_begin s1_insert_tbl1 s2_alter_pub_add_tbl s1_commit s1_insert_tbl1 s2_get_binary_changes
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s1_begin: BEGIN;
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s2_alter_pub_add_tbl: ALTER PUBLICATION pub ADD TABLE tbl1;
+step s1_commit: COMMIT;
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s2_get_binary_changes: SELECT count(data) FROM pg_logical_slot_get_binary_changes('isolation_slot', NULL, NULL, 'proto_version', '1', 'publication_names', 'pub') WHERE get_byte(data, 0) = 73;
+count
+-----
+    1
+(1 row)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/invalidation_distrubution.spec b/contrib/test_decoding/specs/invalidation_distrubution.spec
new file mode 100644
index 00000000000..ca051fc1e85
--- /dev/null
+++ b/contrib/test_decoding/specs/invalidation_distrubution.spec
@@ -0,0 +1,32 @@
+# Test that catalog cache invalidation messages are distributed to ongoing
+# transactions, ensuring they can access the updated catalog content after
+# processing these messages.
+setup
+{
+    SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'pgoutput');
+    CREATE TABLE tbl1(val1 integer, val2 integer);
+    CREATE PUBLICATION pub;
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    DROP PUBLICATION pub;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s1"
+setup { SET synchronous_commit=on; }
+
+step "s1_begin" { BEGIN; }
+step "s1_insert_tbl1" { INSERT INTO tbl1 (val1, val2) VALUES (1, 1); }
+step "s1_commit" { COMMIT; }
+
+session "s2"
+setup { SET synchronous_commit=on; }
+
+step "s2_alter_pub_add_tbl" { ALTER PUBLICATION pub ADD TABLE tbl1; }
+step "s2_get_binary_changes" { SELECT count(data) FROM pg_logical_slot_get_binary_changes('isolation_slot', NULL, NULL, 'proto_version', '1', 'publication_names', 'pub') WHERE get_byte(data, 0) = 73; }
+
+# Expect to get one insert change. LOGICAL_REP_MSG_INSERT = 'I'
+permutation "s1_insert_tbl1" "s1_begin" "s1_insert_tbl1" "s2_alter_pub_add_tbl" "s1_commit" "s1_insert_tbl1" "s2_get_binary_changes"
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 56c25e3a6da..fa9413fa2a0 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2264,20 +2264,45 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
 							  SharedInvalidationMessage *msgs)
 {
 	ReorderBufferTXN *txn;
+	MemoryContext oldcontext;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
-	if (txn->ninvalidations != 0)
-		elog(ERROR, "only ever add one set of invalidations");
+	oldcontext = MemoryContextSwitchTo(rb->context);
+
+	/*
+	 * Collect all the invalidations under the top transaction, if available,
+	 * so that we can execute them all together.
+	 */
+	if (txn->toplevel_xid)
+	{
+		txn = ReorderBufferTXNByXid(rb, txn->toplevel_xid, true, NULL, lsn,
+									true);
+	}
 
 	Assert(nmsgs > 0);
 
-	txn->ninvalidations = nmsgs;
-	txn->invalidations = (SharedInvalidationMessage *)
-		MemoryContextAlloc(rb->context,
-						   sizeof(SharedInvalidationMessage) * nmsgs);
-	memcpy(txn->invalidations, msgs,
-		   sizeof(SharedInvalidationMessage) * nmsgs);
+	/* Accumulate invalidations. */
+	if (txn->ninvalidations == 0)
+	{
+		txn->ninvalidations = nmsgs;
+		txn->invalidations = (SharedInvalidationMessage *)
+			palloc(sizeof(SharedInvalidationMessage) * nmsgs);
+		memcpy(txn->invalidations, msgs,
+			   sizeof(SharedInvalidationMessage) * nmsgs);
+	}
+	else
+	{
+		txn->invalidations = (SharedInvalidationMessage *)
+			repalloc(txn->invalidations, sizeof(SharedInvalidationMessage) *
+					 (txn->ninvalidations + nmsgs));
+
+		memcpy(txn->invalidations + txn->ninvalidations, msgs,
+			   nmsgs * sizeof(SharedInvalidationMessage));
+		txn->ninvalidations += nmsgs;
+	}
+
+	MemoryContextSwitchTo(oldcontext);
 }
 
 /*
@@ -3895,3 +3920,26 @@ restart:
 		*cmax = ent->cmax;
 	return true;
 }
+
+/*
+ * Count invalidation messages of specified transaction.
+ *
+ * Returns number of messages, and msgs is set to the pointer of the linked
+ * list for the messages.
+ */
+uint32
+ReorderBufferGetInvalidations(ReorderBuffer *rb, TransactionId xid,
+							  SharedInvalidationMessage **msgs)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	if (txn == NULL)
+		return 0;
+
+	*msgs = txn->invalidations;
+
+	return txn->ninvalidations;
+}
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 7546de96763..3bda41c5251 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -292,7 +292,7 @@ static void SnapBuildFreeSnapshot(Snapshot snap);
 
 static void SnapBuildSnapIncRefcount(Snapshot snap);
 
-static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
+static void SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid);
 
 /* xlog reading helper functions for SnapBuildProcessRunningXacts */
 static bool SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *running);
@@ -861,15 +861,15 @@ SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
 }
 
 /*
- * Add a new Snapshot to all transactions we're decoding that currently are
- * in-progress so they can see new catalog contents made by the transaction
- * that just committed. This is necessary because those in-progress
- * transactions will use the new catalog's contents from here on (at the very
- * least everything they do needs to be compatible with newer catalog
- * contents).
+ * Add a new Snapshot and invalidation messages to all transactions we're
+ * decoding that currently are in-progress so they can see new catalog contents
+ * made by the transaction that just committed. This is necessary because those
+ * in-progress transactions will use the new catalog's contents from here on
+ * (at the very least everything they do needs to be compatible with newer
+ * catalog contents).
  */
 static void
-SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
+SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid)
 {
 	dlist_iter	txn_i;
 	ReorderBufferTXN *txn;
@@ -877,7 +877,8 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 	/*
 	 * Iterate through all toplevel transactions. This can include
 	 * subtransactions which we just don't yet know to be that, but that's
-	 * fine, they will just get an unnecessary snapshot queued.
+	 * fine, they will just get an unnecessary snapshot and invalidations
+	 * queued.
 	 */
 	dlist_foreach(txn_i, &builder->reorder->toplevel_by_lsn)
 	{
@@ -890,6 +891,14 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		 * transaction which in turn implies we don't yet need a snapshot at
 		 * all. We'll add a snapshot when the first change gets queued.
 		 *
+		 * Similarly, we don't need to add invalidations to a transaction whose
+		 * base snapshot is not yet set. Once a base snapshot is built, it will
+		 * include the xids of committed transactions that have modified the
+		 * catalog, thus reflecting the new catalog contents. The existing
+		 * catalog cache will have already been invalidated after processing
+		 * the invalidations in the transaction that modified catalogs,
+		 * ensuring that a fresh cache is constructed during decoding.
+		 *
 		 * NB: This works correctly even for subtransactions because
 		 * ReorderBufferAssignChild() takes care to transfer the base snapshot
 		 * to the top-level transaction, and while iterating the changequeue
@@ -898,7 +907,7 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, txn->xid))
 			continue;
 
-		elog(DEBUG2, "adding a new snapshot to %u at %X/%X",
+		elog(DEBUG2, "adding a new snapshot and invalidations to %u at %X/%X",
 			 txn->xid, (uint32) (lsn >> 32), (uint32) lsn);
 
 		/*
@@ -908,6 +917,33 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		SnapBuildSnapIncRefcount(builder->snapshot);
 		ReorderBufferAddSnapshot(builder->reorder, txn->xid, lsn,
 								 builder->snapshot);
+
+		/*
+		 * Add invalidation messages to the reorder buffer of in-progress
+		 * transactions except the current committed transaction, for which we
+		 * will execute invalidations at the end.
+		 *
+		 * It is required, otherwise, we will end up using the stale catcache
+		 * contents built by the current transaction even after its decoding,
+		 * which should have been invalidated due to concurrent catalog
+		 * changing transaction.
+		 */
+		if (txn->xid != xid)
+		{
+			uint32 ninvalidations;
+			SharedInvalidationMessage *msgs = NULL;
+
+			ninvalidations = ReorderBufferGetInvalidations(builder->reorder,
+														   xid, &msgs);
+
+			if (ninvalidations > 0)
+			{
+				Assert(msgs != NULL);
+
+				ReorderBufferAddInvalidations(builder->reorder, txn->xid, lsn,
+											  ninvalidations, msgs);
+			}
+		}
 	}
 }
 
@@ -1186,8 +1222,11 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/* refcount of the snapshot builder for the new snapshot */
 		SnapBuildSnapIncRefcount(builder->snapshot);
 
-		/* add a new catalog snapshot to all currently running transactions */
-		SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
+		/*
+		 * Add a new catalog snapshot and invalidation messages to all
+		 * currently running transactions.
+		 */
+		SnapBuildDistributeSnapshotAndInval(builder, lsn, xid);
 	}
 }
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 5347597e92b..545cee891ed 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -463,6 +463,10 @@ TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
 void		ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
 
+uint32		ReorderBufferGetInvalidations(ReorderBuffer *rb,
+										  TransactionId xid,
+										  SharedInvalidationMessage **msgs);
+
 void		StartupReorderBuffer(void);
 
 #endif
-- 
2.43.5

v20_REL_13-0002-Backpatch-introducing-invalidation-messag.txt (text/plain)
From 23bb871f617e7085b5ea869e2a83dcb6d38c91aa Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Mon, 17 Mar 2025 11:25:49 +0900
Subject: [PATCH v20_REL_13 2/2] Backpatch introducing invalidation messages in
 ReorderBufferChangeType

---
 .../replication/logical/reorderbuffer.c       | 72 +++++++++++++++++--
 src/include/replication/reorderbuffer.h       | 16 ++++-
 2 files changed, 80 insertions(+), 8 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index fa9413fa2a0..697b45675a6 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -220,7 +220,7 @@ static void ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 static ReorderBufferChange *ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state);
 static void ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 									   ReorderBufferIterTXNState *state);
-static void ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMessage *msgs);
 
 /*
  * ---------------------------------------
@@ -484,6 +484,11 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			if (change->data.inval.invalidations)
+				pfree(change->data.inval.invalidations);
+			change->data.inval.invalidations = NULL;
+			break;
 	}
 
 	pfree(change);
@@ -1883,7 +1888,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						 * see new catalog contents, so execute all
 						 * invalidations.
 						 */
-						ReorderBufferExecuteInvalidations(rb, txn);
+						ReorderBufferExecuteInvalidations(txn->ninvalidations,
+														  txn->invalidations);
 					}
 
 					break;
@@ -1891,6 +1897,10 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
+				case REORDER_BUFFER_CHANGE_INVALIDATION:
+					/* Execute the invalidation messages locally */
+					ReorderBufferExecuteInvalidations(change->data.inval.ninvalidations,
+													  change->data.inval.invalidations);
 			}
 		}
 
@@ -1921,7 +1931,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		AbortCurrentTransaction();
 
 		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(rb, txn);
+		ReorderBufferExecuteInvalidations(txn->ninvalidations, txn->invalidations);
 
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
@@ -1947,7 +1957,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		AbortCurrentTransaction();
 
 		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(rb, txn);
+		ReorderBufferExecuteInvalidations(txn->ninvalidations, txn->invalidations);
 
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
@@ -2265,6 +2275,7 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
 {
 	ReorderBufferTXN *txn;
 	MemoryContext oldcontext;
+	ReorderBufferChange *change;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
@@ -2302,6 +2313,16 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
 		txn->ninvalidations += nmsgs;
 	}
 
+	change = ReorderBufferGetChange(rb);
+	change->action = REORDER_BUFFER_CHANGE_INVALIDATION;
+	change->data.inval.ninvalidations = nmsgs;
+	change->data.inval.invalidations = (SharedInvalidationMessage *)
+		palloc(sizeof(SharedInvalidationMessage) * nmsgs);
+	memcpy(change->data.inval.invalidations, msgs,
+		   sizeof(SharedInvalidationMessage) * nmsgs);
+
+	ReorderBufferQueueChange(rb, xid, lsn, change);
+
 	MemoryContextSwitchTo(oldcontext);
 }
 
@@ -2310,12 +2331,12 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
  * in the changestream but we don't know which those are.
  */
 static void
-ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMessage *msgs)
 {
 	int			i;
 
-	for (i = 0; i < txn->ninvalidations; i++)
-		LocalExecuteInvalidationMessage(&txn->invalidations[i]);
+	for (i = 0; i < nmsgs; i++)
+		LocalExecuteInvalidationMessage(&msgs[i]);
 }
 
 /*
@@ -2725,6 +2746,24 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			/* ReorderBufferChange contains everything important */
 			break;
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+		{
+			char	   *data;
+			Size		inval_size = sizeof(SharedInvalidationMessage) *
+				change->data.inval.ninvalidations;
+
+			sz += inval_size;
+
+			ReorderBufferSerializeReserve(rb, sz);
+			data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange);
+
+			/* might have been reallocated above */
+			ondisk = (ReorderBufferDiskChange *) rb->outbuf;
+			memcpy(data, change->data.inval.invalidations, inval_size);
+			data += inval_size;
+
+			break;
+		}
 	}
 
 	ondisk->size = sz;
@@ -2833,6 +2872,12 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			/* ReorderBufferChange contains everything important */
 			break;
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+		{
+			sz += sizeof(SharedInvalidationMessage) *
+				change->data.inval.ninvalidations;
+			break;
+		}
 	}
 
 	return sz;
@@ -3120,6 +3165,19 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+		{
+			Size		inval_size = sizeof(SharedInvalidationMessage) *
+				change->data.inval.ninvalidations;
+
+			change->data.inval.invalidations =
+				MemoryContextAlloc(rb->context, inval_size);
+
+			/* read the message */
+			memcpy(change->data.inval.invalidations, data, inval_size);
+
+			break;
+		}
 	}
 
 	dlist_push_tail(&txn->changes, &change->node);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 545cee891ed..dff58a2fd8f 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -63,7 +63,8 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT,
 	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM,
 	REORDER_BUFFER_CHANGE_TRUNCATE,
-	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT
+	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT,
+	REORDER_BUFFER_CHANGE_INVALIDATION
 };
 
 /* forward declaration */
@@ -150,6 +151,13 @@ typedef struct ReorderBufferChange
 			CommandId	cmax;
 			CommandId	combocid;
 		}			tuplecid;
+
+		/* Invalidation. */
+		struct
+		{
+			uint32		ninvalidations; /* Number of messages */
+			SharedInvalidationMessage *invalidations;	/* invalidation message */
+		}			inval;
 	}			data;
 
 	/*
@@ -467,6 +475,12 @@ uint32		ReorderBufferGetInvalidations(ReorderBuffer *rb,
 										  TransactionId xid,
 										  SharedInvalidationMessage **msgs);
 
+void		ReorderBufferAddInvalidationsForDistribute(ReorderBuffer *,
+													   TransactionId,
+													   XLogRecPtr lsn,
+													   Size nmsgs,
+													   SharedInvalidationMessage *msgs);
+
 void		StartupReorderBuffer(void);
 
 #endif
-- 
2.43.5

v20_REL_14-0001-Fix-data-loss-in-logical-replication.txt (text/plain)
From 0c64fcb70e09530de659fbd15d7280bbd44547e7 Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Fri, 23 Aug 2024 14:02:20 +0530
Subject: [PATCH v20_REL_14] Fix data loss in logical replication

Previously, logical replication could lose data if one user modified a
publication to add a table while another user concurrently modified that table
and committed later than the publication modification transaction. The issue
arose during the decoding of transactions modifying the table: if the initial
catalog cache was built using a snapshot taken before the publication DDL
execution, all subsequent changes to that table were decoded with an outdated
catalog cache, which caused them to be filtered from replication. This happened
because invalidation messages were only present in the publication modification
transaction, which was decoded before these subsequent changes.

This issue is not limited to publication DDLs; similar problems can occur with
ALTER TYPE statements executed concurrently with DMLs, leading to incorrect
decoding under outdated type contexts.

To address this, the commit improves logical decoding by ensuring that
invalidation messages from catalog-modifying transactions are distributed to
all concurrent in-progress transactions. This allows the necessary rebuild of
the catalog cache when decoding new changes, similar to handling historic
catalog snapshots (see SnapBuildDistributeNewCatalogSnapshot()).

Following this change, some performance regression is observed, primarily
during frequent execution of publication DDL statements that modify published
tables. This is an expected trade-off due to cache rebuild and distribution
overhead. The regression is minor or nearly nonexistent when DDLs do not affect
published tables or occur infrequently, making this a worthwhile cost to
resolve a longstanding data loss issue.

An alternative approach considered was to take a strong lock on each affected
table during publication modification. However, this would only address issues
related to publication DDLs and require locking every relation in the database
for publications created as FOR ALL TABLES, which is impractical. Thus, this
commit chooses to distribute invalidation messages as outlined above.
---
 contrib/test_decoding/Makefile                |  3 +-
 .../expected/invalidation_distrubution.out    | 20 ++++++
 .../specs/invalidation_distrubution.spec      | 32 +++++++++
 .../replication/logical/reorderbuffer.c       | 23 +++++++
 src/backend/replication/logical/snapbuild.c   | 67 +++++++++++++++----
 src/include/replication/reorderbuffer.h       |  4 ++
 6 files changed, 134 insertions(+), 15 deletions(-)
 create mode 100644 contrib/test_decoding/expected/invalidation_distrubution.out
 create mode 100644 contrib/test_decoding/specs/invalidation_distrubution.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 82ba3f7df11..8b0b8cc3acf 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -8,7 +8,8 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
-	twophase_snapshot catalog_change_snapshot skip_snapshot_restore
+	twophase_snapshot catalog_change_snapshot skip_snapshot_restore \
+	invalidation_distrubution
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/invalidation_distrubution.out b/contrib/test_decoding/expected/invalidation_distrubution.out
new file mode 100644
index 00000000000..c701e290bb9
--- /dev/null
+++ b/contrib/test_decoding/expected/invalidation_distrubution.out
@@ -0,0 +1,20 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1_insert_tbl1 s1_begin s1_insert_tbl1 s2_alter_pub_add_tbl s1_commit s1_insert_tbl1 s2_get_binary_changes
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s1_begin: BEGIN;
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s2_alter_pub_add_tbl: ALTER PUBLICATION pub ADD TABLE tbl1;
+step s1_commit: COMMIT;
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s2_get_binary_changes: SELECT count(data) FROM pg_logical_slot_get_binary_changes('isolation_slot', NULL, NULL, 'proto_version', '2', 'publication_names', 'pub') WHERE get_byte(data, 0) = 73;
+count
+-----
+    1
+(1 row)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/invalidation_distrubution.spec b/contrib/test_decoding/specs/invalidation_distrubution.spec
new file mode 100644
index 00000000000..b8b14e333a1
--- /dev/null
+++ b/contrib/test_decoding/specs/invalidation_distrubution.spec
@@ -0,0 +1,32 @@
+# Test that catalog cache invalidation messages are distributed to ongoing
+# transactions, ensuring they can access the updated catalog content after
+# processing these messages.
+setup
+{
+    SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'pgoutput');
+    CREATE TABLE tbl1(val1 integer, val2 integer);
+    CREATE PUBLICATION pub;
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    DROP PUBLICATION pub;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s1"
+setup { SET synchronous_commit=on; }
+
+step "s1_begin" { BEGIN; }
+step "s1_insert_tbl1" { INSERT INTO tbl1 (val1, val2) VALUES (1, 1); }
+step "s1_commit" { COMMIT; }
+
+session "s2"
+setup { SET synchronous_commit=on; }
+
+step "s2_alter_pub_add_tbl" { ALTER PUBLICATION pub ADD TABLE tbl1; }
+step "s2_get_binary_changes" { SELECT count(data) FROM pg_logical_slot_get_binary_changes('isolation_slot', NULL, NULL, 'proto_version', '2', 'publication_names', 'pub') WHERE get_byte(data, 0) = 73; }
+
+# Expect to get one insert change. LOGICAL_REP_MSG_INSERT = 'I'
+permutation "s1_insert_tbl1" "s1_begin" "s1_insert_tbl1" "s2_alter_pub_add_tbl" "s1_commit" "s1_insert_tbl1" "s2_get_binary_changes"
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 64d9baa7982..a52b51a5e7d 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -5196,3 +5196,26 @@ restart:
 		*cmax = ent->cmax;
 	return true;
 }
+
+/*
+ * Count invalidation messages of specified transaction.
+ *
+ * Returns number of messages, and msgs is set to the pointer of the linked
+ * list for the messages.
+ */
+uint32
+ReorderBufferGetInvalidations(ReorderBuffer *rb, TransactionId xid,
+							  SharedInvalidationMessage **msgs)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	if (txn == NULL)
+		return 0;
+
+	*msgs = txn->invalidations;
+
+	return txn->ninvalidations;
+}
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 9d8700147ca..8abd669c51e 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -290,7 +290,7 @@ static void SnapBuildFreeSnapshot(Snapshot snap);
 
 static void SnapBuildSnapIncRefcount(Snapshot snap);
 
-static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
+static void SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid);
 
 /* xlog reading helper functions for SnapBuildProcessRunningXacts */
 static bool SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *running);
@@ -843,15 +843,15 @@ SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
 }
 
 /*
- * Add a new Snapshot to all transactions we're decoding that currently are
- * in-progress so they can see new catalog contents made by the transaction
- * that just committed. This is necessary because those in-progress
- * transactions will use the new catalog's contents from here on (at the very
- * least everything they do needs to be compatible with newer catalog
- * contents).
+ * Add a new Snapshot and invalidation messages to all transactions we're
+ * decoding that currently are in-progress so they can see new catalog contents
+ * made by the transaction that just committed. This is necessary because those
+ * in-progress transactions will use the new catalog's contents from here on
+ * (at the very least everything they do needs to be compatible with newer
+ * catalog contents).
  */
 static void
-SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
+SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid)
 {
 	dlist_iter	txn_i;
 	ReorderBufferTXN *txn;
@@ -859,7 +859,8 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 	/*
 	 * Iterate through all toplevel transactions. This can include
 	 * subtransactions which we just don't yet know to be that, but that's
-	 * fine, they will just get an unnecessary snapshot queued.
+	 * fine, they will just get an unnecessary snapshot and invalidations
+	 * queued.
 	 */
 	dlist_foreach(txn_i, &builder->reorder->toplevel_by_lsn)
 	{
@@ -872,6 +873,14 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		 * transaction which in turn implies we don't yet need a snapshot at
 		 * all. We'll add a snapshot when the first change gets queued.
 		 *
+		 * Similarly, we don't need to add invalidations to a transaction whose
+		 * base snapshot is not yet set. Once a base snapshot is built, it will
+		 * include the xids of committed transactions that have modified the
+		 * catalog, thus reflecting the new catalog contents. The existing
+		 * catalog cache will have already been invalidated after processing
+		 * the invalidations in the transaction that modified catalogs,
+		 * ensuring that a fresh cache is constructed during decoding.
+		 *
 		 * NB: This works correctly even for subtransactions because
 		 * ReorderBufferAssignChild() takes care to transfer the base snapshot
 		 * to the top-level transaction, and while iterating the changequeue
@@ -881,13 +890,13 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 			continue;
 
 		/*
-		 * We don't need to add snapshot to prepared transactions as they
-		 * should not see the new catalog contents.
+		 * We don't need to add snapshot or invalidations to prepared
+		 * transactions as they should not see the new catalog contents.
 		 */
 		if (rbtxn_prepared(txn) || rbtxn_skip_prepared(txn))
 			continue;
 
-		elog(DEBUG2, "adding a new snapshot to %u at %X/%X",
+		elog(DEBUG2, "adding a new snapshot and invalidations to %u at %X/%X",
 			 txn->xid, LSN_FORMAT_ARGS(lsn));
 
 		/*
@@ -897,6 +906,33 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		SnapBuildSnapIncRefcount(builder->snapshot);
 		ReorderBufferAddSnapshot(builder->reorder, txn->xid, lsn,
 								 builder->snapshot);
+
+		/*
+		 * Add invalidation messages to the reorder buffer of in-progress
+		 * transactions except the current committed transaction, for which we
+		 * will execute invalidations at the end.
+		 *
+		 * It is required, otherwise, we will end up using the stale catcache
+		 * contents built by the current transaction even after its decoding,
+		 * which should have been invalidated due to concurrent catalog
+		 * changing transaction.
+		 */
+		if (txn->xid != xid)
+		{
+			uint32 ninvalidations;
+			SharedInvalidationMessage *msgs = NULL;
+
+			ninvalidations = ReorderBufferGetInvalidations(builder->reorder,
+														   xid, &msgs);
+
+			if (ninvalidations > 0)
+			{
+				Assert(msgs != NULL);
+
+				ReorderBufferAddInvalidations(builder->reorder, txn->xid, lsn,
+											  ninvalidations, msgs);
+			}
+		}
 	}
 }
 
@@ -1175,8 +1211,11 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/* refcount of the snapshot builder for the new snapshot */
 		SnapBuildSnapIncRefcount(builder->snapshot);
 
-		/* add a new catalog snapshot to all currently running transactions */
-		SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
+		/*
+		 * Add a new catalog snapshot and invalidation messages to all
+		 * currently running transactions.
+		 */
+		SnapBuildDistributeSnapshotAndInval(builder, lsn, xid);
 	}
 }
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index ba257d81b51..d399975e8a5 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -676,6 +676,10 @@ TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
 void		ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
 
+uint32		ReorderBufferGetInvalidations(ReorderBuffer *rb,
+										  TransactionId xid,
+										  SharedInvalidationMessage **msgs);
+
 void		StartupReorderBuffer(void);
 
 #endif
-- 
2.43.5

v20_REL_15-0001-Fix-data-loss-in-logical-replication.txt (text/plain)
From 23a6ffa776bcae1125a6aef70703652b2850f908 Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Fri, 23 Aug 2024 14:02:20 +0530
Subject: [PATCH v20_REL_15] Fix data loss in logical replication

Previously, logical replication could lose data if one user modified a
publication to add a table while another user concurrently modified that table
and committed later than the publication modification transaction. The issue
arose during the decoding of transactions modifying the table: if the initial
catalog cache was built using a snapshot taken before the publication DDL
execution, all subsequent changes to that table were decoded with an outdated
catalog cache, which caused them to be filtered from replication. This happened
because invalidation messages were only present in the publication modification
transaction, which was decoded before these subsequent changes.

This issue is not limited to publication DDLs; similar problems can occur with
ALTER TYPE statements executed concurrently with DMLs, leading to incorrect
decoding under outdated type contexts.

To address this, the commit improves logical decoding by ensuring that
invalidation messages from catalog-modifying transactions are distributed to
all concurrent in-progress transactions. This allows the necessary rebuild of
the catalog cache when decoding new changes, similar to handling historic
catalog snapshots (see SnapBuildDistributeNewCatalogSnapshot()).

Following this change, some performance regression is observed, primarily
during frequent execution of publication DDL statements that modify published
tables. This is an expected trade-off due to cache rebuild and distribution
overhead. The regression is minor or nearly nonexistent when DDLs do not affect
published tables or occur infrequently, making this a worthwhile cost to
resolve a longstanding data loss issue.

An alternative approach considered was to take a strong lock on each affected
table during publication modification. However, this would only address issues
related to publication DDLs and require locking every relation in the database
for publications created as FOR ALL TABLES, which is impractical. Thus, this
commit chooses to distribute invalidation messages as outlined above.
---
 contrib/test_decoding/Makefile                |  2 +-
 .../expected/invalidation_distrubution.out    | 20 ++++++
 .../specs/invalidation_distrubution.spec      | 32 +++++++++
 .../replication/logical/reorderbuffer.c       | 23 +++++++
 src/backend/replication/logical/snapbuild.c   | 67 +++++++++++++++----
 src/include/replication/reorderbuffer.h       |  4 ++
 6 files changed, 133 insertions(+), 15 deletions(-)
 create mode 100644 contrib/test_decoding/expected/invalidation_distrubution.out
 create mode 100644 contrib/test_decoding/specs/invalidation_distrubution.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index a4ba1a509ae..eef70770674 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -9,7 +9,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
 	twophase_snapshot slot_creation_error catalog_change_snapshot \
-	skip_snapshot_restore
+	skip_snapshot_restore invalidation_distrubution
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/invalidation_distrubution.out b/contrib/test_decoding/expected/invalidation_distrubution.out
new file mode 100644
index 00000000000..24190ebe570
--- /dev/null
+++ b/contrib/test_decoding/expected/invalidation_distrubution.out
@@ -0,0 +1,20 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1_insert_tbl1 s1_begin s1_insert_tbl1 s2_alter_pub_add_tbl s1_commit s1_insert_tbl1 s2_get_binary_changes
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s1_begin: BEGIN;
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s2_alter_pub_add_tbl: ALTER PUBLICATION pub ADD TABLE tbl1;
+step s1_commit: COMMIT;
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s2_get_binary_changes: SELECT count(data) FROM pg_logical_slot_get_binary_changes('isolation_slot', NULL, NULL, 'proto_version', '3', 'publication_names', 'pub') WHERE get_byte(data, 0) = 73;
+count
+-----
+    1
+(1 row)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/invalidation_distrubution.spec b/contrib/test_decoding/specs/invalidation_distrubution.spec
new file mode 100644
index 00000000000..f63aba3ce96
--- /dev/null
+++ b/contrib/test_decoding/specs/invalidation_distrubution.spec
@@ -0,0 +1,32 @@
+# Test that catalog cache invalidation messages are distributed to ongoing
+# transactions, ensuring they can access the updated catalog content after
+# processing these messages.
+setup
+{
+    SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'pgoutput');
+    CREATE TABLE tbl1(val1 integer, val2 integer);
+    CREATE PUBLICATION pub;
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    DROP PUBLICATION pub;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s1"
+setup { SET synchronous_commit=on; }
+
+step "s1_begin" { BEGIN; }
+step "s1_insert_tbl1" { INSERT INTO tbl1 (val1, val2) VALUES (1, 1); }
+step "s1_commit" { COMMIT; }
+
+session "s2"
+setup { SET synchronous_commit=on; }
+
+step "s2_alter_pub_add_tbl" { ALTER PUBLICATION pub ADD TABLE tbl1; }
+step "s2_get_binary_changes" { SELECT count(data) FROM pg_logical_slot_get_binary_changes('isolation_slot', NULL, NULL, 'proto_version', '3', 'publication_names', 'pub') WHERE get_byte(data, 0) = 73; }
+
+# Expect to get one insert change. LOGICAL_REP_MSG_INSERT = 'I'
+permutation "s1_insert_tbl1" "s1_begin" "s1_insert_tbl1" "s2_alter_pub_add_tbl" "s1_commit" "s1_insert_tbl1" "s2_get_binary_changes"
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index d5cde20a1c9..835180e12f6 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -5200,3 +5200,26 @@ restart:
 		*cmax = ent->cmax;
 	return true;
 }
+
+/*
+ * Count invalidation messages of specified transaction.
+ *
+ * Returns number of messages, and msgs is set to the pointer of the linked
+ * list for the messages.
+ */
+uint32
+ReorderBufferGetInvalidations(ReorderBuffer *rb, TransactionId xid,
+							  SharedInvalidationMessage **msgs)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	if (txn == NULL)
+		return 0;
+
+	*msgs = txn->invalidations;
+
+	return txn->ninvalidations;
+}
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index cc1f2a9f154..0b303f9a235 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -290,7 +290,7 @@ static void SnapBuildFreeSnapshot(Snapshot snap);
 
 static void SnapBuildSnapIncRefcount(Snapshot snap);
 
-static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
+static void SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid);
 
 /* xlog reading helper functions for SnapBuildProcessRunningXacts */
 static bool SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *running);
@@ -852,15 +852,15 @@ SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
 }
 
 /*
- * Add a new Snapshot to all transactions we're decoding that currently are
- * in-progress so they can see new catalog contents made by the transaction
- * that just committed. This is necessary because those in-progress
- * transactions will use the new catalog's contents from here on (at the very
- * least everything they do needs to be compatible with newer catalog
- * contents).
+ * Add a new Snapshot and invalidation messages to all transactions we're
+ * decoding that currently are in-progress so they can see new catalog contents
+ * made by the transaction that just committed. This is necessary because those
+ * in-progress transactions will use the new catalog's contents from here on
+ * (at the very least everything they do needs to be compatible with newer
+ * catalog contents).
  */
 static void
-SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
+SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid)
 {
 	dlist_iter	txn_i;
 	ReorderBufferTXN *txn;
@@ -868,7 +868,8 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 	/*
 	 * Iterate through all toplevel transactions. This can include
 	 * subtransactions which we just don't yet know to be that, but that's
-	 * fine, they will just get an unnecessary snapshot queued.
+	 * fine, they will just get an unnecessary snapshot and invalidations
+	 * queued.
 	 */
 	dlist_foreach(txn_i, &builder->reorder->toplevel_by_lsn)
 	{
@@ -881,6 +882,14 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		 * transaction which in turn implies we don't yet need a snapshot at
 		 * all. We'll add a snapshot when the first change gets queued.
 		 *
+		 * Similarly, we don't need to add invalidations to a transaction whose
+		 * base snapshot is not yet set. Once a base snapshot is built, it will
+		 * include the xids of committed transactions that have modified the
+		 * catalog, thus reflecting the new catalog contents. The existing
+		 * catalog cache will have already been invalidated after processing
+		 * the invalidations in the transaction that modified catalogs,
+		 * ensuring that a fresh cache is constructed during decoding.
+		 *
 		 * NB: This works correctly even for subtransactions because
 		 * ReorderBufferAssignChild() takes care to transfer the base snapshot
 		 * to the top-level transaction, and while iterating the changequeue
@@ -890,13 +899,13 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 			continue;
 
 		/*
-		 * We don't need to add snapshot to prepared transactions as they
-		 * should not see the new catalog contents.
+		 * We don't need to add snapshot or invalidations to prepared
+		 * transactions as they should not see the new catalog contents.
 		 */
 		if (rbtxn_prepared(txn) || rbtxn_skip_prepared(txn))
 			continue;
 
-		elog(DEBUG2, "adding a new snapshot to %u at %X/%X",
+		elog(DEBUG2, "adding a new snapshot and invalidations to %u at %X/%X",
 			 txn->xid, LSN_FORMAT_ARGS(lsn));
 
 		/*
@@ -906,6 +915,33 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		SnapBuildSnapIncRefcount(builder->snapshot);
 		ReorderBufferAddSnapshot(builder->reorder, txn->xid, lsn,
 								 builder->snapshot);
+
+		/*
+		 * Add invalidation messages to the reorder buffer of in-progress
+		 * transactions except the current committed transaction, for which we
+		 * will execute invalidations at the end.
+		 *
+		 * It is required, otherwise, we will end up using the stale catcache
+		 * contents built by the current transaction even after its decoding,
+		 * which should have been invalidated due to concurrent catalog
+		 * changing transaction.
+		 */
+		if (txn->xid != xid)
+		{
+			uint32 ninvalidations;
+			SharedInvalidationMessage *msgs = NULL;
+
+			ninvalidations = ReorderBufferGetInvalidations(builder->reorder,
+														   xid, &msgs);
+
+			if (ninvalidations > 0)
+			{
+				Assert(msgs != NULL);
+
+				ReorderBufferAddInvalidations(builder->reorder, txn->xid, lsn,
+											  ninvalidations, msgs);
+			}
+		}
 	}
 }
 
@@ -1184,8 +1220,11 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/* refcount of the snapshot builder for the new snapshot */
 		SnapBuildSnapIncRefcount(builder->snapshot);
 
-		/* add a new catalog snapshot to all currently running transactions */
-		SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
+		/*
+		 * Add a new catalog snapshot and invalidation messages to all
+		 * currently running transactions.
+		 */
+		SnapBuildDistributeSnapshotAndInval(builder, lsn, xid);
 	}
 }
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 4a01f877e5d..402bb7a2728 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -680,6 +680,10 @@ extern TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
 extern void ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
 
+extern uint32 ReorderBufferGetInvalidations(ReorderBuffer *rb,
+											TransactionId xid,
+											SharedInvalidationMessage **msgs);
+
 extern void StartupReorderBuffer(void);
 
 #endif
-- 
2.43.5

v20_REL_16-0001-Fix-data-loss-in-logical-replication.txt (text/plain)
From febbb2e89f4c7ecc74ae9b7bc6194be19f0376ca Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Fri, 23 Aug 2024 14:02:20 +0530
Subject: [PATCH v20_REL_16] Fix data loss in logical replication

Previously, logical replication could lose data if one user modified a
publication to add a table while another user concurrently modified that table
and committed later than the publication modification transaction. The issue
arose during the decoding of transactions modifying the table: if the initial
catalog cache was built using a snapshot taken before the publication DDL
execution, all subsequent changes to that table were decoded with an outdated
catalog cache, which caused them to be filtered from replication. This happened
because invalidation messages were only present in the publication modification
transaction, which was decoded before these subsequent changes.

This issue is not limited to publication DDLs; similar problems can occur with
ALTER TYPE statements executed concurrently with DMLs, leading to incorrect
decoding under outdated type contexts.

To address this, the commit improves logical decoding by ensuring that
invalidation messages from catalog-modifying transactions are distributed to
all concurrent in-progress transactions. This allows the necessary rebuild of
the catalog cache when decoding new changes, similar to handling historic
catalog snapshots (see SnapBuildDistributeNewCatalogSnapshot()).

Following this change, some performance regression is observed, primarily
during frequent execution of publication DDL statements that modify published
tables. This is an expected trade-off due to cache rebuild and distribution
overhead. The regression is minor or nearly nonexistent when DDLs do not affect
published tables or occur infrequently, making this a worthwhile cost to
resolve a longstanding data loss issue.

An alternative approach considered was to take a strong lock on each affected
table during publication modification. However, this would only address issues
related to publication DDLs and require locking every relation in the database
for publications created as FOR ALL TABLES, which is impractical. Thus, this
commit chooses to distribute invalidation messages as outlined above.
---
 contrib/test_decoding/Makefile                |  2 +-
 .../expected/invalidation_distrubution.out    | 20 ++++++
 contrib/test_decoding/meson.build             |  1 +
 .../specs/invalidation_distrubution.spec      | 32 +++++++++
 .../replication/logical/reorderbuffer.c       | 23 +++++++
 src/backend/replication/logical/snapbuild.c   | 67 +++++++++++++++----
 src/include/replication/reorderbuffer.h       |  4 ++
 7 files changed, 134 insertions(+), 15 deletions(-)
 create mode 100644 contrib/test_decoding/expected/invalidation_distrubution.out
 create mode 100644 contrib/test_decoding/specs/invalidation_distrubution.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index a4ba1a509ae..eef70770674 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -9,7 +9,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
 	twophase_snapshot slot_creation_error catalog_change_snapshot \
-	skip_snapshot_restore
+	skip_snapshot_restore invalidation_distrubution
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/invalidation_distrubution.out b/contrib/test_decoding/expected/invalidation_distrubution.out
new file mode 100644
index 00000000000..ad0a944cbf3
--- /dev/null
+++ b/contrib/test_decoding/expected/invalidation_distrubution.out
@@ -0,0 +1,20 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1_insert_tbl1 s1_begin s1_insert_tbl1 s2_alter_pub_add_tbl s1_commit s1_insert_tbl1 s2_get_binary_changes
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s1_begin: BEGIN;
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s2_alter_pub_add_tbl: ALTER PUBLICATION pub ADD TABLE tbl1;
+step s1_commit: COMMIT;
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s2_get_binary_changes: SELECT count(data) FROM pg_logical_slot_get_binary_changes('isolation_slot', NULL, NULL, 'proto_version', '4', 'publication_names', 'pub') WHERE get_byte(data, 0) = 73;
+count
+-----
+    1
+(1 row)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/meson.build b/contrib/test_decoding/meson.build
index 2dd3ede41bf..273d26643c0 100644
--- a/contrib/test_decoding/meson.build
+++ b/contrib/test_decoding/meson.build
@@ -63,6 +63,7 @@ tests += {
       'twophase_snapshot',
       'slot_creation_error',
       'skip_snapshot_restore',
+      'invalidation_distrubution',
     ],
     'regress_args': [
       '--temp-config', files('logical.conf'),
diff --git a/contrib/test_decoding/specs/invalidation_distrubution.spec b/contrib/test_decoding/specs/invalidation_distrubution.spec
new file mode 100644
index 00000000000..decbed627e3
--- /dev/null
+++ b/contrib/test_decoding/specs/invalidation_distrubution.spec
@@ -0,0 +1,32 @@
+# Test that catalog cache invalidation messages are distributed to ongoing
+# transactions, ensuring they can access the updated catalog content after
+# processing these messages.
+setup
+{
+    SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'pgoutput');
+    CREATE TABLE tbl1(val1 integer, val2 integer);
+    CREATE PUBLICATION pub;
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    DROP PUBLICATION pub;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s1"
+setup { SET synchronous_commit=on; }
+
+step "s1_begin" { BEGIN; }
+step "s1_insert_tbl1" { INSERT INTO tbl1 (val1, val2) VALUES (1, 1); }
+step "s1_commit" { COMMIT; }
+
+session "s2"
+setup { SET synchronous_commit=on; }
+
+step "s2_alter_pub_add_tbl" { ALTER PUBLICATION pub ADD TABLE tbl1; }
+step "s2_get_binary_changes" { SELECT count(data) FROM pg_logical_slot_get_binary_changes('isolation_slot', NULL, NULL, 'proto_version', '4', 'publication_names', 'pub') WHERE get_byte(data, 0) = 73; }
+
+# Expect to get one insert change. LOGICAL_REP_MSG_INSERT = 'I'
+permutation "s1_insert_tbl1" "s1_begin" "s1_insert_tbl1" "s2_alter_pub_add_tbl" "s1_commit" "s1_insert_tbl1" "s2_get_binary_changes"
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 930549948af..fa04e829cc9 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -5283,3 +5283,26 @@ restart:
 		*cmax = ent->cmax;
 	return true;
 }
+
+/*
+ * Count invalidation messages of specified transaction.
+ *
+ * Returns number of messages, and msgs is set to the pointer of the linked
+ * list for the messages.
+ */
+uint32
+ReorderBufferGetInvalidations(ReorderBuffer *rb, TransactionId xid,
+							  SharedInvalidationMessage **msgs)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	if (txn == NULL)
+		return 0;
+
+	*msgs = txn->invalidations;
+
+	return txn->ninvalidations;
+}
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 3ed2f79dd06..d48ebb8337b 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -288,7 +288,7 @@ static void SnapBuildFreeSnapshot(Snapshot snap);
 
 static void SnapBuildSnapIncRefcount(Snapshot snap);
 
-static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
+static void SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid);
 
 static inline bool SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid,
 												 uint32 xinfo);
@@ -845,15 +845,15 @@ SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
 }
 
 /*
- * Add a new Snapshot to all transactions we're decoding that currently are
- * in-progress so they can see new catalog contents made by the transaction
- * that just committed. This is necessary because those in-progress
- * transactions will use the new catalog's contents from here on (at the very
- * least everything they do needs to be compatible with newer catalog
- * contents).
+ * Add a new Snapshot and invalidation messages to all transactions we're
+ * decoding that currently are in-progress so they can see new catalog contents
+ * made by the transaction that just committed. This is necessary because those
+ * in-progress transactions will use the new catalog's contents from here on
+ * (at the very least everything they do needs to be compatible with newer
+ * catalog contents).
  */
 static void
-SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
+SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid)
 {
 	dlist_iter	txn_i;
 	ReorderBufferTXN *txn;
@@ -861,7 +861,8 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 	/*
 	 * Iterate through all toplevel transactions. This can include
 	 * subtransactions which we just don't yet know to be that, but that's
-	 * fine, they will just get an unnecessary snapshot queued.
+	 * fine, they will just get an unnecessary snapshot and invalidations
+	 * queued.
 	 */
 	dlist_foreach(txn_i, &builder->reorder->toplevel_by_lsn)
 	{
@@ -874,6 +875,14 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		 * transaction which in turn implies we don't yet need a snapshot at
 		 * all. We'll add a snapshot when the first change gets queued.
 		 *
+		 * Similarly, we don't need to add invalidations to a transaction whose
+		 * base snapshot is not yet set. Once a base snapshot is built, it will
+		 * include the xids of committed transactions that have modified the
+		 * catalog, thus reflecting the new catalog contents. The existing
+		 * catalog cache will have already been invalidated after processing
+		 * the invalidations in the transaction that modified catalogs,
+		 * ensuring that a fresh cache is constructed during decoding.
+		 *
 		 * NB: This works correctly even for subtransactions because
 		 * ReorderBufferAssignChild() takes care to transfer the base snapshot
 		 * to the top-level transaction, and while iterating the changequeue
@@ -883,13 +892,13 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 			continue;
 
 		/*
-		 * We don't need to add snapshot to prepared transactions as they
-		 * should not see the new catalog contents.
+		 * We don't need to add snapshot or invalidations to prepared
+		 * transactions as they should not see the new catalog contents.
 		 */
 		if (rbtxn_prepared(txn) || rbtxn_skip_prepared(txn))
 			continue;
 
-		elog(DEBUG2, "adding a new snapshot to %u at %X/%X",
+		elog(DEBUG2, "adding a new snapshot and invalidations to %u at %X/%X",
 			 txn->xid, LSN_FORMAT_ARGS(lsn));
 
 		/*
@@ -899,6 +908,33 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		SnapBuildSnapIncRefcount(builder->snapshot);
 		ReorderBufferAddSnapshot(builder->reorder, txn->xid, lsn,
 								 builder->snapshot);
+
+		/*
+		 * Add invalidation messages to the reorder buffer of in-progress
+		 * transactions except the current committed transaction, for which we
+		 * will execute invalidations at the end.
+		 *
+		 * It is required, otherwise, we will end up using the stale catcache
+		 * contents built by the current transaction even after its decoding,
+		 * which should have been invalidated due to concurrent catalog
+		 * changing transaction.
+		 */
+		if (txn->xid != xid)
+		{
+			uint32 ninvalidations;
+			SharedInvalidationMessage *msgs = NULL;
+
+			ninvalidations = ReorderBufferGetInvalidations(builder->reorder,
+														   xid, &msgs);
+
+			if (ninvalidations > 0)
+			{
+				Assert(msgs != NULL);
+
+				ReorderBufferAddInvalidations(builder->reorder, txn->xid, lsn,
+											  ninvalidations, msgs);
+			}
+		}
 	}
 }
 
@@ -1170,8 +1206,11 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/* refcount of the snapshot builder for the new snapshot */
 		SnapBuildSnapIncRefcount(builder->snapshot);
 
-		/* add a new catalog snapshot to all currently running transactions */
-		SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
+		/*
+		 * Add a new catalog snapshot and invalidations messages to all
+		 * currently running transactions.
+		 */
+		SnapBuildDistributeSnapshotAndInval(builder, lsn, xid);
 	}
 }
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 3cb03168de2..8a32bbe28df 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -748,6 +748,10 @@ extern TransactionId *ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb);
 
 extern void ReorderBufferSetRestartPoint(ReorderBuffer *rb, XLogRecPtr ptr);
 
+extern uint32 ReorderBufferGetInvalidations(ReorderBuffer *rb,
+											TransactionId xid,
+											SharedInvalidationMessage **msgs);
+
 extern void StartupReorderBuffer(void);
 
 #endif
-- 
2.43.5

Attachment: v20_REL_17-0001-Fix-data-loss-in-logical-replication.txt (text/plain)
From baf14b0c37827f301ad93e3903fee9048be97b52 Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Fri, 23 Aug 2024 14:02:20 +0530
Subject: [PATCH v20_REL_17] Fix data loss in logical replication

Previously, logical replication could lose data if one user modified a
publication to add a table while another user concurrently modified that table
and committed later than the publication modification transaction. The issue
arose during the decoding of transactions modifying the table: if the initial
catalog cache was built using a snapshot taken before the publication DDL
execution, all subsequent changes to that table were decoded with outdated
catalog cache, which caused them to be filtered from replication. This happened
because invalidation messages were only present in the publication modification
transaction, which was decoded before these subsequent changes.

This issue is not limited to publication DDLs; similar problems can occur with
ALTER TYPE statements executed concurrently with DMLs, leading to incorrect
decoding under outdated type contexts.

To address this, the commit improves logical decoding by ensuring that
invalidation messages from catalog-modifying transactions are distributed to
all concurrent in-progress transactions. This allows the necessary rebuild of
the catalog cache when decoding new changes, similar to handling historic
catalog snapshots (see SnapBuildDistributeNewCatalogSnapshot()).

Following this change, some performance regression is observed, primarily
during frequent execution of publication DDL statements that modify published
tables. This is an expected trade-off due to cache rebuild and distribution
overhead. The regression is minor or nearly nonexistent when DDLs do not affect
published tables or occur infrequently, making this a worthwhile cost to
resolve a longstanding data loss issue.

An alternative approach considered was to take a strong lock on each affected
table during publication modification. However, this would only address issues
related to publication DDLs and require locking every relation in the database
for publications created as FOR ALL TABLES, which is impractical. Thus, this
commit chooses to distribute invalidation messages as outlined above.
---
 contrib/test_decoding/Makefile                |  2 +-
 .../expected/invalidation_distrubution.out    | 20 ++++++
 contrib/test_decoding/meson.build             |  1 +
 .../specs/invalidation_distrubution.spec      | 32 +++++++++
 .../replication/logical/reorderbuffer.c       | 23 +++++++
 src/backend/replication/logical/snapbuild.c   | 67 +++++++++++++++----
 src/include/replication/reorderbuffer.h       |  4 ++
 7 files changed, 134 insertions(+), 15 deletions(-)
 create mode 100644 contrib/test_decoding/expected/invalidation_distrubution.out
 create mode 100644 contrib/test_decoding/specs/invalidation_distrubution.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index a4ba1a509ae..eef70770674 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -9,7 +9,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
 	twophase_snapshot slot_creation_error catalog_change_snapshot \
-	skip_snapshot_restore
+	skip_snapshot_restore invalidation_distrubution
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/invalidation_distrubution.out b/contrib/test_decoding/expected/invalidation_distrubution.out
new file mode 100644
index 00000000000..ad0a944cbf3
--- /dev/null
+++ b/contrib/test_decoding/expected/invalidation_distrubution.out
@@ -0,0 +1,20 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1_insert_tbl1 s1_begin s1_insert_tbl1 s2_alter_pub_add_tbl s1_commit s1_insert_tbl1 s2_get_binary_changes
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s1_begin: BEGIN;
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s2_alter_pub_add_tbl: ALTER PUBLICATION pub ADD TABLE tbl1;
+step s1_commit: COMMIT;
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s2_get_binary_changes: SELECT count(data) FROM pg_logical_slot_get_binary_changes('isolation_slot', NULL, NULL, 'proto_version', '4', 'publication_names', 'pub') WHERE get_byte(data, 0) = 73;
+count
+-----
+    1
+(1 row)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/meson.build b/contrib/test_decoding/meson.build
index f643dc81a2c..b31c433681d 100644
--- a/contrib/test_decoding/meson.build
+++ b/contrib/test_decoding/meson.build
@@ -63,6 +63,7 @@ tests += {
       'twophase_snapshot',
       'slot_creation_error',
       'skip_snapshot_restore',
+      'invalidation_distrubution',
     ],
     'regress_args': [
       '--temp-config', files('logical.conf'),
diff --git a/contrib/test_decoding/specs/invalidation_distrubution.spec b/contrib/test_decoding/specs/invalidation_distrubution.spec
new file mode 100644
index 00000000000..decbed627e3
--- /dev/null
+++ b/contrib/test_decoding/specs/invalidation_distrubution.spec
@@ -0,0 +1,32 @@
+# Test that catalog cache invalidation messages are distributed to ongoing
+# transactions, ensuring they can access the updated catalog content after
+# processing these messages.
+setup
+{
+    SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'pgoutput');
+    CREATE TABLE tbl1(val1 integer, val2 integer);
+    CREATE PUBLICATION pub;
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    DROP PUBLICATION pub;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s1"
+setup { SET synchronous_commit=on; }
+
+step "s1_begin" { BEGIN; }
+step "s1_insert_tbl1" { INSERT INTO tbl1 (val1, val2) VALUES (1, 1); }
+step "s1_commit" { COMMIT; }
+
+session "s2"
+setup { SET synchronous_commit=on; }
+
+step "s2_alter_pub_add_tbl" { ALTER PUBLICATION pub ADD TABLE tbl1; }
+step "s2_get_binary_changes" { SELECT count(data) FROM pg_logical_slot_get_binary_changes('isolation_slot', NULL, NULL, 'proto_version', '4', 'publication_names', 'pub') WHERE get_byte(data, 0) = 73; }
+
+# Expect to get one insert change. LOGICAL_REP_MSG_INSERT = 'I'
+permutation "s1_insert_tbl1" "s1_begin" "s1_insert_tbl1" "s2_alter_pub_add_tbl" "s1_commit" "s1_insert_tbl1" "s2_get_binary_changes"
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 9c742e96eb3..03eb005c39d 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -5337,3 +5337,26 @@ restart:
 		*cmax = ent->cmax;
 	return true;
 }
+
+/*
+ * Count invalidation messages of specified transaction.
+ *
+ * Returns number of messages, and msgs is set to the pointer of the linked
+ * list for the messages.
+ */
+uint32
+ReorderBufferGetInvalidations(ReorderBuffer *rb, TransactionId xid,
+							  SharedInvalidationMessage **msgs)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	if (txn == NULL)
+		return 0;
+
+	*msgs = txn->invalidations;
+
+	return txn->ninvalidations;
+}
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index ae676145e60..110e0b0a044 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -300,7 +300,7 @@ static void SnapBuildFreeSnapshot(Snapshot snap);
 
 static void SnapBuildSnapIncRefcount(Snapshot snap);
 
-static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
+static void SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid);
 
 static inline bool SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid,
 												 uint32 xinfo);
@@ -859,15 +859,15 @@ SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
 }
 
 /*
- * Add a new Snapshot to all transactions we're decoding that currently are
- * in-progress so they can see new catalog contents made by the transaction
- * that just committed. This is necessary because those in-progress
- * transactions will use the new catalog's contents from here on (at the very
- * least everything they do needs to be compatible with newer catalog
- * contents).
+ * Add a new Snapshot and invalidation messages to all transactions we're
+ * decoding that currently are in-progress so they can see new catalog contents
+ * made by the transaction that just committed. This is necessary because those
+ * in-progress transactions will use the new catalog's contents from here on
+ * (at the very least everything they do needs to be compatible with newer
+ * catalog contents).
  */
 static void
-SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
+SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid)
 {
 	dlist_iter	txn_i;
 	ReorderBufferTXN *txn;
@@ -875,7 +875,8 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 	/*
 	 * Iterate through all toplevel transactions. This can include
 	 * subtransactions which we just don't yet know to be that, but that's
-	 * fine, they will just get an unnecessary snapshot queued.
+	 * fine, they will just get an unnecessary snapshot and invalidations
+	 * queued.
 	 */
 	dlist_foreach(txn_i, &builder->reorder->toplevel_by_lsn)
 	{
@@ -888,6 +889,14 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		 * transaction which in turn implies we don't yet need a snapshot at
 		 * all. We'll add a snapshot when the first change gets queued.
 		 *
+		 * Similarly, we don't need to add invalidations to a transaction whose
+		 * base snapshot is not yet set. Once a base snapshot is built, it will
+		 * include the xids of committed transactions that have modified the
+		 * catalog, thus reflecting the new catalog contents. The existing
+		 * catalog cache will have already been invalidated after processing
+		 * the invalidations in the transaction that modified catalogs,
+		 * ensuring that a fresh cache is constructed during decoding.
+		 *
 		 * NB: This works correctly even for subtransactions because
 		 * ReorderBufferAssignChild() takes care to transfer the base snapshot
 		 * to the top-level transaction, and while iterating the changequeue
@@ -897,13 +906,13 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 			continue;
 
 		/*
-		 * We don't need to add snapshot to prepared transactions as they
-		 * should not see the new catalog contents.
+		 * We don't need to add snapshot or invalidations to prepared
+		 * transactions as they should not see the new catalog contents.
 		 */
 		if (rbtxn_prepared(txn) || rbtxn_skip_prepared(txn))
 			continue;
 
-		elog(DEBUG2, "adding a new snapshot to %u at %X/%X",
+		elog(DEBUG2, "adding a new snapshot and invalidations to %u at %X/%X",
 			 txn->xid, LSN_FORMAT_ARGS(lsn));
 
 		/*
@@ -913,6 +922,33 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		SnapBuildSnapIncRefcount(builder->snapshot);
 		ReorderBufferAddSnapshot(builder->reorder, txn->xid, lsn,
 								 builder->snapshot);
+
+		/*
+		 * Add invalidation messages to the reorder buffer of in-progress
+		 * transactions except the current committed transaction, for which we
+		 * will execute invalidations at the end.
+		 *
+		 * It is required, otherwise, we will end up using the stale catcache
+		 * contents built by the current transaction even after its decoding,
+		 * which should have been invalidated due to concurrent catalog
+		 * changing transaction.
+		 */
+		if (txn->xid != xid)
+		{
+			uint32 ninvalidations;
+			SharedInvalidationMessage *msgs = NULL;
+
+			ninvalidations = ReorderBufferGetInvalidations(builder->reorder,
+														   xid, &msgs);
+
+			if (ninvalidations > 0)
+			{
+				Assert(msgs != NULL);
+
+				ReorderBufferAddInvalidations(builder->reorder, txn->xid, lsn,
+											  ninvalidations, msgs);
+			}
+		}
 	}
 }
 
@@ -1184,8 +1220,11 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/* refcount of the snapshot builder for the new snapshot */
 		SnapBuildSnapIncRefcount(builder->snapshot);
 
-		/* add a new catalog snapshot to all currently running transactions */
-		SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
+		/*
+		 * Add a new catalog snapshot and invalidations messages to all
+		 * currently running transactions.
+		 */
+		SnapBuildDistributeSnapshotAndInval(builder, lsn, xid);
 	}
 }
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 7de50462dcf..4c56f219fd8 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -729,6 +729,10 @@ extern TransactionId *ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb);
 
 extern void ReorderBufferSetRestartPoint(ReorderBuffer *rb, XLogRecPtr ptr);
 
+extern uint32 ReorderBufferGetInvalidations(ReorderBuffer *rb,
+											TransactionId xid,
+											SharedInvalidationMessage **msgs);
+
 extern void StartupReorderBuffer(void);
 
 #endif
-- 
2.43.5

Attachment: v20-0001-Fix-data-loss-in-logical-replication.patch (application/octet-stream)
From f4ec2e761023a92495e33603f9a7a330a64b6f27 Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Fri, 23 Aug 2024 14:02:20 +0530
Subject: [PATCH v20] Fix data loss in logical replication

Previously, logical replication could lose data if one user modified a
publication to add a table while another user concurrently modified that table
and committed later than the publication modification transaction. The issue
arose during the decoding of transactions modifying the table: if the initial
catalog cache was built using a snapshot taken before the publication DDL
execution, all subsequent changes to that table were decoded with outdated
catalog cache, which caused them to be filtered from replication. This happened
because invalidation messages were only present in the publication modification
transaction, which was decoded before these subsequent changes.

This issue is not limited to publication DDLs; similar problems can occur with
ALTER TYPE statements executed concurrently with DMLs, leading to incorrect
decoding under outdated type contexts.

To address this, the commit improves logical decoding by ensuring that
invalidation messages from catalog-modifying transactions are distributed to
all concurrent in-progress transactions. This allows the necessary rebuild of
the catalog cache when decoding new changes, similar to handling historic
catalog snapshots (see SnapBuildDistributeNewCatalogSnapshot()).

Following this change, some performance regression is observed, primarily
during frequent execution of publication DDL statements that modify published
tables. This is an expected trade-off due to cache rebuild and distribution
overhead. The regression is minor or nearly nonexistent when DDLs do not affect
published tables or occur infrequently, making this a worthwhile cost to
resolve a longstanding data loss issue.

An alternative approach considered was to take a strong lock on each affected
table during publication modification. However, this would only address issues
related to publication DDLs and require locking every relation in the database
for publications created as FOR ALL TABLES, which is impractical. Thus, this
commit chooses to distribute invalidation messages as outlined above.
---
 contrib/test_decoding/Makefile                |  2 +-
 .../expected/invalidation_distrubution.out    | 20 ++++++
 contrib/test_decoding/meson.build             |  1 +
 .../specs/invalidation_distrubution.spec      | 32 +++++++++
 .../replication/logical/reorderbuffer.c       | 23 +++++++
 src/backend/replication/logical/snapbuild.c   | 67 +++++++++++++++----
 src/include/replication/reorderbuffer.h       |  4 ++
 7 files changed, 134 insertions(+), 15 deletions(-)
 create mode 100644 contrib/test_decoding/expected/invalidation_distrubution.out
 create mode 100644 contrib/test_decoding/specs/invalidation_distrubution.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index a4ba1a509ae..eef70770674 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -9,7 +9,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
 	twophase_snapshot slot_creation_error catalog_change_snapshot \
-	skip_snapshot_restore
+	skip_snapshot_restore invalidation_distrubution
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/invalidation_distrubution.out b/contrib/test_decoding/expected/invalidation_distrubution.out
new file mode 100644
index 00000000000..ad0a944cbf3
--- /dev/null
+++ b/contrib/test_decoding/expected/invalidation_distrubution.out
@@ -0,0 +1,20 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1_insert_tbl1 s1_begin s1_insert_tbl1 s2_alter_pub_add_tbl s1_commit s1_insert_tbl1 s2_get_binary_changes
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s1_begin: BEGIN;
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s2_alter_pub_add_tbl: ALTER PUBLICATION pub ADD TABLE tbl1;
+step s1_commit: COMMIT;
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s2_get_binary_changes: SELECT count(data) FROM pg_logical_slot_get_binary_changes('isolation_slot', NULL, NULL, 'proto_version', '4', 'publication_names', 'pub') WHERE get_byte(data, 0) = 73;
+count
+-----
+    1
+(1 row)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/meson.build b/contrib/test_decoding/meson.build
index 54d65d3f30f..d5f03fd5e9b 100644
--- a/contrib/test_decoding/meson.build
+++ b/contrib/test_decoding/meson.build
@@ -63,6 +63,7 @@ tests += {
       'twophase_snapshot',
       'slot_creation_error',
       'skip_snapshot_restore',
+      'invalidation_distrubution',
     ],
     'regress_args': [
       '--temp-config', files('logical.conf'),
diff --git a/contrib/test_decoding/specs/invalidation_distrubution.spec b/contrib/test_decoding/specs/invalidation_distrubution.spec
new file mode 100644
index 00000000000..decbed627e3
--- /dev/null
+++ b/contrib/test_decoding/specs/invalidation_distrubution.spec
@@ -0,0 +1,32 @@
+# Test that catalog cache invalidation messages are distributed to ongoing
+# transactions, ensuring they can access the updated catalog content after
+# processing these messages.
+setup
+{
+    SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'pgoutput');
+    CREATE TABLE tbl1(val1 integer, val2 integer);
+    CREATE PUBLICATION pub;
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    DROP PUBLICATION pub;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s1"
+setup { SET synchronous_commit=on; }
+
+step "s1_begin" { BEGIN; }
+step "s1_insert_tbl1" { INSERT INTO tbl1 (val1, val2) VALUES (1, 1); }
+step "s1_commit" { COMMIT; }
+
+session "s2"
+setup { SET synchronous_commit=on; }
+
+step "s2_alter_pub_add_tbl" { ALTER PUBLICATION pub ADD TABLE tbl1; }
+step "s2_get_binary_changes" { SELECT count(data) FROM pg_logical_slot_get_binary_changes('isolation_slot', NULL, NULL, 'proto_version', '4', 'publication_names', 'pub') WHERE get_byte(data, 0) = 73; }
+
+# Expect to get one insert change. LOGICAL_REP_MSG_INSERT = 'I'
+permutation "s1_insert_tbl1" "s1_begin" "s1_insert_tbl1" "s2_alter_pub_add_tbl" "s1_commit" "s1_insert_tbl1" "s2_get_binary_changes"
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 977fbcd2474..67655111875 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -5460,3 +5460,26 @@ restart:
 		*cmax = ent->cmax;
 	return true;
 }
+
+/*
+ * Count invalidation messages of specified transaction.
+ *
+ * Returns number of messages, and msgs is set to the pointer of the linked
+ * list for the messages.
+ */
+uint32
+ReorderBufferGetInvalidations(ReorderBuffer *rb, TransactionId xid,
+							  SharedInvalidationMessage **msgs)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	if (txn == NULL)
+		return 0;
+
+	*msgs = txn->invalidations;
+
+	return txn->ninvalidations;
+}
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index b64e53de017..97188d2e747 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -161,7 +161,7 @@ static void SnapBuildFreeSnapshot(Snapshot snap);
 
 static void SnapBuildSnapIncRefcount(Snapshot snap);
 
-static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
+static void SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid);
 
 static inline bool SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid,
 												 uint32 xinfo);
@@ -720,15 +720,15 @@ SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
 }
 
 /*
- * Add a new Snapshot to all transactions we're decoding that currently are
- * in-progress so they can see new catalog contents made by the transaction
- * that just committed. This is necessary because those in-progress
- * transactions will use the new catalog's contents from here on (at the very
- * least everything they do needs to be compatible with newer catalog
- * contents).
+ * Add a new Snapshot and invalidation messages to all transactions we're
+ * decoding that currently are in-progress so they can see new catalog contents
+ * made by the transaction that just committed. This is necessary because those
+ * in-progress transactions will use the new catalog's contents from here on
+ * (at the very least everything they do needs to be compatible with newer
+ * catalog contents).
  */
 static void
-SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
+SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid)
 {
 	dlist_iter	txn_i;
 	ReorderBufferTXN *txn;
@@ -736,7 +736,8 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 	/*
 	 * Iterate through all toplevel transactions. This can include
 	 * subtransactions which we just don't yet know to be that, but that's
-	 * fine, they will just get an unnecessary snapshot queued.
+	 * fine, they will just get an unnecessary snapshot and invalidations
+	 * queued.
 	 */
 	dlist_foreach(txn_i, &builder->reorder->toplevel_by_lsn)
 	{
@@ -749,6 +750,14 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		 * transaction which in turn implies we don't yet need a snapshot at
 		 * all. We'll add a snapshot when the first change gets queued.
 		 *
+		 * Similarly, we don't need to add invalidations to a transaction whose
+		 * base snapshot is not yet set. Once a base snapshot is built, it will
+		 * include the xids of committed transactions that have modified the
+		 * catalog, thus reflecting the new catalog contents. The existing
+		 * catalog cache will have already been invalidated after processing
+		 * the invalidations in the transaction that modified catalogs,
+		 * ensuring that a fresh cache is constructed during decoding.
+		 *
 		 * NB: This works correctly even for subtransactions because
 		 * ReorderBufferAssignChild() takes care to transfer the base snapshot
 		 * to the top-level transaction, and while iterating the changequeue
@@ -758,13 +767,13 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 			continue;
 
 		/*
-		 * We don't need to add snapshot to prepared transactions as they
-		 * should not see the new catalog contents.
+		 * We don't need to add snapshot or invalidations to prepared
+		 * transactions as they should not see the new catalog contents.
 		 */
 		if (rbtxn_is_prepared(txn))
 			continue;
 
-		elog(DEBUG2, "adding a new snapshot to %u at %X/%X",
+		elog(DEBUG2, "adding a new snapshot and invalidations to %u at %X/%X",
 			 txn->xid, LSN_FORMAT_ARGS(lsn));
 
 		/*
@@ -774,6 +783,33 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		SnapBuildSnapIncRefcount(builder->snapshot);
 		ReorderBufferAddSnapshot(builder->reorder, txn->xid, lsn,
 								 builder->snapshot);
+
+		/*
+		 * Add invalidation messages to the reorder buffer of in-progress
+		 * transactions except the current committed transaction, for which we
+		 * will execute invalidations at the end.
+		 *
+		 * It is required, otherwise, we will end up using the stale catcache
+		 * contents built by the current transaction even after its decoding,
+		 * which should have been invalidated due to concurrent catalog
+		 * changing transaction.
+		 */
+		if (txn->xid != xid)
+		{
+			uint32 ninvalidations;
+			SharedInvalidationMessage *msgs = NULL;
+
+			ninvalidations = ReorderBufferGetInvalidations(builder->reorder,
+														   xid, &msgs);
+
+			if (ninvalidations > 0)
+			{
+				Assert(msgs != NULL);
+
+				ReorderBufferAddInvalidations(builder->reorder, txn->xid, lsn,
+											  ninvalidations, msgs);
+			}
+		}
 	}
 }
 
@@ -1045,8 +1081,11 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/* refcount of the snapshot builder for the new snapshot */
 		SnapBuildSnapIncRefcount(builder->snapshot);
 
-		/* add a new catalog snapshot to all currently running transactions */
-		SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
+		/*
+		 * Add a new catalog snapshot and invalidations messages to all
+		 * currently running transactions.
+		 */
+		SnapBuildDistributeSnapshotAndInval(builder, lsn, xid);
 	}
 }
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 3be0cbd7ebe..24e88c409ba 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -758,6 +758,10 @@ extern TransactionId *ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb);
 
 extern void ReorderBufferSetRestartPoint(ReorderBuffer *rb, XLogRecPtr ptr);
 
+extern uint32 ReorderBufferGetInvalidations(ReorderBuffer *rb,
+											TransactionId xid,
+											SharedInvalidationMessage **msgs);
+
 extern void StartupReorderBuffer(void);
 
 #endif
-- 
2.43.5

#96Amit Kapila
amit.kapila16@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#94)
Re: long-standing data loss bug in initial sync of logical replication

On Mon, Mar 17, 2025 at 4:56 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Regarding PG13, it cannot be applied as-is, thus some adjustments are needed. I
will share it in upcoming posts.

Here is a patch set for PG13. Unlike for PG14-17, the patch could not be created
as-is, because...

1. WAL record for invalidation messages (XLOG_XACT_INVALIDATIONS) does not exist.
2. Thus the ReorderBufferChange for the invalidation does not exist.
Our patch tries to distribute it, but that cannot be done as-is.
3. The code assumed that invalidation messages can be added only once.
4. The timing when invalidation messages are consumed is limited to:
a. when a COMMAND_ID change is popped,
b. the start of decoding a transaction, or
c. the end of decoding a transaction.

The above means that invalidations cannot be executed while a transaction is
being decoded. I created two patch sets to resolve the data loss issue. 0001 has
fewer code changes but resolves only part of the issue; 0002 has larger changes
but provides a complete solution.

0001 - mostly the same as the patches for the other versions.
ReorderBufferAddInvalidations() was adjusted to allow being called several
times. As I said above, 0001 cannot execute inval messages while decoding the
transactions.
0002 - introduces a new ReorderBufferChange type to indicate inval messages.
It would be handled as in PG14+.

Here is an example. Assume that the table foo exists on both nodes, with a
publication "pub" which publishes all tables and a subscription "sub" which
subscribes to "pub". What happens if the following workload is executed?

```
S1 (DML)                                 S2 (DDL)
BEGIN;
INSERT INTO foo VALUES (1)
                                         ALTER PUBLICATION pub RENAME TO pub_renamed;
INSERT INTO foo VALUES (2)
COMMIT;
LR -> ?
```

With 0001, tuples (1) and (2) would be replicated to the subscriber.
An error "publication "pub" does not exist" would raise when new changes are done
later.

0001+0002 works more aggressively; the error would be raised when the S1
transaction is decoded. The behavior is the same as on patched PG14-PG17.

I understand that with 0001 the fix is partial, in the sense that, because
invalidations are processed at the end of the transaction, the changes of a
concurrent DDL will only be reflected for the next transaction. Now, on one
hand, it is prudent not to add a new type of ReorderBufferChange in a
backbranch (v13), but the change is not that invasive, so we could go with it
as well. My preference would be to go with just 0001 for v13 to minimize the
risk of adding new bugs or breaking something unintentionally.
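
To spell out that difference with an ADD TABLE variant of the workload quoted
above (a schematic sketch only, not tested output; the per-row outcomes follow
the analysis given later in this thread):

```
S1 (DML)                                 S2 (DDL)
BEGIN;
INSERT INTO foo VALUES (1);
                                         ALTER PUBLICATION pub ADD TABLE foo;
INSERT INTO foo VALUES (2);
COMMIT;
INSERT INTO foo VALUES (3);

-- (1) is not replicated in any case, since it precedes the ALTER.
-- 0001 only:  (2) is also filtered out; (3) is the first replicated row,
--             because the distributed invalidations are executed only after
--             decoding of S1's first transaction has finished.
-- 0001+0002:  (2) and (3) are replicated; the invalidations are consumed
--             while S1's transaction is being decoded, as on patched PG14+.
```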

Sawada-San, and others involved here, do you have any suggestions on
this matter?

--
With Regards,
Amit Kapila.

#97Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#96)
Re: long-standing data loss bug in initial sync of logical replication

On Tue, Mar 18, 2025 at 3:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Mar 17, 2025 at 4:56 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Regarding the PG13, it cannot be
applied
as-is thus some adjustments are needed. I will share it in upcoming posts.

Here is a patch set for PG13. Apart from PG14-17, the patch could be created as-is,
because...

1. WAL record for invalidation messages (XLOG_XACT_INVALIDATIONS) does not exist.
2. Thus the ReorderBufferChange for the invalidation does not exist.
Our patch tries to distribute it but cannot be done as-is.
3. Codes assumed that invalidation messages can be added only once.
4. The timing when invalidation messages are consumed is limited:
a. COMMAND_ID change is poped,
b. start of decoding a transaction, or
c. end of decoding a transaction.

Above means that invalidations cannot be executed while being decoded.
I created two patch sets to resolve the data loss issue. 0001 has less code
changes but could resolve a part of issue, 0002 has huge changes but provides a
complete solution.

0001 - mostly same as patches for other versions. ReorderBufferAddInvalidations()
was adjusted to allow being called several times. As I said above,
0001 cannot execute inval messages while decoding the transacitons.
0002 - introduces new ReorderBufferChange type to indicate inval messages.
It would be handled like PG14+.

Here is an example. Assuming that the table foo exists on both nodes, a
publication "pub" which publishes all tables, and a subscription "sub" which
subscribes "pub". What if the workload is executed?

```
S1 S2
BEGIN;
INSERT INTO foo VALUES (1)
ALTER PUBLICATION pub RENAME TO pub_renamed;
INSERT INTO foo VALUES (2)
COMMIT;
LR -> ?
```

With 0001, tuples (1) and (2) would be replicated to the subscriber.
An error "publication "pub" does not exist" would raise when new changes are done
later.

0001+0002 works more aggressively; the error would raise when S1 transaction is decoded.
The behavior is same as for patched PG14-PG17.

I understand that with 0001 the fix is partial in the sense that
because invalidations are processed at the transaction end, the
changes of concurrent DDL will only be reflected for the next
transaction. Now, on one hand, it is prudent to not add a new type of
ReorderBufferChange in the backbranch (v13) but the change is not that
invasive, so we can go with it as well. My preference would be to go
with just 0001 for v13 to minimize the risk of adding new bugs or
breaking something unintentionally.

Sawada-San, and others involved here, do you have any suggestions on
this matter?

Seeing no responses for a long time, I am planning to push the fix
till 14 tomorrow unless there are some opinions on the fix for 13. We
can continue to discuss the scope of the fix for 13.

--
With Regards,
Amit Kapila.

#98Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#97)
Re: long-standing data loss bug in initial sync of logical replication

On Tue, Apr 8, 2025 at 3:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Mar 18, 2025 at 3:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Sawada-San, and others involved here, do you have any suggestions on
this matter?

Seeing no responses for a long time, I am planning to push the fix
till 14 tomorrow unless there are some opinions on the fix for 13. We
can continue to discuss the scope of the fix for 13.

Pushed till 14.

--
With Regards,
Amit Kapila.

#99Tomas Vondra
tomas@vondra.me
In reply to: Amit Kapila (#98)
Re: long-standing data loss bug in initial sync of logical replication

On 4/10/25 11:45, Amit Kapila wrote:

On Tue, Apr 8, 2025 at 3:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Mar 18, 2025 at 3:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Sawada-San, and others involved here, do you have any suggestions on
this matter?

Seeing no responses for a long time, I am planning to push the fix
till 14 tomorrow unless there are some opinions on the fix for 13. We
can continue to discuss the scope of the fix for 13.

Pushed till 14.

Thanks everyone who persevered and kept working on fixing this! Highly
appreciated.

regards

--
Tomas Vondra

#100Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#96)
Re: long-standing data loss bug in initial sync of logical replication

On Tue, Mar 18, 2025 at 2:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Mar 17, 2025 at 4:56 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Regarding the PG13, it cannot be
applied
as-is thus some adjustments are needed. I will share it in upcoming posts.

Here is a patch set for PG13. Apart from PG14-17, the patch could be created as-is,
because...

1. WAL record for invalidation messages (XLOG_XACT_INVALIDATIONS) does not exist.
2. Thus the ReorderBufferChange for the invalidation does not exist.
Our patch tries to distribute it but cannot be done as-is.
3. Codes assumed that invalidation messages can be added only once.
4. The timing when invalidation messages are consumed is limited:
a. COMMAND_ID change is poped,
b. start of decoding a transaction, or
c. end of decoding a transaction.

Above means that invalidations cannot be executed while being decoded.
I created two patch sets to resolve the data loss issue. 0001 has less code
changes but could resolve a part of issue, 0002 has huge changes but provides a
complete solution.

0001 - mostly same as patches for other versions. ReorderBufferAddInvalidations()
was adjusted to allow being called several times. As I said above,
0001 cannot execute inval messages while decoding the transacitons.
0002 - introduces new ReorderBufferChange type to indicate inval messages.
It would be handled like PG14+.

Here is an example. Assuming that the table foo exists on both nodes, a
publication "pub" which publishes all tables, and a subscription "sub" which
subscribes "pub". What if the workload is executed?

```
S1 S2
BEGIN;
INSERT INTO foo VALUES (1)
ALTER PUBLICATION pub RENAME TO pub_renamed;
INSERT INTO foo VALUES (2)
COMMIT;
LR -> ?
```

With 0001, tuples (1) and (2) would be replicated to the subscriber.
An error "publication "pub" does not exist" would raise when new changes are done
later.

0001+0002 works more aggressively; the error would raise when S1 transaction is decoded.
The behavior is same as for patched PG14-PG17.

I understand that with 0001 the fix is partial in the sense that
because invalidations are processed at the transaction end, the
changes of concurrent DDL will only be reflected for the next
transaction. Now, on one hand, it is prudent to not add a new type of
ReorderBufferChange in the backbranch (v13) but the change is not that
invasive, so we can go with it as well. My preference would be to go
with just 0001 for v13 to minimize the risk of adding new bugs or
breaking something unintentionally.

Sawada-San, and others involved here, do you have any suggestions on
this matter?

Sorry for the late response.

I agree with just 0001 for v13, as 0002 seems invasive. Given that v13 will
have only a few releases until EOL and 0001 can deal with some of the cases in
question, I'd like to avoid such invasive changes in v13. It would not be
advisable to change the ReorderBufferChange format in a minor release, even
though it would not change the struct size.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#101Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#100)
Re: long-standing data loss bug in initial sync of logical replication

On Tue, Apr 22, 2025 at 10:57 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Mar 18, 2025 at 2:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Mar 17, 2025 at 4:56 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Regarding the PG13, it cannot be
applied
as-is thus some adjustments are needed. I will share it in upcoming posts.

Here is a patch set for PG13. Apart from PG14-17, the patch could be created as-is,
because...

1. WAL record for invalidation messages (XLOG_XACT_INVALIDATIONS) does not exist.
2. Thus the ReorderBufferChange for the invalidation does not exist.
Our patch tries to distribute it but cannot be done as-is.
3. Codes assumed that invalidation messages can be added only once.
4. The timing when invalidation messages are consumed is limited:
a. COMMAND_ID change is poped,
b. start of decoding a transaction, or
c. end of decoding a transaction.

Above means that invalidations cannot be executed while being decoded.
I created two patch sets to resolve the data loss issue. 0001 has less code
changes but could resolve a part of issue, 0002 has huge changes but provides a
complete solution.

0001 - mostly same as patches for other versions. ReorderBufferAddInvalidations()
was adjusted to allow being called several times. As I said above,
0001 cannot execute inval messages while decoding the transacitons.
0002 - introduces new ReorderBufferChange type to indicate inval messages.
It would be handled like PG14+.

Here is an example. Assuming that the table foo exists on both nodes, a
publication "pub" which publishes all tables, and a subscription "sub" which
subscribes "pub". What if the workload is executed?

```
S1 S2
BEGIN;
INSERT INTO foo VALUES (1)
ALTER PUBLICATION pub RENAME TO pub_renamed;
INSERT INTO foo VALUES (2)
COMMIT;
LR -> ?
```

With 0001, tuples (1) and (2) would be replicated to the subscriber.
An error "publication "pub" does not exist" would raise when new changes are done
later.

0001+0002 works more aggressively; the error would raise when S1 transaction is decoded.
The behavior is same as for patched PG14-PG17.

I understand that with 0001 the fix is partial in the sense that
because invalidations are processed at the transaction end, the
changes of concurrent DDL will only be reflected for the next
transaction. Now, on one hand, it is prudent to not add a new type of
ReorderBufferChange in the backbranch (v13) but the change is not that
invasive, so we can go with it as well. My preference would be to go
with just 0001 for v13 to minimize the risk of adding new bugs or
breaking something unintentionally.

Sawada-San, and others involved here, do you have any suggestions on
this matter?

Sorry for the late response.

I agree with just 0001 for v13 as 0002 seems invasive. Given that v13
would have only a few releases until EOL and 0001 can deal with some
cases in question, I'd like to avoid such invasive changes in v13.

Fair enough. OTOH, we can leave the 13 branch as is, considering the following:
(a) it is near EOL, (b) the bug happens in rare cases (when DDLs like
ALTER PUBLICATION ... ADD TABLE ... or ALTER TYPE ... that don't take
a strong lock on the table happen concurrently with DMLs on the tables
involved in the DDL), and (c) the complete fix is invasive, and even the
partial fix is not simple. I have a slight fear that if we make any
mistake in fixing it partially (of course, we can't see any today), we
may not even get a chance to fix it.

Now, if the above convinces you or someone else not to push the
partial fix to 13, then fine; otherwise, I'll push 0001 to 13 the day
after tomorrow.

--
With Regards,
Amit Kapila.

#102Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#101)
Re: long-standing data loss bug in initial sync of logical replication

On Tue, Apr 22, 2025 at 11:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Apr 22, 2025 at 10:57 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Mar 18, 2025 at 2:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Mar 17, 2025 at 4:56 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Regarding the PG13, it cannot be
applied
as-is thus some adjustments are needed. I will share it in upcoming posts.

Here is a patch set for PG13. Apart from PG14-17, the patch could be created as-is,
because...

1. WAL record for invalidation messages (XLOG_XACT_INVALIDATIONS) does not exist.
2. Thus the ReorderBufferChange for the invalidation does not exist.
Our patch tries to distribute it but cannot be done as-is.
3. Codes assumed that invalidation messages can be added only once.
4. The timing when invalidation messages are consumed is limited:
a. COMMAND_ID change is poped,
b. start of decoding a transaction, or
c. end of decoding a transaction.

Above means that invalidations cannot be executed while being decoded.
I created two patch sets to resolve the data loss issue. 0001 has less code
changes but could resolve a part of issue, 0002 has huge changes but provides a
complete solution.

0001 - mostly same as patches for other versions. ReorderBufferAddInvalidations()
was adjusted to allow being called several times. As I said above,
0001 cannot execute inval messages while decoding the transacitons.
0002 - introduces new ReorderBufferChange type to indicate inval messages.
It would be handled like PG14+.

Here is an example. Assuming that the table foo exists on both nodes, a
publication "pub" which publishes all tables, and a subscription "sub" which
subscribes "pub". What if the workload is executed?

```
S1 S2
BEGIN;
INSERT INTO foo VALUES (1)
ALTER PUBLICATION pub RENAME TO pub_renamed;
INSERT INTO foo VALUES (2)
COMMIT;
LR -> ?
```

With 0001, tuples (1) and (2) would be replicated to the subscriber.
An error "publication "pub" does not exist" would raise when new changes are done
later.

0001+0002 works more aggressively; the error would raise when S1 transaction is decoded.
The behavior is same as for patched PG14-PG17.

I understand that with 0001 the fix is partial in the sense that
because invalidations are processed at the transaction end, the
changes of concurrent DDL will only be reflected for the next
transaction. Now, on one hand, it is prudent to not add a new type of
ReorderBufferChange in the backbranch (v13) but the change is not that
invasive, so we can go with it as well. My preference would be to go
with just 0001 for v13 to minimize the risk of adding new bugs or
breaking something unintentionally.

Sawada-San, and others involved here, do you have any suggestions on
this matter?

Sorry for the late response.

I agree with just 0001 for v13 as 0002 seems invasive. Given that v13
would have only a few releases until EOL and 0001 can deal with some
cases in question, I'd like to avoid such invasive changes in v13.

Fair enough. OTOH, we can leave the 13 branch considering following:
(a) it is near EOL, (b) bug happens in rare cases (when the DDLs like
ALTER PUBLICATION ... ADD TABLE ... or ALTER TYPE ... that don't take
a strong lock on table happens concurrently to DMLs on the tables
involved in the DDL.), and (c) the complete fix is invasive, even
partial fix is not simple. I have a slight fear that if we make any
mistake in fixing it partially (of course, we can't see any today), we
may not even get a chance to fix it.

Now, if the above convinces you or someone else not to push the
partial fix in 13, then fine; otherwise, I'll push the 0001 to 13 day
after tomorrow.

I've considered the above points. I guess (b), particularly executing
ALTER PUBLICATION .. ADD TABLE while the target table is being
updated, might not be rare depending on the system. Given that this bug
causes silent data loss on the subscriber that is hard for users to
notice, it ultimately depends on to what extent we can mitigate the
problem with only 0001 and whether there is a workaround when the
problem happens.

Kuroda-san already shared [1] the analysis of what happens with and
without the 0002 patch, but let me try with an example close to the
original data-loss problem [2]:

Consider the following scenario:

S1: CREATE TABLE d(data text not null);
S1: INSERT INTO d VALUES('d1');
S2: BEGIN;
S2: INSERT INTO d VALUES('d2');
S1: ALTER PUBLICATION pb ADD TABLE d;
S2: INSERT INTO d VALUES('d3');
S2: COMMIT
S2: INSERT INTO d VALUES('d4');
S1: INSERT INTO d VALUES('d5');

Without 0001 and 0002 (i.e. as of today), the walsender fails to send
any changes to table 'd' until it invalidates its caches for some
reason.

With only 0001, the walsender sends the 'd4' insertion and later changes.

With both 0001 and 0002, the walsender sends the 'd3' insertion and later
changes.

ISTM the difference between having neither 0001 nor 0002 and having only
0001 is significant. So I think it's worth applying 0001 to v13.
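
For anyone who wants to observe this by hand, here is a rough interactive
sketch modeled on the invalidation_distrubution.spec added by the patch (the
slot, table, and publication names are illustrative, and proto_version has to
match what the branch under test supports; '4' is what the spec uses on HEAD):

```
-- Session 1 (publisher, wal_level = logical): table, empty publication,
-- and a pgoutput slot
CREATE TABLE d(data text NOT NULL);
CREATE PUBLICATION pb;
SELECT 'init' FROM pg_create_logical_replication_slot('obs_slot', 'pgoutput');

-- Session 2: leave a transaction open across the publication change
BEGIN;
INSERT INTO d VALUES ('d2');

-- Session 1: add the table while session 2 is still in progress
ALTER PUBLICATION pb ADD TABLE d;

-- Session 2: one more change inside the open transaction, then a new one
INSERT INTO d VALUES ('d3');
COMMIT;
INSERT INTO d VALUES ('d4');

-- Session 1: count the decoded INSERT messages ('I' = 73), as the spec does.
-- On an unfixed server the inserts made after the ALTER are missing from the
-- stream; how many show up on a fixed server depends on which of the patches
-- discussed above are applied.
SELECT count(data)
  FROM pg_logical_slot_get_binary_changes('obs_slot', NULL, NULL,
         'proto_version', '4', 'publication_names', 'pb')
 WHERE get_byte(data, 0) = 73;
```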

Regards,

[1]: /messages/by-id/OSCPR01MB149664A485A89B0AC6FB7BA71F5DF2@OSCPR01MB14966.jpnprd01.prod.outlook.com
[2]: /messages/by-id/20231118025445.crhaeeuvoe2g5dv6@awork3.anarazel.de

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#103Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#102)
Re: long-standing data loss bug in initial sync of logical replication

On Wed, Apr 23, 2025 at 10:28 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Fair enough. OTOH, we can leave the 13 branch considering following:
(a) it is near EOL, (b) bug happens in rare cases (when the DDLs like
ALTER PUBLICATION ... ADD TABLE ... or ALTER TYPE ... that don't take
a strong lock on table happens concurrently to DMLs on the tables
involved in the DDL.), and (c) the complete fix is invasive, even
partial fix is not simple. I have a slight fear that if we make any
mistake in fixing it partially (of course, we can't see any today), we
may not even get a chance to fix it.

Now, if the above convinces you or someone else not to push the
partial fix in 13, then fine; otherwise, I'll push the 0001 to 13 day
after tomorrow.

I've considered the above points. I guess (b), particularly executing
ALTER PUBLICATION .. ADD TABLE while the target table is being
updated, might not be rare on some systems. Given that this bug
causes silent data loss on the subscriber that is hard for users to
notice, the decision ultimately depends on to what extent we can
mitigate the problem with only 0001, and on whether there is a
workaround when the problem does happen.

Kuroda-san already shared[1] the analysis of what happens with and
without 0002 patch, but let me try with the example close to the
original data-loss problem[2]:

Consider the following scenario:

S1: CREATE TABLE d(data text not null);
S1: INSERT INTO d VALUES('d1');
S2: BEGIN;
S2: INSERT INTO d VALUES('d2');
S1: ALTER PUBLICATION pb ADD TABLE d;
S2: INSERT INTO d VALUES('d3');
S2: COMMIT
S2: INSERT INTO d VALUES('d4');
S1: INSERT INTO d VALUES('d5');

Without 0001 and 0002 (i.e. as of today), the walsender fails to send
any changes to table 'd' until it invalidates its caches for some
reason.

With only 0001, the walsender sends the 'd4' insertion and later
changes.

With both 0001 and 0002, the walsender sends the 'd3' insertion and
later changes.

ISTM the difference between having neither 0001 nor 0002 and having
only 0001 is significant. So I think it's worth applying 0001 to v13.

Pushed to v13 as well, thanks for sharing the feedback.

--
With Regards,
Amit Kapila.

#104Shlok Kyal
shlok.kyal.oss@gmail.com
In reply to: Amit Kapila (#103)
6 attachment(s)
Re: long-standing data loss bug in initial sync of logical replication

On Thu, 24 Apr 2025 at 14:39, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Apr 23, 2025 at 10:28 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Fair enough. OTOH, we could leave the 13 branch alone, considering the
following: (a) it is near EOL, (b) the bug happens only in rare cases
(when DDLs like ALTER PUBLICATION ... ADD TABLE ... or ALTER TYPE ...
that don't take a strong lock on the table happen concurrently with
DMLs on the tables involved in the DDL), and (c) the complete fix is
invasive, and even the partial fix is not simple. I have a slight fear
that if we make any mistake in fixing it partially (of course, we can't
see any today), we may not even get a chance to fix it.

Now, if the above convinces you or someone else not to push the
partial fix to 13, then fine; otherwise, I'll push 0001 to 13 the day
after tomorrow.

I've considered the above points. I guess (b), particularly executing
ALTER PUBLICATION .. ADD TABLE while the target table is being
updated, might not be rare on some systems. Given that this bug
causes silent data loss on the subscriber that is hard for users to
notice, the decision ultimately depends on to what extent we can
mitigate the problem with only 0001, and on whether there is a
workaround when the problem does happen.

Kuroda-san already shared[1] the analysis of what happens with and
without 0002 patch, but let me try with the example close to the
original data-loss problem[2]:

Consider the following scenario:

S1: CREATE TABLE d(data text not null);
S1: INSERT INTO d VALUES('d1');
S2: BEGIN;
S2: INSERT INTO d VALUES('d2');
S1: ALTER PUBLICATION pb ADD TABLE d;
S2: INSERT INTO d VALUES('d3');
S2: COMMIT
S2: INSERT INTO d VALUES('d4');
S1: INSERT INTO d VALUES('d5');

Without 0001 and 0002 (i.e. as of today), the walsender fails to send
any changes to table 'd' until it invalidates its caches for some
reason.

With only 0001, the walsender sends the 'd4' insertion and later
changes.

With both 0001 and 0002, the walsender sends the 'd3' insertion and
later changes.

ISTM the difference between having neither 0001 nor 0002 and having
only 0001 is significant. So I think it's worth applying 0001 to v13.

Pushed to v13 as well, thanks for sharing the feedback.

In the commits, I noticed that the file names
invalidation_distrubution.out and invalidation_distrubution.spec are
misspelled. This is present in branches from REL_13 to HEAD. I have
attached patches to fix this.

Thanks and Regards,
Shlok Kyal

Attachments:

v1_HEAD-0001-Fix-spelling-for-file-names.patch
From dee8b9d98c9889a6ec540476a9af3e0f7595d344 Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Thu, 24 Apr 2025 12:29:58 +0530
Subject: [PATCH v1_HEAD] Fix spelling for file names

Rename files invalidation_distrubution.out and
invalidation_distrubution.spec to invalidation_distribution.out
and invalidation_distribution.spec.
---
 contrib/test_decoding/Makefile                                  | 2 +-
 ...alidation_distrubution.out => invalidation_distribution.out} | 0
 contrib/test_decoding/meson.build                               | 2 +-
 ...idation_distrubution.spec => invalidation_distribution.spec} | 0
 4 files changed, 2 insertions(+), 2 deletions(-)
 rename contrib/test_decoding/expected/{invalidation_distrubution.out => invalidation_distribution.out} (100%)
 rename contrib/test_decoding/specs/{invalidation_distrubution.spec => invalidation_distribution.spec} (100%)

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index eef70770674..02e961f4d31 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -9,7 +9,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
 	twophase_snapshot slot_creation_error catalog_change_snapshot \
-	skip_snapshot_restore invalidation_distrubution
+	skip_snapshot_restore invalidation_distribution
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/invalidation_distrubution.out b/contrib/test_decoding/expected/invalidation_distribution.out
similarity index 100%
rename from contrib/test_decoding/expected/invalidation_distrubution.out
rename to contrib/test_decoding/expected/invalidation_distribution.out
diff --git a/contrib/test_decoding/meson.build b/contrib/test_decoding/meson.build
index d5f03fd5e9b..25f6b8a9082 100644
--- a/contrib/test_decoding/meson.build
+++ b/contrib/test_decoding/meson.build
@@ -63,7 +63,7 @@ tests += {
       'twophase_snapshot',
       'slot_creation_error',
       'skip_snapshot_restore',
-      'invalidation_distrubution',
+      'invalidation_distribution',
     ],
     'regress_args': [
       '--temp-config', files('logical.conf'),
diff --git a/contrib/test_decoding/specs/invalidation_distrubution.spec b/contrib/test_decoding/specs/invalidation_distribution.spec
similarity index 100%
rename from contrib/test_decoding/specs/invalidation_distrubution.spec
rename to contrib/test_decoding/specs/invalidation_distribution.spec
-- 
2.34.1

v1_REL_15-0001-Fix-spelling-for-file-names.patch
From f923bb9dd6eb6c4e3f9f22dfc1cc9c67e1df289a Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Thu, 24 Apr 2025 12:29:58 +0530
Subject: [PATCH v1] Fix spelling for file names

Rename files invalidation_distrubution.out and
invalidation_distrubution.spec to invalidation_distribution.out
and invalidation_distribution.spec.
---
 contrib/test_decoding/Makefile                                  | 2 +-
 ...alidation_distrubution.out => invalidation_distribution.out} | 0
 ...idation_distrubution.spec => invalidation_distribution.spec} | 0
 3 files changed, 1 insertion(+), 1 deletion(-)
 rename contrib/test_decoding/expected/{invalidation_distrubution.out => invalidation_distribution.out} (100%)
 rename contrib/test_decoding/specs/{invalidation_distrubution.spec => invalidation_distribution.spec} (100%)

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index eef70770674..02e961f4d31 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -9,7 +9,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
 	twophase_snapshot slot_creation_error catalog_change_snapshot \
-	skip_snapshot_restore invalidation_distrubution
+	skip_snapshot_restore invalidation_distribution
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/invalidation_distrubution.out b/contrib/test_decoding/expected/invalidation_distribution.out
similarity index 100%
rename from contrib/test_decoding/expected/invalidation_distrubution.out
rename to contrib/test_decoding/expected/invalidation_distribution.out
diff --git a/contrib/test_decoding/specs/invalidation_distrubution.spec b/contrib/test_decoding/specs/invalidation_distribution.spec
similarity index 100%
rename from contrib/test_decoding/specs/invalidation_distrubution.spec
rename to contrib/test_decoding/specs/invalidation_distribution.spec
-- 
2.34.1

v1_REL_13-0001-Fix-spelling-for-file-names.patch
From cdc82231f74934240e906da6a85227a537571ddb Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Thu, 24 Apr 2025 12:24:51 +0530
Subject: [PATCH v1] Fix spelling for file names

Rename files invalidation_distrubution.out and
invalidation_distrubution.spec to invalidation_distribution.out
and invalidation_distribution.spec.
---
 contrib/test_decoding/Makefile                                  | 2 +-
 ...alidation_distrubution.out => invalidation_distribution.out} | 0
 ...idation_distrubution.spec => invalidation_distribution.spec} | 0
 3 files changed, 1 insertion(+), 1 deletion(-)
 rename contrib/test_decoding/expected/{invalidation_distrubution.out => invalidation_distribution.out} (100%)
 rename contrib/test_decoding/specs/{invalidation_distrubution.spec => invalidation_distribution.spec} (100%)

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index f122dc3a82d..931f82ab619 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -8,7 +8,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	spill slot truncate
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top catalog_change_snapshot \
-	skip_snapshot_restore invalidation_distrubution
+	skip_snapshot_restore invalidation_distribution
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/invalidation_distrubution.out b/contrib/test_decoding/expected/invalidation_distribution.out
similarity index 100%
rename from contrib/test_decoding/expected/invalidation_distrubution.out
rename to contrib/test_decoding/expected/invalidation_distribution.out
diff --git a/contrib/test_decoding/specs/invalidation_distrubution.spec b/contrib/test_decoding/specs/invalidation_distribution.spec
similarity index 100%
rename from contrib/test_decoding/specs/invalidation_distrubution.spec
rename to contrib/test_decoding/specs/invalidation_distribution.spec
-- 
2.34.1

v1_REL_16-0001-Fix-spelling-for-file-names.patch
From b1c91bbbf2d268707f9d78cf4bd17e000bd50fec Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Thu, 24 Apr 2025 12:29:58 +0530
Subject: [PATCH v1_REL_16] Fix spelling for file names

Rename files invalidation_distrubution.out and
invalidation_distrubution.spec to invalidation_distribution.out
and invalidation_distribution.spec.
---
 contrib/test_decoding/Makefile                                  | 2 +-
 ...alidation_distrubution.out => invalidation_distribution.out} | 0
 contrib/test_decoding/meson.build                               | 2 +-
 ...idation_distrubution.spec => invalidation_distribution.spec} | 0
 4 files changed, 2 insertions(+), 2 deletions(-)
 rename contrib/test_decoding/expected/{invalidation_distrubution.out => invalidation_distribution.out} (100%)
 rename contrib/test_decoding/specs/{invalidation_distrubution.spec => invalidation_distribution.spec} (100%)

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index eef70770674..02e961f4d31 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -9,7 +9,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
 	twophase_snapshot slot_creation_error catalog_change_snapshot \
-	skip_snapshot_restore invalidation_distrubution
+	skip_snapshot_restore invalidation_distribution
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/invalidation_distrubution.out b/contrib/test_decoding/expected/invalidation_distribution.out
similarity index 100%
rename from contrib/test_decoding/expected/invalidation_distrubution.out
rename to contrib/test_decoding/expected/invalidation_distribution.out
diff --git a/contrib/test_decoding/meson.build b/contrib/test_decoding/meson.build
index 273d26643c0..3c99ba472c0 100644
--- a/contrib/test_decoding/meson.build
+++ b/contrib/test_decoding/meson.build
@@ -63,7 +63,7 @@ tests += {
       'twophase_snapshot',
       'slot_creation_error',
       'skip_snapshot_restore',
-      'invalidation_distrubution',
+      'invalidation_distribution',
     ],
     'regress_args': [
       '--temp-config', files('logical.conf'),
diff --git a/contrib/test_decoding/specs/invalidation_distrubution.spec b/contrib/test_decoding/specs/invalidation_distribution.spec
similarity index 100%
rename from contrib/test_decoding/specs/invalidation_distrubution.spec
rename to contrib/test_decoding/specs/invalidation_distribution.spec
-- 
2.34.1

v1_REL_14-0001-Fix-spelling-for-file-names.patch
From 3b9395ef2af44ebf914f07d52194fb4823631e35 Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Thu, 24 Apr 2025 13:25:08 +0530
Subject: [PATCH v1] Fix spelling for file names

Rename files invalidation_distrubution.out and
invalidation_distrubution.spec to invalidation_distribution.out
and invalidation_distribution.spec.
---
 contrib/test_decoding/Makefile                                  | 2 +-
 ...alidation_distrubution.out => invalidation_distribution.out} | 0
 ...idation_distrubution.spec => invalidation_distribution.spec} | 0
 3 files changed, 1 insertion(+), 1 deletion(-)
 rename contrib/test_decoding/expected/{invalidation_distrubution.out => invalidation_distribution.out} (100%)
 rename contrib/test_decoding/specs/{invalidation_distrubution.spec => invalidation_distribution.spec} (100%)

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 8b0b8cc3acf..1e5f4f6b1cd 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -9,7 +9,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
 	twophase_snapshot catalog_change_snapshot skip_snapshot_restore \
-	invalidation_distrubution
+	invalidation_distribution
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/invalidation_distrubution.out b/contrib/test_decoding/expected/invalidation_distribution.out
similarity index 100%
rename from contrib/test_decoding/expected/invalidation_distrubution.out
rename to contrib/test_decoding/expected/invalidation_distribution.out
diff --git a/contrib/test_decoding/specs/invalidation_distrubution.spec b/contrib/test_decoding/specs/invalidation_distribution.spec
similarity index 100%
rename from contrib/test_decoding/specs/invalidation_distrubution.spec
rename to contrib/test_decoding/specs/invalidation_distribution.spec
-- 
2.34.1

v1_REL_17-0001-Fix-spelling-for-file-names.patch
From 0feea36896a0236bea1162d2ddeec7c169422fff Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Thu, 24 Apr 2025 12:29:58 +0530
Subject: [PATCH v1_REL_17] Fix spelling for file names

Rename files invalidation_distrubution.out and
invalidation_distrubution.spec to invalidation_distribution.out
and invalidation_distribution.spec.
---
 contrib/test_decoding/Makefile                                  | 2 +-
 ...alidation_distrubution.out => invalidation_distribution.out} | 0
 contrib/test_decoding/meson.build                               | 2 +-
 ...idation_distrubution.spec => invalidation_distribution.spec} | 0
 4 files changed, 2 insertions(+), 2 deletions(-)
 rename contrib/test_decoding/expected/{invalidation_distrubution.out => invalidation_distribution.out} (100%)
 rename contrib/test_decoding/specs/{invalidation_distrubution.spec => invalidation_distribution.spec} (100%)

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index eef70770674..02e961f4d31 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -9,7 +9,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
 	twophase_snapshot slot_creation_error catalog_change_snapshot \
-	skip_snapshot_restore invalidation_distrubution
+	skip_snapshot_restore invalidation_distribution
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/invalidation_distrubution.out b/contrib/test_decoding/expected/invalidation_distribution.out
similarity index 100%
rename from contrib/test_decoding/expected/invalidation_distrubution.out
rename to contrib/test_decoding/expected/invalidation_distribution.out
diff --git a/contrib/test_decoding/meson.build b/contrib/test_decoding/meson.build
index b31c433681d..03dd80b7f19 100644
--- a/contrib/test_decoding/meson.build
+++ b/contrib/test_decoding/meson.build
@@ -63,7 +63,7 @@ tests += {
       'twophase_snapshot',
       'slot_creation_error',
       'skip_snapshot_restore',
-      'invalidation_distrubution',
+      'invalidation_distribution',
     ],
     'regress_args': [
       '--temp-config', files('logical.conf'),
diff --git a/contrib/test_decoding/specs/invalidation_distrubution.spec b/contrib/test_decoding/specs/invalidation_distribution.spec
similarity index 100%
rename from contrib/test_decoding/specs/invalidation_distrubution.spec
rename to contrib/test_decoding/specs/invalidation_distribution.spec
-- 
2.34.1

#105Amit Kapila
amit.kapila16@gmail.com
In reply to: Shlok Kyal (#104)
Re: long-standing data loss bug in initial sync of logical replication

On Fri, Apr 25, 2025 at 10:45 AM Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:

In the commits, I noticed that the file names
invalidation_distrubution.out and invalidation_distrubution.spec are
misspelled. This is present in branches from REL_13 to HEAD. I have
attached patches to fix this.

Thanks for spotting the problem. I've pushed your patch.

--
With Regards,
Amit Kapila.