Apply worker fails if a relation is missing on subscriber even if refresh publication has not been refreshed yet

Started by Melih Mutluabout 3 years ago4 messages

m.melihmutlu@gmail.com

about 3 years ago

1 attachment(s)

Hi hackers,

I realized a behaviour of logical replication that seems unexpected to me,
but not totally sure.

Let's say a new table is created and added into a publication and not
created on subscriber yet. Also "ALTER SUBSCRIPTION ... REFRESH
PUBLICATION" has not been called yet.
What I expect in that case would be that logical replication continues to
work as it was working before the new table was created. The new table does
not get replicated until "REFRESH PUBLICATION" as stated here [1]https://www.postgresql.org/docs/current/sql-altersubscription.html.
This is indeed how it actually seems to work. Until we insert a row into
the new table.

After a new row into the new table, the apply worker gets this change and
tries to apply it. As expected, it fails since the table does not exist on
the subscriber yet. And the worker keeps crashing without and can't apply
any changes for any table.
The obvious way to resolve this is creating the table on subscriber as
well. After that apply worker will be back to work and skip changes for the
new table and move to other changes.
Since REFRESH PUBLICATION is not called yet, any change for the new table
will not be replicated.

If replication of the new table will not start anyway (until REFRESH
PUBLICATION), do we really need to have that table on the subscriber for
apply worker to work?
AFAIU any change on publication would not affect logical replication setup
until the publication gets refreshed on subscriber. If this understanding
is correct, then apply worker should be able to run without needing new
tables.
What do you think?

Also; if you agree, then the attached patch attempts to fix this issue.
It relies on the info from pg_subscription_rel so that apply worker only
applies changes for the relations exist in pg_subscription_rel.
Since new tables wouldn't be in there until the next REFRESH PUBLICATION,
missing those tables won't be a problem for existing subscriptions.

[1]: https://www.postgresql.org/docs/current/sql-altersubscription.html

Thanks,
--
Melih Mutlu
Microsoft

Attachments:

0001-Do-not-apply-for-tables-which-not-exist-on-catalog.patchapplication/octet-stream; name=0001-Do-not-apply-for-tables-which-not-exist-on-catalog.patchDownload

From f7d28477ea1fe41f1ed52014c59f09e10e5be2ae Mon Sep 17 00:00:00 2001
From: Melih Mutlu <m.melihmutlu@gmail.com>
Date: Thu, 6 Oct 2022 15:58:55 +0300
Subject: [PATCH] Do not apply for tables which not exist on catalog

If a new table is missing on subscriber and ALTER SUBSCRIPTION ...
REFRESH PUBLICATION has not been run yet after the tables was created
on publisher, show WARNING and HINT messages on subscriber logs.

Before this change, operations on such a recently added table were
causing constant logical replication worker failures until the table
was created on subscriber. There is no need to fail in this case.
Regardless of whether the table exists on subscriber, newly added tables
will not be replicated to subscriber until ALTER SUBSCRIPTION ... REFRESH
PUBLICATION is called.

The patch relies on the information from pg_subscription_rel. Apply
worker only applies changes for tables exist on pg_subscription_rel. New
tables wouldn't exist in the catalog until the next REFRESH PUBLICATION.
---
 src/backend/catalog/pg_subscription.c      | 34 ++++++++++++++++++++++
 src/backend/replication/logical/relation.c | 12 ++++++++
 src/backend/replication/logical/worker.c   | 20 ++++++++-----
 src/include/catalog/pg_subscription_rel.h  |  1 +
 4 files changed, 59 insertions(+), 8 deletions(-)

diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index a506fc3ec8..01ae95929c 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -543,3 +543,37 @@ GetSubscriptionRelations(Oid subid, bool not_ready)
 
 	return res;
 }
+
+/*
+ * Check whether the subscription has the given relation.
+ *
+ * Returns true if the relation exists in the subscription, false otherwise.
+ */
+bool
+CheckSubscriptionRelation(Oid subid, Oid relid)
+{
+	HeapTuple	tup;
+	Relation	rel;
+	bool		result = false;
+
+	/*
+	 * This is to avoid the race condition with AlterSubscription which tries
+	 * to remove this relstate.
+	 */
+	rel = table_open(SubscriptionRelRelationId, AccessShareLock);
+
+	/* Try finding the mapping. */
+	tup = SearchSysCacheCopy2(SUBSCRIPTIONRELMAP,
+							  ObjectIdGetDatum(relid),
+							  ObjectIdGetDatum(subid));
+
+	if (HeapTupleIsValid(tup))
+	{
+		result = true;
+	}
+
+	/* Cleanup */
+	table_close(rel, AccessShareLock);
+
+	return result;
+}
diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index e989047681..994421a2d7 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -384,6 +384,18 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
 		relid = RangeVarGetRelid(makeRangeVar(remoterel->nspname,
 											  remoterel->relname, -1),
 								 lockmode, true);
+
+		/* Check if relation really exists for the subscription. */
+		if (!CheckSubscriptionRelation(MySubscription->oid, relid))
+		{
+			ereport(WARNING,
+					errmsg("Subscription \"%s\" does not have a relation named \"%s.%s\".",
+						   MySubscription->name, remoterel->nspname, remoterel->relname),
+					errhint("Try to run \"ALTER SUBSCRIPTION %s REFRESH PUBLICATION\".",
+							MySubscription->name));
+			return NULL;
+		}
+
 		if (!OidIsValid(relid))
 			ereport(ERROR,
 					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 96772e4d73..e0cde9df37 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1691,13 +1691,14 @@ apply_handle_insert(StringInfo s)
 
 	relid = logicalrep_read_insert(s, &newtup);
 	rel = logicalrep_rel_open(relid, RowExclusiveLock);
-	if (!should_apply_changes_for_rel(rel))
+	if (!rel || !should_apply_changes_for_rel(rel))
 	{
 		/*
 		 * The relation can't become interesting in the middle of the
 		 * transaction so it's safe to unlock it.
 		 */
-		logicalrep_rel_close(rel, RowExclusiveLock);
+		if (rel)
+			logicalrep_rel_close(rel, RowExclusiveLock);
 		end_replication_step();
 		return;
 	}
@@ -1832,13 +1833,14 @@ apply_handle_update(StringInfo s)
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
 								   &newtup);
 	rel = logicalrep_rel_open(relid, RowExclusiveLock);
-	if (!should_apply_changes_for_rel(rel))
+	if (!rel || !should_apply_changes_for_rel(rel))
 	{
 		/*
 		 * The relation can't become interesting in the middle of the
 		 * transaction so it's safe to unlock it.
 		 */
-		logicalrep_rel_close(rel, RowExclusiveLock);
+		if (rel)
+			logicalrep_rel_close(rel, RowExclusiveLock);
 		end_replication_step();
 		return;
 	}
@@ -2001,13 +2003,14 @@ apply_handle_delete(StringInfo s)
 
 	relid = logicalrep_read_delete(s, &oldtup);
 	rel = logicalrep_rel_open(relid, RowExclusiveLock);
-	if (!should_apply_changes_for_rel(rel))
+	if (!rel || !should_apply_changes_for_rel(rel))
 	{
 		/*
 		 * The relation can't become interesting in the middle of the
 		 * transaction so it's safe to unlock it.
 		 */
-		logicalrep_rel_close(rel, RowExclusiveLock);
+		if (rel)
+			logicalrep_rel_close(rel, RowExclusiveLock);
 		end_replication_step();
 		return;
 	}
@@ -2421,13 +2424,14 @@ apply_handle_truncate(StringInfo s)
 		LogicalRepRelMapEntry *rel;
 
 		rel = logicalrep_rel_open(relid, lockmode);
-		if (!should_apply_changes_for_rel(rel))
+		if (!rel || !should_apply_changes_for_rel(rel))
 		{
 			/*
 			 * The relation can't become interesting in the middle of the
 			 * transaction so it's safe to unlock it.
 			 */
-			logicalrep_rel_close(rel, lockmode);
+			if (rel)
+				logicalrep_rel_close(rel, lockmode);
 			continue;
 		}
 
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index 8e88de7b2b..33cc75b656 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -89,5 +89,6 @@ extern void RemoveSubscriptionRel(Oid subid, Oid relid);
 
 extern bool HasSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionRelations(Oid subid, bool not_ready);
+extern bool CheckSubscriptionRelation(Oid subid, Oid relid);
 
 #endif							/* PG_SUBSCRIPTION_REL_H */
-- 
2.25.1

Amit Kapila

amit.kapila16@gmail.com

about 3 years ago

In reply to: Melih Mutlu (#1)

Re: Apply worker fails if a relation is missing on subscriber even if refresh publication has not been refreshed yet

On Thu, Dec 22, 2022 at 7:16 PM Melih Mutlu <m.melihmutlu@gmail.com> wrote:

Hi hackers,

I realized a behaviour of logical replication that seems unexpected to me, but not totally sure.

Let's say a new table is created and added into a publication and not created on subscriber yet. Also "ALTER SUBSCRIPTION ... REFRESH PUBLICATION" has not been called yet.
What I expect in that case would be that logical replication continues to work as it was working before the new table was created. The new table does not get replicated until "REFRESH PUBLICATION" as stated here [1].
This is indeed how it actually seems to work. Until we insert a row into the new table.

After a new row into the new table, the apply worker gets this change and tries to apply it. As expected, it fails since the table does not exist on the subscriber yet. And the worker keeps crashing without and can't apply any changes for any table.
The obvious way to resolve this is creating the table on subscriber as well. After that apply worker will be back to work and skip changes for the new table and move to other changes.
Since REFRESH PUBLICATION is not called yet, any change for the new table will not be replicated.

If replication of the new table will not start anyway (until REFRESH PUBLICATION), do we really need to have that table on the subscriber for apply worker to work?
AFAIU any change on publication would not affect logical replication setup until the publication gets refreshed on subscriber.

I also have the same understanding but I think if we skip replicating
some table due to the reason that the corresponding publication has
not been refreshed then it is better to LOG that information instead
of silently skipping it. Along similar lines, personally, I don't see
a very strong reason to not throw the ERROR in the case you mentioned.
Do you have any use case in mind where the user has added a table to
the publication even though she doesn't want it to be replicated? One
thing that came to my mind is that due to some reason after adding a
table to the publication, there is some delay in creating the table on
the subscriber and then refreshing the publication and during that
time user expects replication to proceed smoothly. But for that isn't
it better that the user completes the setup on the subscriber before
performing operations on such a table? Because say there is some error
in the subscriber-side setup that the user misses then it would be a
surprise for a user to not see the table data. In such a case, an
ERROR/LOG information could be helpful for users.

--
With Regards,
Amit Kapila.

Melih Mutlu

m.melihmutlu@gmail.com

about 3 years ago

In reply to: Amit Kapila (#2)

Re: Apply worker fails if a relation is missing on subscriber even if refresh publication has not been refreshed yet

Hi Amit,

Amit Kapila <amit.kapila16@gmail.com>, 23 Ara 2022 Cum, 09:39 tarihinde
şunu yazdı:

I also have the same understanding but I think if we skip replicating
some table due to the reason that the corresponding publication has
not been refreshed then it is better to LOG that information instead
of silently skipping it.

By skipping it, I mean the apply worker does not try to do anything with
the changes for the missing table since the worker simply cannot apply it
and only fails.
But I agree with you about logging it, the patch currently logs such cases
as warnings instead of errors.
I can make it LOG instead of WARNING, just wanted to make something
different than ERROR.

Do you have any use case in mind where the user has added a table to

the publication even though she doesn't want it to be replicated? One
thing that came to my mind is that due to some reason after adding a
table to the publication, there is some delay in creating the table on
the subscriber and then refreshing the publication and during that
time user expects replication to proceed smoothly. But for that isn't
it better that the user completes the setup on the subscriber before
performing operations on such a table? Because say there is some error
in the subscriber-side setup that the user misses then it would be a
surprise for a user to not see the table data. In such a case, an
ERROR/LOG information could be helpful for users.

I don't really see a specific use case for this. The delay between creating
a table on publisher and then on subscriber usually may not be even
that long to hurt anything. It just seems unnecessary to me that apply
worker goes into a failure loop until someone creates the table on the
subscriber, even though the table will not be replicated immediately.

Users also shouldn't expect for such tables to be replicated if they did
not refresh the publication. That will not happen with or without this
change. So I don't think it would be a surprise when they see their new
table has not been replicated yet. This issue will also be visible in the
logs, just not as an error.
And if users decide/remember to refresh the publication, they cannot do
that anyway if the table is still missing on the subscriber. So the REFRESH
PUBLICATION command will fail and then users will see an error log.

Best,
--
Melih Mutlu
Microsoft

Amit Kapila

amit.kapila16@gmail.com

about 3 years ago

In reply to: Melih Mutlu (#3)

Re: Apply worker fails if a relation is missing on subscriber even if refresh publication has not been refreshed yet

On Mon, Dec 26, 2022 at 3:41 PM Melih Mutlu <m.melihmutlu@gmail.com> wrote:

Do you have any use case in mind where the user has added a table to
the publication even though she doesn't want it to be replicated? One
thing that came to my mind is that due to some reason after adding a
table to the publication, there is some delay in creating the table on
the subscriber and then refreshing the publication and during that
time user expects replication to proceed smoothly. But for that isn't
it better that the user completes the setup on the subscriber before
performing operations on such a table? Because say there is some error
in the subscriber-side setup that the user misses then it would be a
surprise for a user to not see the table data. In such a case, an
ERROR/LOG information could be helpful for users.

I don't really see a specific use case for this. The delay between creating a table on publisher and then on subscriber usually may not be even that long to hurt anything. It just seems unnecessary to me that apply worker goes into a failure loop until someone creates the table on the subscriber, even though the table will not be replicated immediately.

To avoid the failure loop, users can use disable_on_error subscription
parameter. I see your point but not sure if it is worth changing the
current behavior without any specific use case which we want to
address with this change.

--
With Regards,
Amit Kapila.