Apply worker fails if a relation is missing on subscriber even if refresh publication has not been refreshed yet

Started by Melih Mutluover 3 years ago4 messageshackers
Jump to latest
#1Melih Mutlu
m.melihmutlu@gmail.com

Hi hackers,

I realized a behaviour of logical replication that seems unexpected to me,
but not totally sure.

Let's say a new table is created and added into a publication and not
created on subscriber yet. Also "ALTER SUBSCRIPTION ... REFRESH
PUBLICATION" has not been called yet.
What I expect in that case would be that logical replication continues to
work as it was working before the new table was created. The new table does
not get replicated until "REFRESH PUBLICATION" as stated here [1]https://www.postgresql.org/docs/current/sql-altersubscription.html.
This is indeed how it actually seems to work. Until we insert a row into
the new table.

After a new row into the new table, the apply worker gets this change and
tries to apply it. As expected, it fails since the table does not exist on
the subscriber yet. And the worker keeps crashing without and can't apply
any changes for any table.
The obvious way to resolve this is creating the table on subscriber as
well. After that apply worker will be back to work and skip changes for the
new table and move to other changes.
Since REFRESH PUBLICATION is not called yet, any change for the new table
will not be replicated.

If replication of the new table will not start anyway (until REFRESH
PUBLICATION), do we really need to have that table on the subscriber for
apply worker to work?
AFAIU any change on publication would not affect logical replication setup
until the publication gets refreshed on subscriber. If this understanding
is correct, then apply worker should be able to run without needing new
tables.
What do you think?

Also; if you agree, then the attached patch attempts to fix this issue.
It relies on the info from pg_subscription_rel so that apply worker only
applies changes for the relations exist in pg_subscription_rel.
Since new tables wouldn't be in there until the next REFRESH PUBLICATION,
missing those tables won't be a problem for existing subscriptions.

[1]: https://www.postgresql.org/docs/current/sql-altersubscription.html

Thanks,
--
Melih Mutlu
Microsoft

Attachments:

0001-Do-not-apply-for-tables-which-not-exist-on-catalog.patchapplication/octet-stream; name=0001-Do-not-apply-for-tables-which-not-exist-on-catalog.patchDownload+59-9
#2Amit Kapila
amit.kapila16@gmail.com
In reply to: Melih Mutlu (#1)
Re: Apply worker fails if a relation is missing on subscriber even if refresh publication has not been refreshed yet

On Thu, Dec 22, 2022 at 7:16 PM Melih Mutlu <m.melihmutlu@gmail.com> wrote:

Hi hackers,

I realized a behaviour of logical replication that seems unexpected to me, but not totally sure.

Let's say a new table is created and added into a publication and not created on subscriber yet. Also "ALTER SUBSCRIPTION ... REFRESH PUBLICATION" has not been called yet.
What I expect in that case would be that logical replication continues to work as it was working before the new table was created. The new table does not get replicated until "REFRESH PUBLICATION" as stated here [1].
This is indeed how it actually seems to work. Until we insert a row into the new table.

After a new row into the new table, the apply worker gets this change and tries to apply it. As expected, it fails since the table does not exist on the subscriber yet. And the worker keeps crashing without and can't apply any changes for any table.
The obvious way to resolve this is creating the table on subscriber as well. After that apply worker will be back to work and skip changes for the new table and move to other changes.
Since REFRESH PUBLICATION is not called yet, any change for the new table will not be replicated.

If replication of the new table will not start anyway (until REFRESH PUBLICATION), do we really need to have that table on the subscriber for apply worker to work?
AFAIU any change on publication would not affect logical replication setup until the publication gets refreshed on subscriber.

I also have the same understanding but I think if we skip replicating
some table due to the reason that the corresponding publication has
not been refreshed then it is better to LOG that information instead
of silently skipping it. Along similar lines, personally, I don't see
a very strong reason to not throw the ERROR in the case you mentioned.
Do you have any use case in mind where the user has added a table to
the publication even though she doesn't want it to be replicated? One
thing that came to my mind is that due to some reason after adding a
table to the publication, there is some delay in creating the table on
the subscriber and then refreshing the publication and during that
time user expects replication to proceed smoothly. But for that isn't
it better that the user completes the setup on the subscriber before
performing operations on such a table? Because say there is some error
in the subscriber-side setup that the user misses then it would be a
surprise for a user to not see the table data. In such a case, an
ERROR/LOG information could be helpful for users.

--
With Regards,
Amit Kapila.

#3Melih Mutlu
m.melihmutlu@gmail.com
In reply to: Amit Kapila (#2)
Re: Apply worker fails if a relation is missing on subscriber even if refresh publication has not been refreshed yet

Hi Amit,

Amit Kapila <amit.kapila16@gmail.com>, 23 Ara 2022 Cum, 09:39 tarihinde
şunu yazdı:

I also have the same understanding but I think if we skip replicating
some table due to the reason that the corresponding publication has
not been refreshed then it is better to LOG that information instead
of silently skipping it.

By skipping it, I mean the apply worker does not try to do anything with
the changes for the missing table since the worker simply cannot apply it
and only fails.
But I agree with you about logging it, the patch currently logs such cases
as warnings instead of errors.
I can make it LOG instead of WARNING, just wanted to make something
different than ERROR.

Do you have any use case in mind where the user has added a table to

the publication even though she doesn't want it to be replicated? One
thing that came to my mind is that due to some reason after adding a
table to the publication, there is some delay in creating the table on
the subscriber and then refreshing the publication and during that
time user expects replication to proceed smoothly. But for that isn't
it better that the user completes the setup on the subscriber before
performing operations on such a table? Because say there is some error
in the subscriber-side setup that the user misses then it would be a
surprise for a user to not see the table data. In such a case, an
ERROR/LOG information could be helpful for users.

I don't really see a specific use case for this. The delay between creating
a table on publisher and then on subscriber usually may not be even
that long to hurt anything. It just seems unnecessary to me that apply
worker goes into a failure loop until someone creates the table on the
subscriber, even though the table will not be replicated immediately.

Users also shouldn't expect for such tables to be replicated if they did
not refresh the publication. That will not happen with or without this
change. So I don't think it would be a surprise when they see their new
table has not been replicated yet. This issue will also be visible in the
logs, just not as an error.
And if users decide/remember to refresh the publication, they cannot do
that anyway if the table is still missing on the subscriber. So the REFRESH
PUBLICATION command will fail and then users will see an error log.

Best,
--
Melih Mutlu
Microsoft

#4Amit Kapila
amit.kapila16@gmail.com
In reply to: Melih Mutlu (#3)
Re: Apply worker fails if a relation is missing on subscriber even if refresh publication has not been refreshed yet

On Mon, Dec 26, 2022 at 3:41 PM Melih Mutlu <m.melihmutlu@gmail.com> wrote:

Do you have any use case in mind where the user has added a table to
the publication even though she doesn't want it to be replicated? One
thing that came to my mind is that due to some reason after adding a
table to the publication, there is some delay in creating the table on
the subscriber and then refreshing the publication and during that
time user expects replication to proceed smoothly. But for that isn't
it better that the user completes the setup on the subscriber before
performing operations on such a table? Because say there is some error
in the subscriber-side setup that the user misses then it would be a
surprise for a user to not see the table data. In such a case, an
ERROR/LOG information could be helpful for users.

I don't really see a specific use case for this. The delay between creating a table on publisher and then on subscriber usually may not be even that long to hurt anything. It just seems unnecessary to me that apply worker goes into a failure loop until someone creates the table on the subscriber, even though the table will not be replicated immediately.

To avoid the failure loop, users can use disable_on_error subscription
parameter. I see your point but not sure if it is worth changing the
current behavior without any specific use case which we want to
address with this change.

--
With Regards,
Amit Kapila.