Add an option to skip loading missing publication to avoid logical replication failure

Started by vignesh Cabout 2 years ago48 messages
Jump to latest
#1vignesh C
vignesh21@gmail.com

Hi,

Currently ALTER SUBSCRIPTION ... SET PUBLICATION will break the
logical replication in certain cases. This can happen as the apply
worker will get restarted after SET PUBLICATION, the apply worker will
use the existing slot and replication origin corresponding to the
subscription. Now, it is possible that before restart the origin has
not been updated and the WAL start location points to a location prior
to where PUBLICATION pub exists which can lead to such an error. Once
this error occurs, apply worker will never be able to proceed and will
always return the same error.

There was discussion on this and Amit had posted a patch to handle
this at [2]. Amit's patch does continue using a historic snapshot but
ignores publications that are not found for the purpose of computing
RelSyncEntry attributes. We won't mark such an entry as valid till all
the publications are loaded without anything missing. This means we
won't publish operations on tables corresponding to that publication
till we found such a publication and that seems okay.
I have added an option skip_not_exist_publication to enable this
operation only when skip_not_exist_publication is specified as true.
There is no change in default behavior when skip_not_exist_publication
is specified as false.

But one thing to note with the patch (with skip_not_exist_publication
option) is that replication of few WAL entries will be skipped till
the publication is loaded like in the below example:
-- Create table in publisher and subscriber
create table t1(c1 int);
create table t2(c1 int);

-- Create publications
create publication pub1 for table t1;
create publication pub2 for table t2;

-- Create subscription
create subscription test1 connection 'dbname=postgres host=localhost
port=5432' publication pub1, pub2;

-- Drop one publication
drop publication pub1;

-- Insert in the publisher
insert into t1 values(11);
insert into t2 values(21);

-- Select in subscriber
postgres=# select * from t1;
c1
----
(0 rows)

postgres=# select * from t2;
c1
----
21
(1 row)

-- Create the dropped publication in publisher
create publication pub1 for table t1;

-- Insert in the publisher
insert into t1 values(12);
postgres=# select * from t1;
c1
----
11
12
(2 rows)

-- Select data in subscriber
postgres=# select * from t1; -- record with value 11 will be missing
in subscriber
c1
----
12
(1 row)

Thoughts?

[1]: /messages/by-id/CAA4eK1+T-ETXeRM4DHWzGxBpKafLCp__5bPA_QZfFQp7-0wj4Q@mail.gmail.com

Regards,
Vignesh

Attachments:

v1-0001-Skip-loading-the-publication-if-the-publication-d.patchtext/x-patch; charset=US-ASCII; name=v1-0001-Skip-loading-the-publication-if-the-publication-d.patchDownload+21-8
v1-0002-Added-an-option-skip_not_exist_publication-which-.patchtext/x-patch; charset=US-ASCII; name=v1-0002-Added-an-option-skip_not_exist_publication-which-.patchDownload+82-18
#2vignesh C
vignesh21@gmail.com
In reply to: vignesh C (#1)
Re: Add an option to skip loading missing publication to avoid logical replication failure

On Mon, 19 Feb 2024 at 12:48, vignesh C <vignesh21@gmail.com> wrote:

Hi,

Currently ALTER SUBSCRIPTION ... SET PUBLICATION will break the
logical replication in certain cases. This can happen as the apply
worker will get restarted after SET PUBLICATION, the apply worker will
use the existing slot and replication origin corresponding to the
subscription. Now, it is possible that before restart the origin has
not been updated and the WAL start location points to a location prior
to where PUBLICATION pub exists which can lead to such an error. Once
this error occurs, apply worker will never be able to proceed and will
always return the same error.

There was discussion on this and Amit had posted a patch to handle
this at [2]. Amit's patch does continue using a historic snapshot but
ignores publications that are not found for the purpose of computing
RelSyncEntry attributes. We won't mark such an entry as valid till all
the publications are loaded without anything missing. This means we
won't publish operations on tables corresponding to that publication
till we found such a publication and that seems okay.
I have added an option skip_not_exist_publication to enable this
operation only when skip_not_exist_publication is specified as true.
There is no change in default behavior when skip_not_exist_publication
is specified as false.

I have updated the patch to now include changes for pg_dump, added few
tests, describe changes and added documentation changes. The attached
v2 version patch has the changes for the same.

Regards,
Vignesh

Attachments:

v2-0001-Skip-loading-the-publication-if-the-publication-d.patchtext/x-patch; charset=US-ASCII; name=v2-0001-Skip-loading-the-publication-if-the-publication-d.patchDownload+21-8
v2-0002-Added-an-option-skip_not_exist_publication-which-.patchtext/x-patch; charset=US-ASCII; name=v2-0002-Added-an-option-skip_not_exist_publication-which-.patchDownload+272-81
#3Amit Kapila
amit.kapila16@gmail.com
In reply to: vignesh C (#1)
Re: Add an option to skip loading missing publication to avoid logical replication failure

On Mon, Feb 19, 2024 at 12:49 PM vignesh C <vignesh21@gmail.com> wrote:

Currently ALTER SUBSCRIPTION ... SET PUBLICATION will break the
logical replication in certain cases. This can happen as the apply
worker will get restarted after SET PUBLICATION, the apply worker will
use the existing slot and replication origin corresponding to the
subscription. Now, it is possible that before restart the origin has
not been updated and the WAL start location points to a location prior
to where PUBLICATION pub exists which can lead to such an error. Once
this error occurs, apply worker will never be able to proceed and will
always return the same error.

There was discussion on this and Amit had posted a patch to handle
this at [2]. Amit's patch does continue using a historic snapshot but
ignores publications that are not found for the purpose of computing
RelSyncEntry attributes. We won't mark such an entry as valid till all
the publications are loaded without anything missing. This means we
won't publish operations on tables corresponding to that publication
till we found such a publication and that seems okay.
I have added an option skip_not_exist_publication to enable this
operation only when skip_not_exist_publication is specified as true.
There is no change in default behavior when skip_not_exist_publication
is specified as false.

Did you try to measure the performance impact of this change? We can
try a few cases where DDL and DMLs are involved, missing publication
(drop publication and recreate after a varying number of records to
check the impact).

The other names for the option could be:
skip_notexistant_publications, or ignore_nonexistant_publications. Can
we think of any others?

--
With Regards,
Amit Kapila.

#4vignesh C
vignesh21@gmail.com
In reply to: Amit Kapila (#3)
Re: Add an option to skip loading missing publication to avoid logical replication failure

On Fri, 14 Feb 2025 at 15:36, Amit Kapila <amit.kapila16@gmail.com> wrote:

Did you try to measure the performance impact of this change? We can
try a few cases where DDL and DMLs are involved, missing publication
(drop publication and recreate after a varying number of records to
check the impact).

Since we don't have an exact scenario to compare with the patch
(because, in the current HEAD, when the publication is missing, an
error is thrown and the walsender/worker restarts), I compared the
positive case, where records are successfully replicated to the
subscriber, as shown below. For the scenario with the patch, I ran the
same test, where the publication is dropped before the insert,
allowing the walsender to check whether the publication is present.
The test results, which represent the median of 7 runs and the
execution run is in milliseconds, are provided below:

Brach/records | 100 | 1000 | 10000 | 100000 | 1000000
Head | 1.214 | 2.548 | 10.823 | 90.3 | 951.833
Patch | 1.215 | 2.5485 | 10.8545 | 90.94 | 955.134
% diff | 0.082 | 0.020 | 0.291 | 0.704 | 0.347

I noticed that the test run with patches is very negligible. The
scripts used for execution are attached.

The other names for the option could be:
skip_notexistant_publications, or ignore_nonexistant_publications. Can
we think of any others?

How about using ignore_missing_publications or skip_missing_publications?

Regards,
Vignesh

Attachments:

v3-0001-Skip-loading-the-publication-if-the-publication-d.patchapplication/octet-stream; name=v3-0001-Skip-loading-the-publication-if-the-publication-d.patchDownload+21-8
v3-0002-Added-an-option-skip_not_exist_publication-which-.patchapplication/octet-stream; name=v3-0002-Added-an-option-skip_not_exist_publication-which-.patchDownload+272-80
test_skip_pub_1702.shtext/x-sh; charset=US-ASCII; name=test_skip_pub_1702.shDownload
insert.shtext/x-sh; charset=US-ASCII; name=insert.shDownload
#5Amit Kapila
amit.kapila16@gmail.com
In reply to: vignesh C (#4)
Re: Add an option to skip loading missing publication to avoid logical replication failure

On Tue, Feb 18, 2025 at 2:24 PM vignesh C <vignesh21@gmail.com> wrote:

On Fri, 14 Feb 2025 at 15:36, Amit Kapila <amit.kapila16@gmail.com> wrote:

Did you try to measure the performance impact of this change? We can
try a few cases where DDL and DMLs are involved, missing publication
(drop publication and recreate after a varying number of records to
check the impact).

Since we don't have an exact scenario to compare with the patch
(because, in the current HEAD, when the publication is missing, an
error is thrown and the walsender/worker restarts), I compared the
positive case, where records are successfully replicated to the
subscriber, as shown below. For the scenario with the patch, I ran the
same test, where the publication is dropped before the insert,
allowing the walsender to check whether the publication is present.
The test results, which represent the median of 7 runs and the
execution run is in milliseconds, are provided below:

Brach/records | 100 | 1000 | 10000 | 100000 | 1000000
Head | 1.214 | 2.548 | 10.823 | 90.3 | 951.833
Patch | 1.215 | 2.5485 | 10.8545 | 90.94 | 955.134
% diff | 0.082 | 0.020 | 0.291 | 0.704 | 0.347

I noticed that the test run with patches is very negligible. The
scripts used for execution are attached.

You have used the synchronous_standby_name to evaluate the performance
which covers other parts of replication than the logical decoding. It
would be better to test using pg_recvlogical.

--
With Regards,
Amit Kapila.

#6vignesh C
vignesh21@gmail.com
In reply to: Amit Kapila (#5)
Re: Add an option to skip loading missing publication to avoid logical replication failure

On Tue, 18 Feb 2025 at 16:53, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Feb 18, 2025 at 2:24 PM vignesh C <vignesh21@gmail.com> wrote:

On Fri, 14 Feb 2025 at 15:36, Amit Kapila <amit.kapila16@gmail.com> wrote:

Did you try to measure the performance impact of this change? We can
try a few cases where DDL and DMLs are involved, missing publication
(drop publication and recreate after a varying number of records to
check the impact).

Since we don't have an exact scenario to compare with the patch
(because, in the current HEAD, when the publication is missing, an
error is thrown and the walsender/worker restarts), I compared the
positive case, where records are successfully replicated to the
subscriber, as shown below. For the scenario with the patch, I ran the
same test, where the publication is dropped before the insert,
allowing the walsender to check whether the publication is present.
The test results, which represent the median of 7 runs and the
execution run is in milliseconds, are provided below:

Brach/records | 100 | 1000 | 10000 | 100000 | 1000000
Head | 1.214 | 2.548 | 10.823 | 90.3 | 951.833
Patch | 1.215 | 2.5485 | 10.8545 | 90.94 | 955.134
% diff | 0.082 | 0.020 | 0.291 | 0.704 | 0.347

I noticed that the test run with patches is very negligible. The
scripts used for execution are attached.

You have used the synchronous_standby_name to evaluate the performance
which covers other parts of replication than the logical decoding. It
would be better to test using pg_recvlogical.

Here are the test runs with pg_recvlogical, the test results, which
represent the median of 10 runs and the execution run is in
milliseconds, are provided below:
Brach/records | 100 | 1000 | 10000 | 100000 | 1000000
Head | 9.95 | 15.26 | 62.62 | 536.57 | 8480.83
Patch | 9.218 | 10.32 | 23.05 | 143.83 | 4852.43
% diff | 7.356 | 32.38 | 63.19 | 73.193| 42.783

We observe that test execution with the patch performs better between
7.35 percent to 73.19 percent. This is because, in HEAD, after loading
and verifying that the publication is valid, it must continue
processing to output the change. In contrast, with the patch,
outputting the change is skipped since the publication does not exist.

The attached script has the script that was used for testing. Here the
NUM_RECORDS count should be changed accordingly for each of the tests
and while running the test with the patch change uncomment the drop
publication command.

Regards,
Vignesh

Attachments:

test_pg_recvlogical.shtext/x-sh; charset=US-ASCII; name=test_pg_recvlogical.shDownload
#7vignesh C
vignesh21@gmail.com
In reply to: vignesh C (#6)
Re: Add an option to skip loading missing publication to avoid logical replication failure

On Tue, 25 Feb 2025 at 15:32, vignesh C <vignesh21@gmail.com> wrote:

The attached script has the script that was used for testing. Here the
NUM_RECORDS count should be changed accordingly for each of the tests
and while running the test with the patch change uncomment the drop
publication command.

I have done further analysis on the test and changed the test to
compare it better with HEAD. The execution time is in milliseconds.
Brach/records | 100 | 1000 | 10000 | 100000 | 1000000
Head | 10.43 | 15.86 | 64.44 | 550.56 | 8991.04
Patch | 11.35 | 17.26 | 73.50 | 640.21 | 10104.72
% diff | -8.82 | -8.85 | -14.08 | -16.28 | -12.38

There is a performance degradation in the range of 8.8 to 16.2 percent.

Test Details (With Head):
a) Create two publications for the same table. b) Insert the records
listed in the table below. c) Use pg_recvlogical to capture the
changes.
Test Details (With Patch):
a) Create two publications for the same table.b) Drop one
publication(to check the impact of skip missing publication), ensuring
that changes from the remaining publication continue to be captured.
c) Insert the records listed in the table below.d) Use pg_recvlogical
to capture the changes.

The performance degradation is in the range of 8.8 to 16.2
percentage.The script used for the testing is attached, while running
with patch the drop publication command in script should be
uncommented and the record count should be changed for each of the
run. Also I changed the patch so that we need not execute the
get_rel_sync_entry code flow for every record in case of missing
publication case and to do so only in case when the publications have
changed. The attached v4 version patch has the changes for the same.

Regards,
Vignesh

Attachments:

v4-0001-Fix-logical-replication-breakage-after.patchapplication/octet-stream; name=v4-0001-Fix-logical-replication-breakage-after.patchDownload+66-9
test_pg_recvlogical.shtext/x-sh; charset=US-ASCII; name=test_pg_recvlogical.shDownload
#8Amit Kapila
amit.kapila16@gmail.com
In reply to: vignesh C (#7)
Re: Add an option to skip loading missing publication to avoid logical replication failure

On Mon, Mar 3, 2025 at 2:30 PM vignesh C <vignesh21@gmail.com> wrote:

On Tue, 25 Feb 2025 at 15:32, vignesh C <vignesh21@gmail.com> wrote:

The attached script has the script that was used for testing. Here the
NUM_RECORDS count should be changed accordingly for each of the tests
and while running the test with the patch change uncomment the drop
publication command.

I have done further analysis on the test and changed the test to
compare it better with HEAD. The execution time is in milliseconds.
Brach/records | 100 | 1000 | 10000 | 100000 | 1000000
Head | 10.43 | 15.86 | 64.44 | 550.56 | 8991.04
Patch | 11.35 | 17.26 | 73.50 | 640.21 | 10104.72
% diff | -8.82 | -8.85 | -14.08 | -16.28 | -12.38

There is a performance degradation in the range of 8.8 to 16.2 percent.

- /* Validate the entry */
- if (!entry->replicate_valid)
+ /*
+ * If the publication is invalid, check for updates.
+ * This optimization ensures that the next block, which queries the system
+ * tables and builds the relation entry, runs only if a new publication was
+ * created.
+ */
+ if (!publications_valid && data->publications)
+ {
+ bool skipped_pub = false;
+ List    *publications;
+
+ publications = LoadPublications(data->publication_names, &skipped_pub);

The publications_valid flag indicates whether the publications cache
is valid or not; the flag is set to false for any invalidation in the
pg_publication catalog. I wonder that instead of using the same flag
what if we use a separate publications_skipped flag? If that works,
you don't even need to change the current location where we
LoadPublications.

--
With Regards,
Amit Kapila.

#9vignesh C
vignesh21@gmail.com
In reply to: Amit Kapila (#8)
Re: Add an option to skip loading missing publication to avoid logical replication failure

On Mon, 3 Mar 2025 at 16:41, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Mar 3, 2025 at 2:30 PM vignesh C <vignesh21@gmail.com> wrote:

On Tue, 25 Feb 2025 at 15:32, vignesh C <vignesh21@gmail.com> wrote:

The attached script has the script that was used for testing. Here the
NUM_RECORDS count should be changed accordingly for each of the tests
and while running the test with the patch change uncomment the drop
publication command.

I have done further analysis on the test and changed the test to
compare it better with HEAD. The execution time is in milliseconds.
Brach/records | 100 | 1000 | 10000 | 100000 | 1000000
Head | 10.43 | 15.86 | 64.44 | 550.56 | 8991.04
Patch | 11.35 | 17.26 | 73.50 | 640.21 | 10104.72
% diff | -8.82 | -8.85 | -14.08 | -16.28 | -12.38

There is a performance degradation in the range of 8.8 to 16.2 percent.

- /* Validate the entry */
- if (!entry->replicate_valid)
+ /*
+ * If the publication is invalid, check for updates.
+ * This optimization ensures that the next block, which queries the system
+ * tables and builds the relation entry, runs only if a new publication was
+ * created.
+ */
+ if (!publications_valid && data->publications)
+ {
+ bool skipped_pub = false;
+ List    *publications;
+
+ publications = LoadPublications(data->publication_names, &skipped_pub);

The publications_valid flag indicates whether the publications cache
is valid or not; the flag is set to false for any invalidation in the
pg_publication catalog. I wonder that instead of using the same flag
what if we use a separate publications_skipped flag? If that works,
you don't even need to change the current location where we
LoadPublications.

There is almost negligible dip with the above suggested way, the test
results for the same is given below(execution time is in milli
seconds):
Brach/records | 100 | 1000 | 10000 | 100000 | 1000000
Head | 10.25 | 15.85 | 65.53 | 569.15 | 9194.19
Patch | 10.25 | 15.84 | 65.91 | 571.75 | 9208.66
% diff | 0.00 | 0.06 | -0.58 | -0.46 | -0.16

There is a performance dip in the range of 0 to 0.58 percent.
The attached patch has the changes for the same. The test script used
is also attached.

Regards,
Vignesh

Attachments:

v5-0001-Fix-logical-replication-breakage-after.patchapplication/x-patch; name=v5-0001-Fix-logical-replication-breakage-after.patchDownload+33-9
test_pg_recvlogical.shtext/x-sh; charset=US-ASCII; name=test_pg_recvlogical.shDownload
#10vignesh C
vignesh21@gmail.com
In reply to: vignesh C (#9)
Re: Add an option to skip loading missing publication to avoid logical replication failure

On Tue, 4 Mar 2025 at 12:22, vignesh C <vignesh21@gmail.com> wrote:

On Mon, 3 Mar 2025 at 16:41, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Mar 3, 2025 at 2:30 PM vignesh C <vignesh21@gmail.com> wrote:

On Tue, 25 Feb 2025 at 15:32, vignesh C <vignesh21@gmail.com> wrote:

The attached script has the script that was used for testing. Here the
NUM_RECORDS count should be changed accordingly for each of the tests
and while running the test with the patch change uncomment the drop
publication command.

I have done further analysis on the test and changed the test to
compare it better with HEAD. The execution time is in milliseconds.
Brach/records | 100 | 1000 | 10000 | 100000 | 1000000
Head | 10.43 | 15.86 | 64.44 | 550.56 | 8991.04
Patch | 11.35 | 17.26 | 73.50 | 640.21 | 10104.72
% diff | -8.82 | -8.85 | -14.08 | -16.28 | -12.38

There is a performance degradation in the range of 8.8 to 16.2 percent.

- /* Validate the entry */
- if (!entry->replicate_valid)
+ /*
+ * If the publication is invalid, check for updates.
+ * This optimization ensures that the next block, which queries the system
+ * tables and builds the relation entry, runs only if a new publication was
+ * created.
+ */
+ if (!publications_valid && data->publications)
+ {
+ bool skipped_pub = false;
+ List    *publications;
+
+ publications = LoadPublications(data->publication_names, &skipped_pub);

The publications_valid flag indicates whether the publications cache
is valid or not; the flag is set to false for any invalidation in the
pg_publication catalog. I wonder that instead of using the same flag
what if we use a separate publications_skipped flag? If that works,
you don't even need to change the current location where we
LoadPublications.

There is almost negligible dip with the above suggested way, the test
results for the same is given below(execution time is in milli
seconds):
Brach/records | 100 | 1000 | 10000 | 100000 | 1000000
Head | 10.25 | 15.85 | 65.53 | 569.15 | 9194.19
Patch | 10.25 | 15.84 | 65.91 | 571.75 | 9208.66
% diff | 0.00 | 0.06 | -0.58 | -0.46 | -0.16

There is a performance dip in the range of 0 to 0.58 percent.
The attached patch has the changes for the same. The test script used
is also attached.

On further thinking, I felt the use of publications_updated variable
is not required we can use publications_valid itself which will be set
if the publication system table is invalidated. Here is a patch for
the same.

Regards,
Vignesh

Attachments:

v6-0001-Fix-logical-replication-breakage-after-ALTER-SUBS.patchtext/x-patch; charset=US-ASCII; name=v6-0001-Fix-logical-replication-breakage-after-ALTER-SUBS.patchDownload+12-5
#11Amit Kapila
amit.kapila16@gmail.com
In reply to: vignesh C (#9)
Re: Add an option to skip loading missing publication to avoid logical replication failure

On Tue, Mar 4, 2025 at 12:23 PM vignesh C <vignesh21@gmail.com> wrote:

There is almost negligible dip with the above suggested way, the test
results for the same is given below(execution time is in milli
seconds):
Brach/records | 100 | 1000 | 10000 | 100000 | 1000000
Head | 10.25 | 15.85 | 65.53 | 569.15 | 9194.19
Patch | 10.25 | 15.84 | 65.91 | 571.75 | 9208.66
% diff | 0.00 | 0.06 | -0.58 | -0.46 | -0.16

There is a performance dip in the range of 0 to 0.58 percent.
The attached patch has the changes for the same. The test script used
is also attached.

The patch still needs more review but the change has negligible
performance impact. The next step is to get more opinions on whether
we should add a new subscription option (say
skip_not_existant_publication) for this work. See patch v1-0002-* in
email [1]/messages/by-id/CALDaNm0-n8FGAorM+bTxkzn+AOUyx5=L_XmnvOP6T24+-NcBKg@mail.gmail.com. The problem summary is explained in email [2]/messages/by-id/CAA4eK1Lc=NDV1HrY2gNasFK90MtysnA575a+rd0p+POjXN+Spw@mail.gmail.com and in the
commit message of the 0001 patch in this thread. But still, let me
write briefly for the ease of others.

The problem is that ALTER SUBSCRIPTION ... SET PUBLICATION ... will
lead to restarting of apply worker, and after the restart, the apply
worker will use the existing slot and replication origin corresponding
to the subscription. Now, it is possible that before the restart, the
origin has not been updated, and the WAL start location points to a
location before where PUBLICATION pointed to by SET PUBLICATION
exists. This leads to an error: "ERROR: publication "pub1" does not
exist". Once this error occurs, apply worker will never be able to
proceed and will always return the same error. For users, this is a
problem because they would have created a publication before executing
ALTER SUBSCRIPTION ... SET PUBLICATION .. and now they have no way to
proceed.

The solution we came up with is to skip loading the publication if the
publication does not exist. We load the publication later and update
the relation entry when the publication gets created.

The two main concerns with this idea, as shared in email [3]/messages/by-id/dc08add3-10a8-738b-983a-191c7406707b@enterprisedb.com, are
performance implications of this change and the possibility of current
behaviour expectations from the users.

We came up with a solution where the performance impact is negligible,
as shown in the tests [4]/messages/by-id/CALDaNm2Xkm1M-ik2RLJZ9rMhW2zW2GRLL6ePyZJbXcAjOVwzXg@mail.gmail.com. For that, we won't try to reload the
skipped/missing publication for each change but will attempt it only
when any new publication is created/dropped for a valid relation entry
in RelationSyncCache (maintained by pgoutput).

The new option skip_not_existant_publication is to address the second
concern "Imagine you have a subscriber using two publications p1 and
p2, and someone comes around and drops p1 by mistake. With the
proposed patch, the subscription will notice this, but it'll continue
sending data ignoring the missing publication. Yes, it will continue
working, but it's quite possible this breaks the subscriber and it's
be better to fail and stop replicating.".

I see the point of adding such an option to avoid breaking the current
applications (if there are any) that are relying on current behaviour.
But OTOH, I am not sure if users expect us to fail explicitly in such
scenarios.

This is a long-standing behaviour for which we get reports from time
to time, and once analyzing a failure, Tom also looked at it and
agreed that we don't have much choice to avoid skipping non-existent
publications [5]/messages/by-id/631312.1707251789@sss.pgh.pa.us. But we never concluded as to whether skipping should
be a default behavior or an optional one. So, we need more opinions on
it.

Thoughts?

[1]: /messages/by-id/CALDaNm0-n8FGAorM+bTxkzn+AOUyx5=L_XmnvOP6T24+-NcBKg@mail.gmail.com
[2]: /messages/by-id/CAA4eK1Lc=NDV1HrY2gNasFK90MtysnA575a+rd0p+POjXN+Spw@mail.gmail.com
[3]: /messages/by-id/dc08add3-10a8-738b-983a-191c7406707b@enterprisedb.com
[4]: /messages/by-id/CALDaNm2Xkm1M-ik2RLJZ9rMhW2zW2GRLL6ePyZJbXcAjOVwzXg@mail.gmail.com
[5]: /messages/by-id/631312.1707251789@sss.pgh.pa.us

--
With Regards,
Amit Kapila.

#12Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#11)
Re: Add an option to skip loading missing publication to avoid logical replication failure

On Tue, Mar 4, 2025 at 9:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Mar 4, 2025 at 12:23 PM vignesh C <vignesh21@gmail.com> wrote:

There is almost negligible dip with the above suggested way, the test
results for the same is given below(execution time is in milli
seconds):
Brach/records | 100 | 1000 | 10000 | 100000 | 1000000
Head | 10.25 | 15.85 | 65.53 | 569.15 | 9194.19
Patch | 10.25 | 15.84 | 65.91 | 571.75 | 9208.66
% diff | 0.00 | 0.06 | -0.58 | -0.46 | -0.16

There is a performance dip in the range of 0 to 0.58 percent.
The attached patch has the changes for the same. The test script used
is also attached.

The patch still needs more review but the change has negligible
performance impact. The next step is to get more opinions on whether
we should add a new subscription option (say
skip_not_existant_publication) for this work. See patch v1-0002-* in
email [1]. The problem summary is explained in email [2] and in the
commit message of the 0001 patch in this thread. But still, let me
write briefly for the ease of others.

The problem is that ALTER SUBSCRIPTION ... SET PUBLICATION ... will
lead to restarting of apply worker, and after the restart, the apply
worker will use the existing slot and replication origin corresponding
to the subscription. Now, it is possible that before the restart, the
origin has not been updated, and the WAL start location points to a
location before where PUBLICATION pointed to by SET PUBLICATION
exists. This leads to an error: "ERROR: publication "pub1" does not
exist". Once this error occurs, apply worker will never be able to
proceed and will always return the same error. For users, this is a
problem because they would have created a publication before executing
ALTER SUBSCRIPTION ... SET PUBLICATION .. and now they have no way to
proceed.

The solution we came up with is to skip loading the publication if the
publication does not exist. We load the publication later and update
the relation entry when the publication gets created.

The two main concerns with this idea, as shared in email [3], are
performance implications of this change and the possibility of current
behaviour expectations from the users.

We came up with a solution where the performance impact is negligible,
as shown in the tests [4]. For that, we won't try to reload the
skipped/missing publication for each change but will attempt it only
when any new publication is created/dropped for a valid relation entry
in RelationSyncCache (maintained by pgoutput).

Thank you for summarizing the issue. That helps catch up a lot.

The new option skip_not_existant_publication is to address the second
concern "Imagine you have a subscriber using two publications p1 and
p2, and someone comes around and drops p1 by mistake. With the
proposed patch, the subscription will notice this, but it'll continue
sending data ignoring the missing publication. Yes, it will continue
working, but it's quite possible this breaks the subscriber and it's
be better to fail and stop replicating.".

I think that in this particular situation the current behavior would
be likely to miss more changes than the patch'ed behavior case.

After the logical replication stops, the user would have to alter the
subscription to subscribe to only p1 by executing 'ALTER SUBSCRIPTION
... SET PUBLICATION p1' in order to resume the logical replication. In
any case, the publisher might be receiving further changes but the
subscriber would end up missing changes for tables associated with p2
generated while p2 doesn't exist. Even if the user re-creates the
publication p2 after that, it would be hard for users to re-alter the
subscription to get changes for tables associated with p1 and p2 from
the exact point of p2 being created. Therefore, the subscriber could
end up missing some changes that happened between 'CREATE PUBLICATION
p2' and 'ALTER SUBSCRIPTION ... SET PUBLICATION p1, p2'.

On the other hand, with the patch, the publication can send the
changes to tables associated with p1 and p2 as soon as it decodes the
WAL record of re-CREATE PUBLICATION p2.

I see the point of adding such an option to avoid breaking the current
applications (if there are any) that are relying on current behaviour.
But OTOH, I am not sure if users expect us to fail explicitly in such
scenarios.

On the side note, with the patch since we ignore missing publications
we will be able to create a subscription with whatever publications
regardless their existence like:

CREATE SUBSCRIPTION ... PUBLICATION pub1, pub2, pub3, pub4, pub5, ..., pub1000;

The walsender corresponding to such subscriber can stream changes as
soon as the listed publications are created on the publisher and
REFRESH PUBLICATION is executed.

This is a long-standing behaviour for which we get reports from time
to time, and once analyzing a failure, Tom also looked at it and
agreed that we don't have much choice to avoid skipping non-existent
publications [5]. But we never concluded as to whether skipping should
be a default behavior or an optional one. So, we need more opinions on
it.

I'm leaning toward making the skipping behavior a default as I could
not find a good benefit for the current behavior (i.e., stopping
logical replication due to missing publications).

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#13Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#12)
Re: Add an option to skip loading missing publication to avoid logical replication failure

On Sun, Mar 9, 2025 at 9:00 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Mar 4, 2025 at 9:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I see the point of adding such an option to avoid breaking the current
applications (if there are any) that are relying on current behaviour.
But OTOH, I am not sure if users expect us to fail explicitly in such
scenarios.

On the side note, with the patch since we ignore missing publications
we will be able to create a subscription with whatever publications
regardless their existence like:

CREATE SUBSCRIPTION ... PUBLICATION pub1, pub2, pub3, pub4, pub5, ..., pub1000;

The walsender corresponding to such subscriber can stream changes as
soon as the listed publications are created on the publisher and
REFRESH PUBLICATION is executed.

Right, but OTOH, one can expect that the data should start replicating
as soon as one creates a publication on the publisher. However, the
data for tables that are part of the publication will start
replicating from the point when the decoding process will process the
WAL corresponding to Create Publication. I suggest to add something
for this in docs unless it is already explained.

This is a long-standing behaviour for which we get reports from time
to time, and once analyzing a failure, Tom also looked at it and
agreed that we don't have much choice to avoid skipping non-existent
publications [5]. But we never concluded as to whether skipping should
be a default behavior or an optional one. So, we need more opinions on
it.

I'm leaning toward making the skipping behavior a default as I could
not find a good benefit for the current behavior (i.e., stopping
logical replication due to missing publications).

Sounds reasonable. We can always add the option at a later point if
required. Thanks for your input. We can continue reviewing and
committing the current patch.

--
With Regards,
Amit Kapila.

#14Amit Kapila
amit.kapila16@gmail.com
In reply to: vignesh C (#10)
Re: Add an option to skip loading missing publication to avoid logical replication failure

On Tue, Mar 4, 2025 at 6:54 PM vignesh C <vignesh21@gmail.com> wrote:

On further thinking, I felt the use of publications_updated variable
is not required we can use publications_valid itself which will be set
if the publication system table is invalidated. Here is a patch for
the same.

The patch relies on the fact that whenever a publication's data is
invalidated, it will also invalidate all the RelSyncEntires as per
publication_invalidation_cb. But note that we are discussing removing
that inefficiency in the thread [1]/messages/by-id/OSCPR01MB14966C09AA201EFFA706576A7F5C92@OSCPR01MB14966.jpnprd01.prod.outlook.com. So, we should try to rebuild the
entry when we have skipped the required publication previously.

Apart from this, please consider updating the docs, as mentioned in my
response to Sawada-San's email.

BTW, I am planning to commit this only on HEAD as this is a behavior
change. Please let me know if you guys think otherwise.

[1]: /messages/by-id/OSCPR01MB14966C09AA201EFFA706576A7F5C92@OSCPR01MB14966.jpnprd01.prod.outlook.com

--
With Regards,
Amit Kapila.

#15Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#14)
Re: Add an option to skip loading missing publication to avoid logical replication failure

On Mon, Mar 10, 2025 at 9:33 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Mar 4, 2025 at 6:54 PM vignesh C <vignesh21@gmail.com> wrote:

On further thinking, I felt the use of publications_updated variable
is not required we can use publications_valid itself which will be set
if the publication system table is invalidated. Here is a patch for
the same.

The patch relies on the fact that whenever a publication's data is
invalidated, it will also invalidate all the RelSyncEntires as per
publication_invalidation_cb. But note that we are discussing removing
that inefficiency in the thread [1]. So, we should try to rebuild the
entry when we have skipped the required publication previously.

Apart from this, please consider updating the docs, as mentioned in my
response to Sawada-San's email.

I'm not sure I fully understand it, but based on your previous email and
the initial email from Vignesh, if IIUC, the issue occurs when a
publication is created after a certain LSN. When ALTER SUBSCRIPTION ... SET
PUBLICATION is executed, the subscriber workers restart and request the
changes based on restart_lsn, which is at an earlier LSN in the WAL than
the LSN at which the publication was created. This leads to an error, and
we are addressing this behavior as part of the fix by skipping the changes
which are between the restart_lsn of subscriber and the lsn at which
publication is created and this behavior looks fine.

BTW, I am planning to commit this only on HEAD as this is a behavior

change. Please let me know if you guys think otherwise.

Somehow this looks like a bug fix which should be backported no? Am I
missing something?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#16Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#15)
Re: Add an option to skip loading missing publication to avoid logical replication failure

On Mon, Mar 10, 2025 at 10:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Mar 10, 2025 at 9:33 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Mar 4, 2025 at 6:54 PM vignesh C <vignesh21@gmail.com> wrote:

On further thinking, I felt the use of publications_updated variable
is not required we can use publications_valid itself which will be set
if the publication system table is invalidated. Here is a patch for
the same.

The patch relies on the fact that whenever a publication's data is
invalidated, it will also invalidate all the RelSyncEntires as per
publication_invalidation_cb. But note that we are discussing removing
that inefficiency in the thread [1]. So, we should try to rebuild the
entry when we have skipped the required publication previously.

Apart from this, please consider updating the docs, as mentioned in my
response to Sawada-San's email.

I'm not sure I fully understand it, but based on your previous email and the initial email from Vignesh, if IIUC, the issue occurs when a publication is created after a certain LSN. When ALTER SUBSCRIPTION ... SET PUBLICATION is executed, the subscriber workers restart and request the changes based on restart_lsn, which is at an earlier LSN in the WAL than the LSN at which the publication was created. This leads to an error, and we are addressing this behavior as part of the fix by skipping the changes which are between the restart_lsn of subscriber and the lsn at which publication is created and this behavior looks fine.

Yes, your understanding is correct, but note that as such, the patch
is simply skipping the missing publication. The skipped changes are
because those were on the table that is not part of any publication
w.r.t historic snapshot we have at the point of time.

BTW, I am planning to commit this only on HEAD as this is a behavior
change. Please let me know if you guys think otherwise.

Somehow this looks like a bug fix which should be backported no? Am I missing something?

We can consider this a bug-fix and backpatch it, but I am afraid that
because this is a behavior change, some users may not like it. Also, I
don't remember seeing public reports for this behavior; that is
probably because it is hard to hit. FYI, we found this via BF
failures. So, I thought it would be better to get this field tested
via HEAD only at this point in time.

--
With Regards,
Amit Kapila.

#17Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#16)
Re: Add an option to skip loading missing publication to avoid logical replication failure

On Mon, Mar 10, 2025 at 10:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Mar 10, 2025 at 10:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Mar 10, 2025 at 9:33 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Mar 4, 2025 at 6:54 PM vignesh C <vignesh21@gmail.com> wrote:

On further thinking, I felt the use of publications_updated variable
is not required we can use publications_valid itself which will be set
if the publication system table is invalidated. Here is a patch for
the same.

The patch relies on the fact that whenever a publication's data is
invalidated, it will also invalidate all the RelSyncEntires as per
publication_invalidation_cb. But note that we are discussing removing
that inefficiency in the thread [1]. So, we should try to rebuild the
entry when we have skipped the required publication previously.

Apart from this, please consider updating the docs, as mentioned in my
response to Sawada-San's email.

I'm not sure I fully understand it, but based on your previous email and the initial email from Vignesh, if IIUC, the issue occurs when a publication is created after a certain LSN. When ALTER SUBSCRIPTION ... SET PUBLICATION is executed, the subscriber workers restart and request the changes based on restart_lsn, which is at an earlier LSN in the WAL than the LSN at which the publication was created. This leads to an error, and we are addressing this behavior as part of the fix by skipping the changes which are between the restart_lsn of subscriber and the lsn at which publication is created and this behavior looks fine.

Yes, your understanding is correct, but note that as such, the patch
is simply skipping the missing publication. The skipped changes are
because those were on the table that is not part of any publication
w.r.t historic snapshot we have at the point of time.

So, it will skip loading the missing publication up to the LSN where
the publication is created and then load it from there, correct? Do we
have a test case for this? I couldn't find one in the latest patch or
in the email thread to demonstrate this behavior.

BTW, I am planning to commit this only on HEAD as this is a behavior
change. Please let me know if you guys think otherwise.

Somehow this looks like a bug fix which should be backported no? Am I missing something?

We can consider this a bug-fix and backpatch it, but I am afraid that
because this is a behavior change, some users may not like it. Also, I
don't remember seeing public reports for this behavior; that is
probably because it is hard to hit. FYI, we found this via BF
failures. So, I thought it would be better to get this field tested
via HEAD only at this point in time.

At the moment, I don't have a strong opinion on this. Since no one has
encountered or reported this issue, it might be the case that it's not
affecting anyone, and we could simply backpatch without causing any
dissatisfaction. However, I'm fine with whatever others decide.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#18vignesh C
vignesh21@gmail.com
In reply to: Amit Kapila (#14)
Re: Add an option to skip loading missing publication to avoid logical replication failure

On Mon, 10 Mar 2025 at 09:33, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Mar 4, 2025 at 6:54 PM vignesh C <vignesh21@gmail.com> wrote:

On further thinking, I felt the use of publications_updated variable
is not required we can use publications_valid itself which will be set
if the publication system table is invalidated. Here is a patch for
the same.

The patch relies on the fact that whenever a publication's data is
invalidated, it will also invalidate all the RelSyncEntires as per
publication_invalidation_cb. But note that we are discussing removing
that inefficiency in the thread [1]. So, we should try to rebuild the
entry when we have skipped the required publication previously.

Apart from this, please consider updating the docs, as mentioned in my
response to Sawada-San's email.

The create subscription documentation already has "We allow
non-existent publications to be specified so that users can add those
later. This means pg_subscription can have non-existent publications."
and should be enough at [1]https://www.postgresql.org/docs/devel/sql-createsubscription.html. Let me know if we need to add more
documentation.

Apart from this I have changed the log level that logs "skipped
loading publication" to WARNING as we log a warning "WARNING:
publications "pub2", "pub3" do not exist on the publisher" in case of
CREATE SUBSCRIPTION and looked similar to this. I can change it to a
different log level in case you feel this is not the right level.

Also I have added a test case for dilip's comment from [2]/messages/by-id/CAFiTN-tgUR6QLSs3UHK7gx4VP7cURGNkufA_xkrQLw9eCnbGQw@mail.gmail.com.
The attached v7 version patch has the changes for the same.

[1]: https://www.postgresql.org/docs/devel/sql-createsubscription.html
[2]: /messages/by-id/CAFiTN-tgUR6QLSs3UHK7gx4VP7cURGNkufA_xkrQLw9eCnbGQw@mail.gmail.com

Regards,
Vignesh

Attachments:

v7-0001-Fix-logical-replication-breakage-after-ALTER-SUBS.patchapplication/octet-stream; name=v7-0001-Fix-logical-replication-breakage-after-ALTER-SUBS.patchDownload+52-4
#19Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#17)
Re: Add an option to skip loading missing publication to avoid logical replication failure

On Tue, Mar 11, 2025 at 9:48 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Mar 10, 2025 at 10:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

BTW, I am planning to commit this only on HEAD as this is a behavior
change. Please let me know if you guys think otherwise.

Somehow this looks like a bug fix which should be backported no? Am I missing something?

We can consider this a bug-fix and backpatch it, but I am afraid that
because this is a behavior change, some users may not like it. Also, I
don't remember seeing public reports for this behavior; that is
probably because it is hard to hit. FYI, we found this via BF
failures. So, I thought it would be better to get this field tested
via HEAD only at this point in time.

At the moment, I don't have a strong opinion on this. Since no one has
encountered or reported this issue, it might be the case that it's not
affecting anyone, and we could simply backpatch without causing any
dissatisfaction. However, I'm fine with whatever others decide.

Sawada-San, others, do you have an opinion on whether to backpatch this change?

--
With Regards,
Amit Kapila.

#20Dilip Kumar
dilipbalaut@gmail.com
In reply to: vignesh C (#18)
Re: Add an option to skip loading missing publication to avoid logical replication failure

On Tue, Mar 11, 2025 at 4:01 PM vignesh C <vignesh21@gmail.com> wrote:

On Mon, 10 Mar 2025 at 09:33, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Mar 4, 2025 at 6:54 PM vignesh C <vignesh21@gmail.com> wrote:

On further thinking, I felt the use of publications_updated variable
is not required we can use publications_valid itself which will be set
if the publication system table is invalidated. Here is a patch for
the same.

The patch relies on the fact that whenever a publication's data is
invalidated, it will also invalidate all the RelSyncEntires as per
publication_invalidation_cb. But note that we are discussing removing
that inefficiency in the thread [1]. So, we should try to rebuild the
entry when we have skipped the required publication previously.

Apart from this, please consider updating the docs, as mentioned in my
response to Sawada-San's email.

The create subscription documentation already has "We allow
non-existent publications to be specified so that users can add those
later. This means pg_subscription can have non-existent publications."
and should be enough at [1]. Let me know if we need to add more
documentation.

Apart from this I have changed the log level that logs "skipped
loading publication" to WARNING as we log a warning "WARNING:
publications "pub2", "pub3" do not exist on the publisher" in case of
CREATE SUBSCRIPTION and looked similar to this. I can change it to a
different log level in case you feel this is not the right level.

Also I have added a test case for dilip's comment from [2].
The attached v7 version patch has the changes for the same.

Thanks, Vignesh, for adding the test. I believe you've tested the
effect of DROP PUBLICATION. However, I think we should also test the
behavior of ALTER SUBSCRIPTION...SET PUBLICATION before creating the
PUBLICATION, and then create the PUBLICATION at a later stage.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#21Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#19)
#22vignesh C
vignesh21@gmail.com
In reply to: Dilip Kumar (#20)
#23Dilip Kumar
dilipbalaut@gmail.com
In reply to: vignesh C (#22)
#24vignesh C
vignesh21@gmail.com
In reply to: Dilip Kumar (#23)
#25Dilip Kumar
dilipbalaut@gmail.com
In reply to: vignesh C (#24)
#26Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#25)
#27Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#26)
#28Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#27)
#29Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amit Kapila (#28)
#30Amit Kapila
amit.kapila16@gmail.com
In reply to: Tom Lane (#29)
#31vignesh C
vignesh21@gmail.com
In reply to: Amit Kapila (#30)
#32Tom Lane
tgl@sss.pgh.pa.us
In reply to: vignesh C (#31)
#33Xuneng Zhou
xunengzhou@gmail.com
In reply to: Tom Lane (#32)
#34vignesh C
vignesh21@gmail.com
In reply to: Tom Lane (#32)
#35Xuneng Zhou
xunengzhou@gmail.com
In reply to: vignesh C (#31)
#36Tom Lane
tgl@sss.pgh.pa.us
In reply to: Xuneng Zhou (#35)
#37vignesh C
vignesh21@gmail.com
In reply to: Xuneng Zhou (#35)
#38Xuneng Zhou
xunengzhou@gmail.com
In reply to: vignesh C (#37)
#39Tom Lane
tgl@sss.pgh.pa.us
In reply to: vignesh C (#37)
#40vignesh C
vignesh21@gmail.com
In reply to: vignesh C (#34)
#41Amit Kapila
amit.kapila16@gmail.com
In reply to: Tom Lane (#39)
#42Xuneng Zhou
xunengzhou@gmail.com
In reply to: Amit Kapila (#41)
#43Amit Kapila
amit.kapila16@gmail.com
In reply to: Xuneng Zhou (#42)
#44Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#43)
#45Xuneng Zhou
xunengzhou@gmail.com
In reply to: vignesh C (#40)
#46Amit Kapila
amit.kapila16@gmail.com
In reply to: Xuneng Zhou (#45)
#47Xuneng Zhou
xunengzhou@gmail.com
In reply to: Amit Kapila (#46)
#48Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#44)