Optionally automatically disable logical replication subscriptions on error

Started by Mark Dilger over 4 years ago, 133 messages
#1Mark Dilger
Mark Dilger
mark.dilger@enterprisedb.com
1 attachment(s)

Hackers,

Logical replication apply workers for a subscription can easily get stuck in an infinite loop of attempting to apply a change, triggering an error (such as a constraint violation), exiting with an error written to the subscription worker log, and restarting.

As things currently stand, only superusers can create subscriptions. Ongoing work to delegate superuser tasks to non-superusers creates the potential for even more errors to be triggered, specifically, errors where the apply worker does not have permission to make changes to the target table.

The attached patch makes it possible to create a subscription using a new subscription_parameter, "disable_on_error", such that rather than going into an infinite loop, the apply worker will catch errors and automatically disable the subscription, breaking the loop. The new parameter defaults to false. When false, the PG_TRY/PG_CATCH overhead is avoided, so for subscriptions which do not use the feature, there shouldn't be any change. Users can manually clear the error after fixing the underlying issue with an ALTER SUBSCRIPTION .. ENABLE command.
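
To make the loop-breaking behaviour concrete, here is a toy C model. It is not taken from the patch: `ToySubscription`, `toy_worker_run` and `toy_launcher` are hypothetical names, and `setjmp`/`longjmp` merely stand in for the PG_TRY/PG_CATCH machinery. The sketch also always installs the recovery point, so it does not model the overhead avoidance described above for subscriptions that leave the option off.

```c
#include <setjmp.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a subscription's catalog state. */
typedef struct ToySubscription
{
    bool enabled;           /* is the subscription enabled? */
    bool disable_on_error;  /* the proposed option */
} ToySubscription;

static jmp_buf toy_apply_error;  /* plays the role of the error recovery point */

/* Pretend to apply one change; a conflicting change "throws". */
void
toy_apply_change(bool change_conflicts)
{
    if (change_conflicts)
        longjmp(toy_apply_error, 1);
}

/*
 * One worker launch.  Returns true if the change applied, false if the
 * worker exited with an error.  With disable_on_error set, the error path
 * also disables the subscription.
 */
bool
toy_worker_run(ToySubscription *sub, bool change_conflicts)
{
    if (setjmp(toy_apply_error) == 0)
    {
        toy_apply_change(change_conflicts);
        return true;
    }

    /* Error path: reached via longjmp. */
    if (sub->disable_on_error)
        sub->enabled = false;
    return false;
}

/*
 * Relaunch the worker for as long as the subscription stays enabled, capped
 * at max_launches so the toy terminates.  Returns the number of launches.
 */
int
toy_launcher(ToySubscription *sub, bool change_conflicts, int max_launches)
{
    int launches = 0;

    while (sub->enabled && launches < max_launches)
    {
        launches++;
        if (toy_worker_run(sub, change_conflicts))
            break;
    }
    return launches;
}
```

With `disable_on_error` off, a conflicting change consumes every launch the cap allows; with it on, the first failure disables the subscription and the launcher stops after one launch.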

In addition to helping on production systems, this makes writing TAP tests involving error conditions simpler. I originally ran into the motivation to write this patch when frustrated that TAP tests needed to parse the apply worker log file to determine whether permission failures were occurring and what they were. It was also obnoxiously easy to have a test get stuck waiting for a permanently stuck subscription to catch up. This helps with both issues.

I don't think this is quite ready for commit, but I'd like feedback if folks like this idea or want to suggest design changes.

Attachments:

v1-0001-Optionally-disabling-subscriptions-on-error.patch (application/octet-stream)
#2Amit Kapila
Amit Kapila
amit.kapila16@gmail.com
In reply to: Mark Dilger (#1)
Re: Optionally automatically disable logical replication subscriptions on error

On Fri, Jun 18, 2021 at 1:48 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

Hackers,

Logical replication apply workers for a subscription can easily get stuck in an infinite loop of attempting to apply a change, triggering an error (such as a constraint violation), exiting with an error written to the subscription worker log, and restarting.

As things currently stand, only superusers can create subscriptions. Ongoing work to delegate superuser tasks to non-superusers creates the potential for even more errors to be triggered, specifically, errors where the apply worker does not have permission to make changes to the target table.

The attached patch makes it possible to create a subscription using a new subscription_parameter, "disable_on_error", such that rather than going into an infinite loop, the apply worker will catch errors and automatically disable the subscription, breaking the loop. The new parameter defaults to false. When false, the PG_TRY/PG_CATCH overhead is avoided, so for subscriptions which do not use the feature, there shouldn't be any change. Users can manually clear the error after fixing the underlying issue with an ALTER SUBSCRIPTION .. ENABLE command.

I see this idea has merits and it will help users to repair failing
subscriptions. A few points on a quick look at the patch: (a) The patch
seems to be assuming that the error can happen only by the apply worker
but I think the constraint violation can happen via one of the table
sync workers as well, (b) What happens if the error happens when you
are updating the error information in the catalog table. I think
instead of seeing the actual apply time error, the user might see some
other for which it won't be clear what is an appropriate action.

We are also discussing another action like skipping the apply of the
transaction on an error [1]. I think it is better to evaluate both the
proposals as one seems to be an extension of another. Adding
Sawada-San, as he is working on the other proposal.

[1]: /messages/by-id/CAD21AoDeScrsHhLyEPYqN3sydg6PxAPVBboK=30xJfUVihNZDA@mail.gmail.com

--
With Regards,
Amit Kapila.

#3Peter Smith
Peter Smith
smithpb2250@gmail.com
In reply to: Mark Dilger (#1)
Re: Optionally automatically disable logical replication subscriptions on error

On Fri, Jun 18, 2021 at 6:18 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

Hackers,

Logical replication apply workers for a subscription can easily get stuck in an infinite loop of attempting to apply a change, triggering an error (such as a constraint violation), exiting with an error written to the subscription worker log, and restarting.

As things currently stand, only superusers can create subscriptions. Ongoing work to delegate superuser tasks to non-superusers creates the potential for even more errors to be triggered, specifically, errors where the apply worker does not have permission to make changes to the target table.

The attached patch makes it possible to create a subscription using a new subscription_parameter, "disable_on_error", such that rather than going into an infinite loop, the apply worker will catch errors and automatically disable the subscription, breaking the loop. The new parameter defaults to false. When false, the PG_TRY/PG_CATCH overhead is avoided, so for subscriptions which do not use the feature, there shouldn't be any change. Users can manually clear the error after fixing the underlying issue with an ALTER SUBSCRIPTION .. ENABLE command.

In addition to helping on production systems, this makes writing TAP tests involving error conditions simpler. I originally ran into the motivation to write this patch when frustrated that TAP tests needed to parse the apply worker log file to determine whether permission failures were occurring and what they were. It was also obnoxiously easy to have a test get stuck waiting for a permanently stuck subscription to catch up. This helps with both issues.

I don't think this is quite ready for commit, but I'd like feedback if folks like this idea or want to suggest design changes.

I tried your patch.

It applied OK (albeit with whitespace warnings).

The code build and TAP tests are all OK.

Below are a few comments and observations.

COMMENTS
========

(1) PG Docs catalogs.sgml

Documented new column "suberrmsg" but did not document the other new
columns ("disable_on_error", "disabled_by_error")?

------

(2) New column "disabled_by_error".

I wondered if there was actually any need for this column. Isn't the
same information conveyed by just having "subenabled" = false, at the
same time as a non-empty "suberrmsg"? This would remove any confusion
from having 2 booleans which both indicate disabled.

------

(3) New columns "disabled_by_error", "disable_on_error".

All other columns of the pg_subscription have a "sub" prefix.

------

(4) errhint member used?

@@ -91,12 +100,16 @@ typedef struct Subscription
  char    *name; /* Name of the subscription */
  Oid owner; /* Oid of the subscription owner */
  bool enabled; /* Indicates if the subscription is enabled */
+ bool disable_on_error; /* Whether errors automatically disable */
+ bool disabled_by_error; /* Whether an error has disabled */
  bool binary; /* Indicates if the subscription wants data in
  * binary format */
  bool stream; /* Allow streaming in-progress transactions. */
  char    *conninfo; /* Connection string to the publisher */
  char    *slotname; /* Name of the replication slot */
  char    *synccommit; /* Synchronous commit setting for worker */
+ char    *errmsg; /* Message from error which disabled */
+ char    *errhint; /* Hint from error which disabled */
  List    *publications; /* List of publication names to subscribe to */
 } Subscription;

I did not find any code using that newly added member "errhint".

------

(5) dump.c

i. No mention of new columns "disable_on_error" and
"disabled_by_error". Is that right?

ii. Shouldn't the code for the "suberrmsg" be qualified with some PG
version number checks?

------

(6) Patch handles errors only from the Apply worker.

Tablesync can give similar errors (e.g. PK violation during DATASYNC
phase) which will trigger re-launch forever regardless of the setting
of "disable_on_error".
(confirmed by observations below)

------

(7) TAP test code

+$node_subscriber->init(allows_streaming => 'logical');

AFAIK that "logical" configuration is not necessary for the subscriber side. So,

$node_subscriber->init();

////////////

Some Experiments/Observations
==============================

In general, I found this functionality is useful and it works as
advertised by your patch comment.

======

Test: Display pg_subscription with the new columns
Observation: As expected. But some new colnames are not prefixed like
their peers.

test_sub=# \pset x
Expanded display is on.
test_sub=# select * from pg_subscription;
-[ RECORD 1 ]-----+--------------------------------------------------------
oid | 16394
subdbid | 16384
subname | tap_sub
subowner | 10
subenabled | t
disable_on_error | t
disabled_by_error | f
subbinary | f
substream | f
subconninfo | host=localhost dbname=test_pub application_name=tap_sub
subslotname | tap_sub
subsynccommit | off
suberrmsg |
subpublications | {tap_pub}

======

Test: Cause a PK violation during normal Apply replication (when
"disable_on_error=true")
Observation: Apply worker stops. Subscription is disabled. Error
message is in the catalog.

2021-06-18 15:12:45.905 AEST [25904] LOG: edata is true for
subscription 'tap_sub': message = "duplicate key value violates unique
constraint "test_tab_pkey"", hint = "<NONE>"
2021-06-18 15:12:45.905 AEST [25904] LOG: logical replication apply
worker for subscription "tap_sub" will stop because the subscription
was disabled due to error
2021-06-18 15:12:45.905 AEST [25904] ERROR: duplicate key value
violates unique constraint "test_tab_pkey"
2021-06-18 15:12:45.905 AEST [25904] DETAIL: Key (a)=(1) already exists.
2021-06-18 15:12:45.908 AEST [19924] LOG: background worker "logical
replication worker" (PID 25904) exited with exit code 1

test_sub=# select * from pg_subscription;
-[ RECORD 1 ]-----+---------------------------------------------------------------
oid | 16394
subdbid | 16384
subname | tap_sub
subowner | 10
subenabled | f
disable_on_error | t
disabled_by_error | t
subbinary | f
substream | f
subconninfo | host=localhost dbname=test_pub application_name=tap_sub
subslotname | tap_sub
subsynccommit | off
suberrmsg | duplicate key value violates unique constraint
"test_tab_pkey"
subpublications | {tap_pub}

======

Test: Try to enable subscription (without fixing the PK violation problem).
Observation: OK. It just stops again.

test_sub=# alter subscription tap_sub enable;
ALTER SUBSCRIPTION
test_sub=# 2021-06-18 15:17:18.067 AEST [10228] LOG: logical
replication apply worker for subscription "tap_sub" has started
2021-06-18 15:17:18.078 AEST [10228] LOG: edata is true for
subscription 'tap_sub': message = "duplicate key value violates unique
constraint "test_tab_pkey"", hint = "<NONE>"
2021-06-18 15:17:18.078 AEST [10228] LOG: logical replication apply
worker for subscription "tap_sub" will stop because the subscription
was disabled due to error
2021-06-18 15:17:18.078 AEST [10228] ERROR: duplicate key value
violates unique constraint "test_tab_pkey"
2021-06-18 15:17:18.078 AEST [10228] DETAIL: Key (a)=(1) already exists.
2021-06-18 15:17:18.079 AEST [19924] LOG: background worker "logical
replication worker" (PID 10228) exited with exit code 1

======

Test: Manually disable the subscription (which had previously already
been disabled due to error)
Observation: OK. The suberrmsg gets reset to an empty string.

alter subscription tap_sub disable;

=====

Test: Turn off the disable_on_error
Observation: As expected, now the Apply worker goes into re-launch
forever loop every time it hits PK violation

test_sub=# alter subscription tap_sub set (disable_on_error=false);
ALTER SUBSCRIPTION

...

======

Test: Cause a PK violation in the Tablesync copy (DATASYNC) phase.
(when disable_on_error = true)
Observation: This patch changes nothing for this case. The Tablesync
worker re-launches in a forever loop, the same as current functionality.

test_sub=# CREATE SUBSCRIPTION tap_sub CONNECTION 'host=localhost
dbname=test_pub application_name=tap_sub' PUBLICATION tap_pub WITH
(disable_on_error=false);
NOTICE: created replication slot "tap_sub" on publisher
CREATE SUBSCRIPTION
test_sub=# 2021-06-18 15:38:19.547 AEST [18205] LOG: logical
replication apply worker for subscription "tap_sub" has started
2021-06-18 15:38:19.557 AEST [18207] LOG: logical replication table
synchronization worker for subscription "tap_sub", table "test_tab"
has started
2021-06-18 15:38:19.610 AEST [18207] ERROR: duplicate key value
violates unique constraint "test_tab_pkey"
2021-06-18 15:38:19.610 AEST [18207] DETAIL: Key (a)=(1) already exists.
2021-06-18 15:38:19.610 AEST [18207] CONTEXT: COPY test_tab, line 1
2021-06-18 15:38:19.611 AEST [19924] LOG: background worker "logical
replication worker" (PID 18207) exited with exit code 1
2021-06-18 15:38:24.634 AEST [18369] LOG: logical replication table
synchronization worker for subscription "tap_sub", table "test_tab"
has started
2021-06-18 15:38:24.689 AEST [18369] ERROR: duplicate key value
violates unique constraint "test_tab_pkey"
2021-06-18 15:38:24.689 AEST [18369] DETAIL: Key (a)=(1) already exists.
2021-06-18 15:38:24.689 AEST [18369] CONTEXT: COPY test_tab, line 1
2021-06-18 15:38:24.690 AEST [19924] LOG: background worker "logical
replication worker" (PID 18369) exited with exit code 1
2021-06-18 15:38:29.701 AEST [18521] LOG: logical replication table
synchronization worker for subscription "tap_sub", table "test_tab"
has started
2021-06-18 15:38:29.765 AEST [18521] ERROR: duplicate key value
violates unique constraint "test_tab_pkey"
2021-06-18 15:38:29.765 AEST [18521] DETAIL: Key (a)=(1) already exists.
2021-06-18 15:38:29.765 AEST [18521] CONTEXT: COPY test_tab, line 1
2021-06-18 15:38:29.766 AEST [19924] LOG: background worker "logical
replication worker" (PID 18521) exited with exit code 1
etc...

-[ RECORD 1 ]-----+--------------------------------------------------------
oid | 16399
subdbid | 16384
subname | tap_sub
subowner | 10
subenabled | t
disable_on_error | f
disabled_by_error | f
subbinary | f
substream | f
subconninfo | host=localhost dbname=test_pub application_name=tap_sub
subslotname | tap_sub
subsynccommit | off
suberrmsg |
subpublications | {tap_pub}

------
Kind Regards,
Peter Smith.
Fujitsu Australia

#4Mark Dilger
Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Amit Kapila (#2)
Re: Optionally automatically disable logical replication subscriptions on error

On Jun 17, 2021, at 9:47 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

(a) The patch
seems to be assuming that the error can happen only by the apply worker
but I think the constraint violation can happen via one of the table
sync workers as well

You are right. Peter mentioned the same thing, and it is clearly so. I am working to repair this fault in v2 of the patch.

(b) What happens if the error happens when you
are updating the error information in the catalog table.

I think that is an entirely different kind of error. The patch attempts to catch errors caused by the user, not by core functionality of the system failing. If there is a fault that prevents the catalogs from being updated, it is unclear what the patch can do about that.

I think
instead of seeing the actual apply time error, the user might see some
other for which it won't be clear what is an appropriate action.

Good point.

Before trying to do much of anything with the caught error, the v2 patch logs the error. If the subsequent efforts to disable the subscription fail, at least the logs should contain the initial failure message. The v1 patch emitted a log message much further down, and really just intended for debugging the patch itself, with many opportunities for something else to throw before the log is written.

We are also discussing another action like skipping the apply of the
transaction on an error [1]. I think it is better to evaluate both the
proposals as one seems to be an extension of another.

Thanks for the link.

I think they are two separate options. For some users and data patterns, subscriber-side skipping of specific problematic commits will be fine. For other usage patterns, skipping earlier commits will result in more and more data integrity problems (foreign key references, etc.) such that the failures will snowball with skipping becoming the norm rather than the exception. Users with those usage patterns would likely prefer the subscription to automatically be disabled until manual intervention can clean up the problem.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#5Mark Dilger
Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Peter Smith (#3)
1 attachment(s)
Re: Optionally automatically disable logical replication subscriptions on error

On Jun 17, 2021, at 11:34 PM, Peter Smith <smithpb2250@gmail.com> wrote:

I tried your patch.

Thanks for the quick and thorough review!

(2) New column "disabled_by_error".

I wondered if there was actually any need for this column. Isn't the
same information conveyed by just having "subenabled" = false, at the
same time as a non-empty "suberrmsg"? This would remove any confusion
from having 2 booleans which both indicate disabled.

Yeah, I wondered about that before posting v1. I removed the disabled_by_error field for v2.

(3) New columns "disabled_by_error", "disable_on_error".

All other columns of the pg_subscription have a "sub" prefix.

I don't feel strongly about this. How about "subdisableonerr"? I used that in v2.

I did not find any code using that newly added member "errhint".

Thanks for catching that. I had tried to remove all references to "errhint" before posting v1. The original idea was that both the message and hint of the error would be kept, but in testing I found the hint field was typically empty, so I removed it. Sorry that I left one mention of it lying around.

(5) dump.c

I didn't bother getting pg_dump working before posting v1, and I still have not done so, as I mainly want to solicit feedback on whether the basic direction I am going will work for the community.

(6) Patch handles errors only from the Apply worker.

Tablesync can give similar errors (e.g. PK violation during DATASYNC
phase) which will trigger re-launch forever regardless of the setting
of "disable_on_error".
(confirmed by observations below)

Yes, this is a good point, and also mentioned by Amit. I have fixed it in v2 and adjusted the regression test to trigger an automatic disabling for initial table sync as well as for change replication.

2021-06-18 15:12:45.905 AEST [25904] LOG: edata is true for
subscription 'tap_sub': message = "duplicate key value violates unique
constraint "test_tab_pkey"", hint = "<NONE>"

You didn't call this out, but FYI, I don't intend to leave this particular log message in the patch. It was for development only. I have removed it for v2 and have added a different log message much sooner after catching the error, to avoid squashing the error in case some other action fails.

The regression test shows this, if you open tmp_check/log/022_disable_on_error_subscriber.log:

2021-06-18 16:25:20.138 PDT [56926] LOG: logical replication subscription "s1" will be disabled due to error: duplicate key value violates unique constraint "s1_tbl_unique"
2021-06-18 16:25:20.139 PDT [56926] ERROR: duplicate key value violates unique constraint "s1_tbl_unique"
2021-06-18 16:25:20.139 PDT [56926] DETAIL: Key (i)=(1) already exists.
2021-06-18 16:25:20.139 PDT [56926] CONTEXT: COPY tbl, line 2

The first line logs the error prior to attempting to disable the subscription, and the next three lines are due to rethrowing the error after committing the successful disabling of the subscription. If the attempt to disable the subscription itself throws, these additional three lines won't show up, but the first one should. Amit mentioned this upthread. Do you think this will be ok, or would you like to also have a suberrdetail field so that the detail doesn't get lost? I haven't added such an extra field, and am inclined to think it would be excessive, but maybe others feel differently?
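
The ordering described above (log first, then try to disable, then rethrow) can be sketched as a toy C model. Nothing here comes from the patch itself: `ToyLog`, `toy_log` and `toy_handle_apply_error` are hypothetical names, the in-memory log is a stand-in for the server log, and the `disable_fails` flag is an assumed way to simulate the catalog update itself throwing.

```c
#include <stdbool.h>
#include <stdio.h>

#define TOY_LOG_MAX  8
#define TOY_LINE_LEN 256

/* Hypothetical in-memory stand-in for the server log. */
typedef struct ToyLog
{
    char lines[TOY_LOG_MAX][TOY_LINE_LEN];
    int  count;
} ToyLog;

/* Append "LEVEL: prefix + msg" to the toy log, dropping it if the log is full. */
void
toy_log(ToyLog *tl, const char *level, const char *prefix, const char *msg)
{
    if (tl->count < TOY_LOG_MAX)
    {
        snprintf(tl->lines[tl->count], TOY_LINE_LEN, "%s: %s%s",
                 level, prefix, msg);
        tl->count++;
    }
}

/*
 * Handle an error caught in a worker.  Returns true if the subscription
 * ended up disabled, false if the attempt to disable it failed.
 */
bool
toy_handle_apply_error(ToyLog *tl, bool *subenabled,
                       const char *errmsg, bool disable_fails)
{
    /* Step 1: record the original error before doing anything that can fail. */
    toy_log(tl, "LOG", "subscription will be disabled due to error: ", errmsg);

    /* Step 2: try to disable; on failure the original error is already logged. */
    if (disable_fails)
        return false;
    *subenabled = false;

    /* Step 3: "rethrow" the original error after the disable has committed. */
    toy_log(tl, "ERROR", "", errmsg);
    return true;
}
```

In the failing case only the first line reaches the log, which is the point of logging early: the original message survives even when the later lines never appear.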

======

Test: Cause a PK violation in the Tablesync copy (DATASYNC) phase.
(when disable_on_error = true)
Observation: This patch changes nothing for this case. The Tablesync
worker re-launches in a forever loop, the same as current functionality.

In v2, tablesync copy errors should also be caught. The test has been extended to cover this also.

Attachments:

v2-0001-Optionally-disabling-subscriptions-on-error.patch (application/octet-stream)
#6Amit Kapila
Amit Kapila
amit.kapila16@gmail.com
In reply to: Mark Dilger (#4)
Re: Optionally automatically disable logical replication subscriptions on error

On Sat, Jun 19, 2021 at 1:06 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

On Jun 17, 2021, at 9:47 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

We are also discussing another action like skipping the apply of the
transaction on an error [1]. I think it is better to evaluate both the
proposals as one seems to be an extension of another.

Thanks for the link.

I think they are two separate options.

Right, but there are things that could be common from the design
perspective. For example, why is it mandatory to update this conflict
( error) information in the system catalog instead of displaying it
via some stats view? Also, why not also log the xid of the failed
transaction?

--
With Regards,
Amit Kapila.

#7Mark Dilger
Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Amit Kapila (#6)
Re: Optionally automatically disable logical replication subscriptions on error

On Jun 19, 2021, at 3:17 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Right, but there are things that could be common from the design
perspective.

I went to reconcile my patch with that from [1] only to discover there is no patch on that thread. Is there one in progress that I can see?

I don't mind trying to reconcile this patch with what you're discussing in [1], but I am a bit skeptical about [1] becoming a reality and I don't want to entirely hitch this patch to that effort. This can be committed with or without any solution to the idea in [1]. The original motivation for this patch was that the TAP tests don't have a great way to deal with a subscription getting into a fail-retry infinite loop, which makes it harder for me to make progress on [2]. That doesn't absolve me of the responsibility of making this patch a good one, but it does motivate me to keep it simple.

For example, why is it mandatory to update this conflict
( error) information in the system catalog instead of displaying it
via some stats view?

The catalog must be updated to disable the subscription, so placing the error information in the same row doesn't require any independent touching of the catalogs. Likewise, the catalog must be updated to re-enable the subscription, so clearing the error from that same row doesn't require any independent touching of the catalogs.

The error information does not *need* to be included in the catalog, but placing the information in any location that won't survive server restart leaves the user no information about why the subscription got disabled after a restart (or crash + restart) happens.

Furthermore, since v2 removed the "disabled_by_error" field in favor of just using subenabled + suberrmsg to determine if the subscription was automatically disabled, not having the information in the catalog would make it ambiguous whether the subscription was manually or automatically disabled.
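
The v2 rule for telling a manual disable from an automatic one can be written as a small predicate. This is only a sketch of the reasoning above, not code from the patch: the enum and function names are hypothetical, and it assumes (as observed upthread with v1) that a manual disable leaves suberrmsg empty.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical classification of a subscription's state. */
typedef enum ToySubState
{
    TOY_SUB_ENABLED,
    TOY_SUB_DISABLED_MANUALLY,
    TOY_SUB_DISABLED_BY_ERROR
} ToySubState;

/*
 * Derive the state from the two catalog fields: a disabled subscription
 * with a non-empty error message was disabled automatically; a disabled
 * subscription with no message was disabled by hand.
 */
ToySubState
toy_sub_state(bool subenabled, const char *suberrmsg)
{
    if (subenabled)
        return TOY_SUB_ENABLED;
    if (suberrmsg != NULL && suberrmsg[0] != '\0')
        return TOY_SUB_DISABLED_BY_ERROR;
    return TOY_SUB_DISABLED_MANUALLY;
}
```

If the message lived only in a stats view that does not survive a restart, the second branch could no longer be taken after a restart, and every disabled subscription would look manually disabled.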

Also, why not also log the xid of the failed
transaction?

We could also do that. Reading [1], it seems you are overly focused on user-facing xids. The errdetail in the examples I've been using for testing, and the one mentioned in [1], contain information about the conflicting data. I think users are more likely to understand that a particular primary key value cannot be replicated because it is not unique than to understand that a particular xid cannot be replicated. (Likewise for permissions errors.) For example:

2021-06-18 16:25:20.139 PDT [56926] ERROR: duplicate key value violates unique constraint "s1_tbl_unique"
2021-06-18 16:25:20.139 PDT [56926] DETAIL: Key (i)=(1) already exists.
2021-06-18 16:25:20.139 PDT [56926] CONTEXT: COPY tbl, line 2

This tells the user what they need to clean up before they can continue. Telling them which xid tried to apply the change, but not the change itself or the conflict itself, seems rather unhelpful. So at best, the xid is an additional piece of information. I'd rather have both the ERROR and DETAIL fields above and not the xid than have the xid and lack one of those two fields. Even so, I have not yet included the DETAIL field because I didn't want to bloat the catalog.

For the problem in [1], having the xid is more important than it is in my patch, because the user is expected in [1] to use the xid as a handle. But that seems like an odd interface to me. Imagine that a transaction on the publisher side inserted a batch of data, and only a subset of that data conflicts on the subscriber side. What advantage is there in skipping the entire transaction? Wouldn't the user rather skip just the problematic rows? I understand that on the subscriber side it is difficult to do so, but if you are going to implement this sort of thing, it makes more sense to allow the user to filter out data that is problematic rather than filtering out xids that are problematic, and the filter shouldn't just be an in-or-out filter, but rather a mapping function that can redirect the data someplace else or rewrite it before inserting or change the pre-existing conflicting data prior to applying the problematic data or whatever. That's a huge effort, of course, but if the idea in [1] goes in that direction, I don't want my patch to have already added an xid field which ultimately nobody wants.

[1]: /messages/by-id/CAD21AoDeScrsHhLyEPYqN3sydg6PxAPVBboK=30xJfUVihNZDA@mail.gmail.com

[2]: /messages/by-id/915B995D-1D79-4E0A-BD8D-3B267925FCE9@enterprisedb.com

Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#8Mark Dilger
Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Mark Dilger (#7)
Re: Optionally automatically disable logical replication subscriptions on error

On Jun 19, 2021, at 7:44 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:

Wouldn't the user rather skip just the problematic rows? I understand that on the subscriber side it is difficult to do so, but if you are going to implement this sort of thing, it makes more sense to allow the user to filter out data that is problematic rather than filtering out xids that are problematic, and the filter shouldn't just be an in-or-out filter, but rather a mapping function that can redirect the data someplace else or rewrite it before inserting or change the pre-existing conflicting data prior to applying the problematic data or whatever.

Thinking about this some more, it seems my patch already sets the stage for this sort of thing.

We could extend the concept of triggers to something like ErrorTriggers that could be associated with subscriptions. I already have the code catching errors for subscriptions where disable_on_error is true. We could use that same code path for subscriptions that have one or more BEFORE or AFTER ErrorTriggers defined. We could pass the trigger all the error context information along with the row and subscription information, and allow the trigger to either modify the data being replicated or make modifications to the table being changed. I think having support for both BEFORE and AFTER would be important, as a common design pattern might be to move aside the conflicting rows in the BEFORE trigger, then reconcile and merge them back into the table in the AFTER trigger. If the xid still cannot be replicated after one attempt using the triggers, the second attempt would disable the subscription instead.
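
As a rough illustration of that control flow only, here is a toy C sketch. ErrorTriggers do not exist; every name below (`ToyTable`, `ToyErrorTrigger`, `toy_apply_with_triggers`, and so on) is hypothetical, a "table" is just a small set of integer keys, and "moving a row aside" is reduced to deleting the conflicting key.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_TABLE_MAX 16

/* Hypothetical table: a set of unique integer keys. */
typedef struct ToyTable
{
    int keys[TOY_TABLE_MAX];
    int nkeys;
} ToyTable;

/* Hypothetical error trigger: receives the table and the conflicting key. */
typedef void (*ToyErrorTrigger) (ToyTable *table, int conflicting_key);

/* Insert a key; fails on a duplicate (the "PK violation") or a full table. */
bool
toy_insert(ToyTable *t, int key)
{
    for (int i = 0; i < t->nkeys; i++)
        if (t->keys[i] == key)
            return false;
    if (t->nkeys >= TOY_TABLE_MAX)
        return false;
    t->keys[t->nkeys++] = key;
    return true;
}

/* Example BEFORE trigger: move the conflicting row aside (here: delete it). */
void
toy_move_aside(ToyTable *t, int key)
{
    for (int i = 0; i < t->nkeys; i++)
    {
        if (t->keys[i] == key)
        {
            t->keys[i] = t->keys[t->nkeys - 1];
            t->nkeys--;
            return;
        }
    }
}

/*
 * Apply one change.  On failure, run the BEFORE trigger and retry once; if
 * the retry succeeds, run the AFTER trigger.  If the change still cannot be
 * applied, disable the subscription instead of looping.
 */
bool
toy_apply_with_triggers(ToyTable *t, int key,
                        ToyErrorTrigger before, ToyErrorTrigger after,
                        bool *subenabled)
{
    if (toy_insert(t, key))
        return true;

    if (before != NULL)
    {
        before(t, key);
        if (toy_insert(t, key))
        {
            if (after != NULL)
                after(t, key);
            return true;
        }
    }

    *subenabled = false;
    return false;
}
```

Without any trigger the sketch behaves like disable_on_error; with a BEFORE trigger that clears the conflict, the retry succeeds and the subscription stays enabled.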

There are a lot of details to consider, but to my mind this idea is much more user friendly than the idea that users should muck about with xids for arbitrarily many conflicting transactions.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#9Masahiko Sawada
Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Mark Dilger (#7)
Re: Optionally automatically disable logical replication subscriptions on error

On Sat, Jun 19, 2021 at 11:44 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

On Jun 19, 2021, at 3:17 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Right, but there are things that could be common from the design
perspective.

I went to reconcile my patch with that from [1] only to discover there is no patch on that thread. Is there one in progress that I can see?

I will submit the patch.

I don't mind trying to reconcile this patch with what you're discussing in [1], but I am a bit skeptical about [1] becoming a reality and I don't want to entirely hitch this patch to that effort. This can be committed with or without any solution to the idea in [1]. The original motivation for this patch was that the TAP tests don't have a great way to deal with a subscription getting into a fail-retry infinite loop, which makes it harder for me to make progress on [2]. That doesn't absolve me of the responsibility of making this patch a good one, but it does motivate me to keep it simple.

There was a discussion that the skipping transaction patch would also
need to have a feature that tells users the details of the last
failed transaction such as its XID, timestamp, action etc. In that
sense, those two patches might need common infrastructure with which
the apply workers leave the error details somewhere so that the users
can see them.

For example, why is it mandatory to update this conflict
( error) information in the system catalog instead of displaying it
via some stats view?

The catalog must be updated to disable the subscription, so placing the error information in the same row doesn't require any independent touching of the catalogs. Likewise, the catalog must be updated to re-enable the subscription, so clearing the error from that same row doesn't require any independent touching of the catalogs.

The error information does not *need* to be included in the catalog, but placing the information in any location that won't survive server restart leaves the user no information about why the subscription got disabled after a restart (or crash + restart) happens.

Furthermore, since v2 removed the "disabled_by_error" field in favor of just using subenabled + suberrmsg to determine if the subscription was automatically disabled, not having the information in the catalog would make it ambiguous whether the subscription was manually or automatically disabled.
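
To make the subenabled + suberrmsg argument concrete, here is a minimal Python sketch. It is a toy model, not PostgreSQL code: the class and function names are invented for illustration, and only the two catalog fields mentioned above are modelled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SubscriptionRow:
    """Toy stand-in for a pg_subscription row; only two fields are modelled."""
    subenabled: bool = True
    suberrmsg: Optional[str] = None  # set only when the worker disables the subscription


def disable_on_error(row: SubscriptionRow, message: str) -> None:
    # One catalog update both disables the subscription and records why.
    row.subenabled = False
    row.suberrmsg = message


def enable(row: SubscriptionRow) -> None:
    # Re-enabling clears the stored error in the same row.
    row.subenabled = True
    row.suberrmsg = None


def status(row: SubscriptionRow) -> str:
    if row.subenabled:
        return "enabled"
    return "auto-disabled" if row.suberrmsg is not None else "manually disabled"
```

Without the message in the row, the last two states in `status` could not be told apart, which is the ambiguity described above.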

Is it really useful to write only error message to the system catalog?
Even if we see the error message like 'duplicate key value violates
unique constraint "test_tab_pkey"' on the system catalog, we will end
up needing to check the server log for details to properly resolve the
conflict. If the user wants to know whether the subscription is
disabled manually or automatically, the error message on the system
catalog might not necessarily be necessary.

For the problem in [1], having the xid is more important than it is in my patch, because the user is expected in [1] to use the xid as a handle. But that seems like an odd interface to me. Imagine that a transaction on the publisher side inserted a batch of data, and only a subset of that data conflicts on the subscriber side. What advantage is there in skipping the entire transaction? Wouldn't the user rather skip just the problematic rows? I understand that on the subscriber side it is difficult to do so, but if you are going to implement this sort of thing, it makes more sense to allow the user to filter out data that is problematic rather than filtering out xids that are problematic, and the filter shouldn't just be an in-or-out filter, but rather a mapping function that can redirect the data someplace else or rewrite it before inserting or change the pre-existing conflicting data prior to applying the problematic data or whatever. That's a huge effort, of course, but if the idea in [1] goes in that direction, I don't want my patch to have already added an xid field which ultimately nobody wants.

The feature discussed in that thread is meant to be a repair tool for
the subscription in emergency cases when something that should not
have happened happened. I guess that resolving row (or column) level
conflict should be done in another way, for example, by defining
policies for each type of conflict.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#10Mark Dilger
Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Masahiko Sawada (#9)
Re: Optionally automatically disable logical replication subscriptions on error

On Jun 20, 2021, at 7:17 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I will submit the patch.

Great, thanks!

There was a discussion that the skipping transaction patch would also
need to have a feature that tells users the details of the last
failure transaction such as its XID, timestamp, action etc. In that
sense, those two patches might need the common infrastructure that the
apply workers leave the error details somewhere so that the users can
see it.

Right. Subscription on error triggers would need that, too, if we wrote them.

Is it really useful to write only error message to the system catalog?
Even if we see the error message like 'duplicate key value violates
unique constraint "test_tab_pkey"' on the system catalog, we will end
up needing to check the server log for details to properly resolve the
conflict. If the user wants to know whether the subscription is
disabled manually or automatically, the error message on the system
catalog might not necessarily be necessary.

We can put more information in there. I don't feel strongly about it. I'll wait for your patch to see what infrastructure you need.

The feature discussed in that thread is meant to be a repair tool for
the subscription in emergency cases when something that should not
have happened happened. I guess that resolving row (or column) level
conflict should be done in another way, for example, by defining
policies for each type of conflict.

I understand that is the idea, but I'm having trouble believing it will work that way in practice. If somebody has a subscription that has gone awry, what reason do we have to believe there will only be one transaction that will need to be manually purged? It seems just as likely that there would be a million transactions that need to be purged, and creating an interface for users to manually review them and keep or discard on a case by case basis seems unworkable. Sure, you might have specific cases where the number of transactions to purge is small, but I don't like designing the feature around that assumption.

All the same, I'm looking forward to seeing your patch!


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11Amit Kapila
Amit Kapila
amit.kapila16@gmail.com
In reply to: Mark Dilger (#7)
Re: Optionally automatically disable logical replication subscriptions on error

On Sat, Jun 19, 2021 at 8:14 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

On Jun 19, 2021, at 3:17 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Also, why not also log the xid of the failed
transaction?

We could also do that. Reading [1], it seems you are overly focused on user-facing xids. The errdetail in the examples I've been using for testing, and the one mentioned in [1], contain information about the conflicting data. I think users are more likely to understand that a particular primary key value cannot be replicated because it is not unique than to understand that a particular xid cannot be replicated. (Likewise for permissions errors.) For example:

2021-06-18 16:25:20.139 PDT [56926] ERROR: duplicate key value violates unique constraint "s1_tbl_unique"
2021-06-18 16:25:20.139 PDT [56926] DETAIL: Key (i)=(1) already exists.
2021-06-18 16:25:20.139 PDT [56926] CONTEXT: COPY tbl, line 2

This tells the user what they need to clean up before they can continue. Telling them which xid tried to apply the change, but not the change itself or the conflict itself, seems rather unhelpful. So at best, the xid is an additional piece of information. I'd rather have both the ERROR and DETAIL fields above and not the xid than have the xid and lack one of those two fields. Even so, I have not yet included the DETAIL field because I didn't want to bloat the catalog.

I never said that we don't need the error information. I think we need
xid along with other things.

For the problem in [1], having the xid is more important than it is in my patch, because the user is expected in [1] to use the xid as a handle. But that seems like an odd interface to me. Imagine that a transaction on the publisher side inserted a batch of data, and only a subset of that data conflicts on the subscriber side. What advantage is there in skipping the entire transaction? Wouldn't the user rather skip just the problematic rows?

I think skipping some changes but not others can make the final
transaction data inconsistent. Say, we have a case where, in a
transaction after insert, there is an update or delete on the same
row, then we might silently skip such updates/deletes unless the same
row is already present in the subscriber. I think skipping the entire
transaction based on user instruction would be safer than skipping
some changes that lead to an error.
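
A small Python sketch of this concern follows. It is a toy model under stated assumptions: the table is a dict of id to email with a unique email column, an UPDATE of a missing row is silently ignored (as described above), and all names are made up for the example.

```python
class UniqueViolation(Exception):
    pass


def apply_change(table, change):
    """Apply one change to a {id: email} table; return True if it took effect."""
    kind, row_id, email = change
    if kind == "insert":
        if row_id in table or email in table.values():
            raise UniqueViolation(f"({row_id}, {email!r}) conflicts")
        table[row_id] = email
        return True
    if kind == "update":
        if row_id not in table:
            return False  # assumption: a missing row is skipped without an error
        table[row_id] = email
        return True
    raise ValueError(f"unknown change kind: {kind}")


def apply_skipping_failed_changes(table, txn):
    """Row-level skipping: drop only the changes that raise an error."""
    outcome = []
    for change in txn:
        try:
            took_effect = apply_change(table, change)
            outcome.append("applied" if took_effect else "silently skipped")
        except UniqueViolation:
            outcome.append("skipped on error")
    return outcome


# The publisher inserts a row and then updates it in the same transaction.
txn = [("insert", 1, "a@x"), ("update", 1, "b@x")]
subscriber = {2: "a@x"}  # already holds a row with the conflicting email
outcome = apply_skipping_failed_changes(subscriber, txn)
```

Here the INSERT is skipped because of the conflict, and the UPDATE then finds no row and is dropped without any error, so the subscriber never learns about row 1 even though its final value would not have conflicted.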

--
With Regards,
Amit Kapila.

#12Amit Kapila
Amit Kapila
amit.kapila16@gmail.com
In reply to: Mark Dilger (#10)
Re: Optionally automatically disable logical replication subscriptions on error

On Mon, Jun 21, 2021 at 7:56 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

On Jun 20, 2021, at 7:17 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I will submit the patch.

Great, thanks!

There was a discussion that the skipping transaction patch would also
need to have a feature that tells users the details of the last
failure transaction such as its XID, timestamp, action etc. In that
sense, those two patches might need the common infrastructure that the
apply workers leave the error details somewhere so that the users can
see it.

Right. Subscription on error triggers would need that, too, if we wrote them.

Is it really useful to write only error message to the system catalog?
Even if we see the error message like 'duplicate key value violates
unique constraint "test_tab_pkey"' on the system catalog, we will end
up needing to check the server log for details to properly resolve the
conflict. If the user wants to know whether the subscription is
disabled manually or automatically, the error message on the system
catalog might not necessarily be necessary.

I think the two key points are (a) to define exactly what all
information is required to be logged on error, (b) where do we want to
store the information based on requirements. I see that for (b) Mark
is inclined to use the existing catalog table. I feel that is worth
considering but not sure if that is the best way to deal with it. For
example, if we store that information in the catalog, we might need to
consider storing it both in pg_subscription and pg_subscription_rel,
otherwise, we might overwrite the errors as I think what is happening
in the currently proposed patch. The other possibilities could be to
define a new catalog table to capture the error information or log the
required information via stats collector and then the user can see
that info via some stats view.

We can put more information in there. I don't feel strongly about it. I'll wait for your patch to see what infrastructure you need.

The feature discussed in that thread is meant to be a repair tool for
the subscription in emergency cases when something that should not
have happened happened. I guess that resolving row (or column) level
conflict should be done in another way, for example, by defining
policies for each type of conflict.

I understand that is the idea, but I'm having trouble believing it will work that way in practice. If somebody has a subscription that has gone awry, what reason do we have to believe there will only be one transaction that will need to be manually purged?

Because currently, we don't proceed after an error unless it is
resolved. Why do you think there could be multiple such transactions?

--
With Regards,
Amit Kapila.

#13Masahiko Sawada
Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#12)
Re: Optionally automatically disable logical replication subscriptions on error

On Mon, Jun 21, 2021 at 12:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jun 21, 2021 at 7:56 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

On Jun 20, 2021, at 7:17 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I will submit the patch.

Great, thanks!

There was a discussion that the skipping transaction patch would also
need to have a feature that tells users the details of the last
failure transaction such as its XID, timestamp, action etc. In that
sense, those two patches might need the common infrastructure that the
apply workers leave the error details somewhere so that the users can
see it.

Right. Subscription on error triggers would need that, too, if we wrote them.

Is it really useful to write only error message to the system catalog?
Even if we see the error message like 'duplicate key value violates
unique constraint "test_tab_pkey"' on the system catalog, we will end
up needing to check the server log for details to properly resolve the
conflict. If the user wants to know whether the subscription is
disabled manually or automatically, the error message on the system
catalog might not necessarily be necessary.

I think the two key points are (a) to define exactly what all
information is required to be logged on error,

When it comes to the patch for skipping transactions, it would
somewhat depend on how users specify transactions to skip. On the
other hand, for this patch, the minimal information would be whether
the subscription is disabled automatically by the server.

(b) where do we want to
store the information based on requirements. I see that for (b) Mark
is inclined to use the existing catalog table. I feel that is worth
considering but not sure if that is the best way to deal with it. For
example, if we store that information in the catalog, we might need to
consider storing it both in pg_subscription and pg_subscription_rel,
otherwise, we might overwrite the errors as I think what is happening
in the currently proposed patch. The other possibilities could be to
define a new catalog table to capture the error information or log the
required information via stats collector and then the user can see
that info via some stats view.

This point is also related to the point whether or not that
information needs to last after the server crash (and restart). When
it comes to the patch for skipping transactions, there was a
discussion that we don’t necessarily need it since the tools will be
used in rare cases. But for this proposed patch, I guess it would be
useful if it does. It might be worth considering doing a different way
for each patch. For example, we send the details of last failure
transaction to the stats collector while updating subenabled to
something like “automatically-disabled” instead of to just “false” (or
using another column to show the subscriber is disabled automatically
by the server).

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#14Mark Dilger
Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Amit Kapila (#12)
Re: Optionally automatically disable logical replication subscriptions on error

On Jun 20, 2021, at 8:09 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Because currently, we don't proceed after an error unless it is
resolved. Why do you think there could be multiple such transactions?

Just as one example, if the subscriber has a unique index that the publisher lacks, any number of transactions could add non-unique data that then fails to apply on the subscriber. My patch took the view that the user should figure out how to get the subscriber side consistent with the publisher side, but if you instead take the approach that problematic commits should be skipped, it would seem that arbitrarily many such transactions could be committed on the publisher side.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#15Amit Kapila
Amit Kapila
amit.kapila16@gmail.com
In reply to: Mark Dilger (#14)
Re: Optionally automatically disable logical replication subscriptions on error

On Mon, Jun 21, 2021 at 10:24 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

On Jun 20, 2021, at 8:09 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Because currently, we don't proceed after an error unless it is
resolved. Why do you think there could be multiple such transactions?

Just as one example, if the subscriber has a unique index that the publisher lacks, any number of transactions could add non-unique data that then fails to apply on the subscriber.

Then also it will fail on the first such conflict, so even without
your patch, the apply worker corresponding to the subscription won't
be able to proceed after the first error, it won't lead to multiple
failing xids. However, I see a different case where there could be
multiple failing xids and that can happen during initial table sync
where multiple workers failed due to some error. I am not sure your
patch would be able to capture all such failed transactions because
you are recording this information in pg_subscription and not in
pg_subscription_rel.

--
With Regards,
Amit Kapila.

#16Mark Dilger
Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Amit Kapila (#12)
Re: Optionally automatically disable logical replication subscriptions on error

On Jun 20, 2021, at 8:09 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

(a) to define exactly what all
information is required to be logged on error, (b) where do we want to
store the information based on requirements.

I'm not sure it has to be stored anywhere durable. I have a patch in the works to do something like:

create function foreign_key_insert_violation_before() returns conflict_trigger as $$
BEGIN
RAISE NOTICE 'elevel: %', TG_ELEVEL;
RAISE NOTICE 'sqlerrcode: %', TG_SQLERRCODE;
RAISE NOTICE 'message: %', TG_MESSAGE;
RAISE NOTICE 'detail: %', TG_DETAIL;
RAISE NOTICE 'detail_log: %', TG_DETAIL_LOG;
RAISE NOTICE 'hint: %', TG_HINT;
RAISE NOTICE 'schema: %', TG_SCHEMA_NAME;
RAISE NOTICE 'table: %', TG_TABLE_NAME;
RAISE NOTICE 'column: %', TG_COLUMN_NAME;
RAISE NOTICE 'datatype: %', TG_DATATYPE_NAME;
RAISE NOTICE 'constraint: %', TG_CONSTRAINT_NAME;

-- do something useful to prepare for retry of transaction
-- which raised a foreign key violation
END
$$ language plpgsql;

create function foreign_key_insert_violation_after() returns conflict_trigger as $$
BEGIN
-- do something useful to cleanup after retry of transaction
-- which raised a foreign key violation
END
$$ language plpgsql;

create conflict trigger regress_conflict_trigger_insert on regress_conflictsub
before foreign_key_violation
when tag in ('INSERT')
execute procedure foreign_key_insert_violation_before();

create conflict trigger regress_conflict_trigger_insert on regress_conflictsub
after foreign_key_violation
when tag in ('INSERT')
execute procedure foreign_key_insert_violation_after();

The idea is that, for subscriptions that have conflict triggers defined, the apply will be wrapped in a PG_TRY()/PG_CATCH() block. If it fails, the ErrorData will be copied in the ConflictTriggerContext, and then the transaction will be attempted again, but this time with any BEFORE and AFTER triggers applied. The triggers could then return a special result indicating whether the transaction should be permanently skipped, applied, or whatever. None of the data needs to be stored anywhere non-transient, as it just gets handed to the triggers.
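
The control flow described above can be sketched in Python as follows. This is only an illustration of the try / capture / retry-with-triggers idea: the function and class names are hypothetical, the "retry"/"skip" verdict strings are an assumption, and nothing here reflects the actual patch.

```python
from dataclasses import dataclass


@dataclass
class ErrorInfo:
    """Toy counterpart of the error data handed to the triggers."""
    message: str
    detail: str = ""


class ApplyError(Exception):
    def __init__(self, info: ErrorInfo):
        super().__init__(info.message)
        self.info = info


def apply_with_conflict_triggers(apply_txn, before, after):
    try:
        apply_txn()
        return "applied"
    except ApplyError as err:
        context = err.info  # kept in memory only, never stored durably
    if before(context) == "skip":
        return "skipped"
    apply_txn()  # second attempt; a new error propagates to the caller
    after(context)
    return "applied after retry"


# Usage: the BEFORE trigger repairs a missing parent row, then the retry succeeds.
parents, children, log = set(), [], []


def apply_txn():
    if 7 not in parents:
        raise ApplyError(ErrorInfo("foreign key violation", "parent 7 is missing"))
    children.append(7)


def before(ctx):
    log.append(ctx.message)
    parents.add(7)
    return "retry"


def after(ctx):
    log.append("after")


result = apply_with_conflict_triggers(apply_txn, before, after)
```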

I think the other patch is a subset of this functionality, as using this system to create triggers which query a table containing transactions to be skipped would be enough to get the functionality you've been discussing. But this system could also do other things, like modify data. Admittedly, this is akin to a statement level trigger and not a row level trigger, so a number of things you might want to do would be hard to do from this. But perhaps the equivalent of row level triggers could also be written?


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#17Mark Dilger
Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Amit Kapila (#15)
Re: Optionally automatically disable logical replication subscriptions on error

On Jun 20, 2021, at 10:11 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Then also it will fail on the first such conflict, so even without
your patch, the apply worker corresponding to the subscription won't
be able to proceed after the first error, it won't lead to multiple
failing xids.

I'm not sure we're talking about the same thing. I'm saying that if the user is expected to clear each error manually, there could be many such errors for them to clear. It may be true that the second error doesn't occur on the subscriber side until after the first is cleared, but that still leaves the user having to clear one after the next until arbitrarily many of them coming from the publisher side are cleared.
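
A toy Python model of that point, assuming each publisher transaction inserts a single key and that the user clears every conflict by skipping the offending transaction by hand (the function name is invented for the example):

```python
def count_manual_interventions(publisher_keys, subscriber_keys):
    """Replay single-row inserts in commit order against a unique index.

    Each conflict stops the apply worker until the user clears it, so the
    return value is the number of times the user has to step in.
    """
    interventions = 0
    for key in publisher_keys:
        if key in subscriber_keys:
            interventions += 1  # worker is stuck here until the user skips this one
            continue
        subscriber_keys.add(key)
    return interventions


# The publisher has no unique index, so duplicates commit freely there.
stream = [1, 1, 1, 2, 2, 3]
manual_skips = count_manual_interventions(stream, set())
```

Only one error is outstanding at any moment, but the user still ends up clearing three of them in this stream, and the count grows with the number of duplicates committed on the publisher.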


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#18Mark Dilger
Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Amit Kapila (#15)
Re: Optionally automatically disable logical replication subscriptions on error

On Jun 20, 2021, at 10:11 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

However, I see a different case where there could be
multiple failing xids and that can happen during initial table sync
where multiple workers failed due to some error. I am not sure your
patch would be able to capture all such failed transactions because
you are recording this information in pg_subscription and not in
pg_subscription_rel.

Right, I wasn't trying to capture everything, just enough to give the user a reasonable indication of what went wrong. My patch was designed around the idea that the user would need to figure out how to fix the subscriber side prior to re-enabling the subscription. As such, I wasn't bothered with trying to store everything, just enough to give the user a clue where to look. I don't mind if you want to store more information, and maybe that needs to be stored somewhere else. Do you believe pg_subscription_rel is a suitable location?


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#19Amit Kapila
Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#13)
Re: Optionally automatically disable logical replication subscriptions on error

On Mon, Jun 21, 2021 at 9:21 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Jun 21, 2021 at 12:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jun 21, 2021 at 7:56 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

On Jun 20, 2021, at 7:17 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I will submit the patch.

Great, thanks!

There was a discussion that the skipping transaction patch would also
need to have a feature that tells users the details of the last
failure transaction such as its XID, timestamp, action etc. In that
sense, those two patches might need the common infrastructure that the
apply workers leave the error details somewhere so that the users can
see it.

Right. Subscription on error triggers would need that, too, if we wrote them.

Is it really useful to write only error message to the system catalog?
Even if we see the error message like 'duplicate key value violates
unique constraint "test_tab_pkey"' on the system catalog, we will end
up needing to check the server log for details to properly resolve the
conflict. If the user wants to know whether the subscription is
disabled manually or automatically, the error message on the system
catalog might not necessarily be necessary.

I think the two key points are (a) to define exactly what all
information is required to be logged on error,

When it comes to the patch for skipping transactions, it would
somewhat depend on how users specify transactions to skip. On the
other hand, for this patch, the minimal information would be whether
the subscription is disabled automatically by the server.

True, but still there will be some information related to ERROR which
we wanted the user to see unless we ask them to refer to logs for
that.

(b) where do we want to
store the information based on requirements. I see that for (b) Mark
is inclined to use the existing catalog table. I feel that is worth
considering but not sure if that is the best way to deal with it. For
example, if we store that information in the catalog, we might need to
consider storing it both in pg_subscription and pg_subscription_rel,
otherwise, we might overwrite the errors as I think what is happening
in the currently proposed patch. The other possibilities could be to
define a new catalog table to capture the error information or log the
required information via stats collector and then the user can see
that info via some stats view.

This point is also related to the point whether or not that
information needs to last after the server crash (and restart). When
it comes to the patch for skipping transactions, there was a
discussion that we don’t necessarily need it since the tools will be
used in rare cases. But for this proposed patch, I guess it would be
useful if it does. It might be worth considering doing a different way
for each patch. For example, we send the details of last failure
transaction to the stats collector while updating subenabled to
something like “automatically-disabled” instead of to just “false” (or
using another column to show the subscriber is disabled automatically
by the server).

I agree that it is worth considering making subenabled a tri-state
(enabled, disabled, automatically-disabled) value instead of
just a boolean. But in this case, if the stats collector missed
updating the information, the user may have to manually update the
subscription and let the error happen again to see it.

--
With Regards,
Amit Kapila.

#20Amit Kapila
Amit Kapila
amit.kapila16@gmail.com
In reply to: Mark Dilger (#18)
Re: Optionally automatically disable logical replication subscriptions on error

On Mon, Jun 21, 2021 at 10:55 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

On Jun 20, 2021, at 10:11 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

However, I see a different case where there could be
multiple failing xids and that can happen during initial table sync
where multiple workers failed due to some error. I am not sure your
patch would be able to capture all such failed transactions because
you are recording this information in pg_subscription and not in
pg_subscription_rel.

Right, I wasn't trying to capture everything, just enough to give the user a reasonable indication of what went wrong. My patch was designed around the idea that the user would need to figure out how to fix the subscriber side prior to re-enabling the subscription. As such, I wasn't bothered with trying to store everything, just enough to give the user a clue where to look.

Okay, but the clue will be pretty random because you might end up just
logging one out of several errors.

I don't mind if you want to store more information, and maybe that needs to be stored somewhere else. Do you believe pg_subscription_rel is a suitable location?

It won't be sufficient to store information in either
pg_subscription_rel or pg_subscription. I think if we want to store
the required information in a catalog then we need to define a new
catalog (pg_subscription_conflicts or something like that) with
information corresponding to each rel in subscription (srsubid oid
(Reference to subscription), srrelid oid (Reference to relation),
<columns for error_info>). OTOH, we can choose to send the error
information to stats collector which will then be available via stat
view and update system catalog to disable the subscription but there
will be a risk that we might send info of failed transaction to stats
collector but then fail to update system catalog to disable the
subscription.
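
The difference between one error slot per subscription and one per relation can be sketched in Python. This is a toy comparison: the two functions and the sample data are invented, and only the key columns named above (srsubid, srrelid) come from the proposal.

```python
def record_in_subscription_row(errors):
    """One error slot per subscription: a later error overwrites an earlier one."""
    slots = {}
    for srsubid, srrelid, message in errors:
        slots[srsubid] = message
    return slots


def record_per_relation(errors):
    """One slot per (srsubid, srrelid), as a per-relation catalog would allow."""
    slots = {}
    for srsubid, srrelid, message in errors:
        slots[(srsubid, srrelid)] = message
    return slots


# Three tablesync workers of subscription 16400 fail on three different tables.
errors = [
    (16400, 24001, "duplicate key in t1"),
    (16400, 24002, "permission denied for t2"),
    (16400, 24003, "duplicate key in t3"),
]
single = record_in_subscription_row(errors)
per_rel = record_per_relation(errors)
```

With a single slot, which error survives depends on the order in which the workers fail; keyed by relation, all three are retained.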

--
With Regards,
Amit Kapila.

#21Amit Kapila
Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#20)
Re: Optionally automatically disable logical replication subscriptions on error

On Mon, Jun 21, 2021 at 11:19 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jun 21, 2021 at 10:55 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

I don't mind if you want to store more information, and maybe that needs to be stored somewhere else. Do you believe pg_subscription_rel is a suitable location?

It won't be sufficient to store information in either
pg_subscription_rel or pg_subscription. I think if we want to store
the required information in a catalog then we need to define a new
catalog (pg_subscription_conflicts or something like that) with
information corresponding to each rel in subscription (srsubid oid
(Reference to subscription), srrelid oid (Reference to relation),
<columns for error_info>). OTOH, we can choose to send the error
information to stats collector which will then be available via stat
view and update system catalog to disable the subscription but there
will be a risk that we might send info of failed transaction to stats
collector but then fail to update system catalog to disable the
subscription.

I think we should store the input from the user (like disable_on_error
flag or xid to skip) in the system catalog pg_subscription and send
the error information (subscription_id, rel_id, xid of failed xact,
error_code, error_message, etc.) to the stats collector which can be
used to display such information via a stat view.

The disable_on_error flag handling could be that on error it sends the
required error info to stats collector and then updates the subenabled
in pg_subscription. In rare conditions, where we are able to send the
message but couldn't update the subenabled info in pg_subscription
either due to some error or server restart, the apply worker would
again try to apply the same change and would hit the same error again
which I think should be fine because it will ultimately succeed.
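
A rough Python sketch of that two-step sequence and its retry behaviour is shown below. It is a toy model: the stats collector is a plain list, the catalog is a dict, and the `failures_before_success` knob is invented to simulate the rare failed catalog update.

```python
def handle_apply_error(error, stats, catalog, catalog_update_fails=False):
    """Step 1: report the error. Step 2: flip subenabled in the catalog."""
    stats.append(error)
    if catalog_update_fails:
        return False  # subenabled is still true, so the worker will restart
    catalog["subenabled"] = False
    return True


def run_until_disabled(error, failures_before_success):
    stats, catalog = [], {"subenabled": True}
    attempts = 0
    while catalog["subenabled"]:
        attempts += 1  # the worker re-applies the change and hits the same error
        handle_apply_error(
            error, stats, catalog,
            catalog_update_fails=attempts <= failures_before_success,
        )
    return attempts, stats, catalog


attempts, stats, catalog = run_until_disabled("duplicate key", failures_before_success=2)
```

In this run the same error is reported three times before the subscription ends up disabled, which matches the "it will ultimately succeed" reasoning above at the cost of repeated reports.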

The skip xid handling will also be somewhat similar where on an error,
we will send the error information to stats collector which will be
displayed via stats view. Then the user is expected to ask for skip
xid (Alter Subscription ... SKIP <xid_value>) based on information
displayed via stat view. Now, the apply worker can skip changes from
such a transaction, and then during processing of commit record of the
skipped transaction, it should update xid to invalid value, so that
next time that shouldn't be used. I think it is important to update
xid to an invalid value as part of the skipped transaction because
otherwise, after the restart, we won't be able to decide whether we
still want to skip the xid stored for a subscription.
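
The reset step can be illustrated with a short Python sketch. It is a toy model only: `subskipxid` is a made-up field name, transactions are (xid, changes) pairs, and 0 stands in for the invalid transaction id.

```python
INVALID_XID = 0


def apply_stream(catalog, transactions):
    """Apply (xid, changes) pairs, skipping the one xid the user asked to skip."""
    applied = []
    for xid, changes in transactions:
        if catalog["subskipxid"] != INVALID_XID and catalog["subskipxid"] == xid:
            # Cleared as part of handling the skipped transaction's commit,
            # so the stored xid can never be matched a second time.
            catalog["subskipxid"] = INVALID_XID
            continue
        applied.extend(changes)
    return applied


catalog = {"subskipxid": 700}
stream = [(699, ["a"]), (700, ["b"]), (701, ["c"]), (700, ["d"])]
applied = apply_stream(catalog, stream)
```

The last pair reuses xid 700 to stand for a later, unrelated transaction; because the stored value was reset, it is applied rather than skipped.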

--
With Regards,
Amit Kapila.

#22Peter Smith
Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#21)
Re: Optionally automatically disable logical replication subscriptions on error

Much of the discussion above seems to be related to where to store the
error information and how much information is needed to be useful.

As a summary, the 5 alternatives I have seen mentioned are:

#1. Store some simple message in the pg_subscription ("I wasn't trying
to capture everything, just enough to give the user a reasonable
indication of what went wrong" [Mark-1]). Storing the error message
was also seen as a convenience for writing TAP tests ("I originally
ran into the motivation to write this patch when frustrated that TAP
tests needed to parse the apply worker log file" [Mark-2]). It also
can sometimes provide a simple clue for the error (e.g. PK violation
for table TBL) but still the user will have to look elsewhere for
details to resolve the error. So while this implementation seems good
for simple scenarios, it appears to have been shot down because the
non-trivial scenarios either have insufficient or wrong information in
the error message. Some DETAILS could have been added to give more
information but that would maybe bloat the catalog ("I have not yet
included the DETAIL field because I didn't want to bloat the catalog."
[Mark-3])

#2. Similarly another idea was to use another existing catalog table
pg_subscription_rel. This could have the same problems ("It won't be
sufficient to store information in either pg_subscription_rel or
pg_subscription." [Amit-1])

#3. There is another suggestion to use the Stats Collector to hold the
error message [Amit-2]. For me, this felt like blurring too much the
distinction between "stats tracking/metrics" and "logs". ERROR logs
must be flushed, whereas for stats (IIUC) there is no guarantee that
everything you need to see would be present. Indeed Amit wrote "But in
this case, if the stats collector missed updating the information, the
user may have to manually update the subscription and let the error
happen again to see it." [Amit-3]. Requesting the user to cause the
same error again just in case it was not captured a first time seems
too strange to me.

#4. The next idea was to have an entirely new catalog for holding the
subscription error information. I feel that storing/duplicating lots
of error information in another table seems like a bridge too far.
What about the risks of storing incorrect or insufficient information?
What is the advantage of duplicating errors over just referring to the
log files for ERROR details?

#5. Document to refer to the logs. All ERROR details are already in
the logs, and this seems to me the intuitive place to look for them.
Searching for specific errors becomes difficult programmatically (is
this really a problem other than complex TAP tests?). But here there
is no risk of missing or insufficient information captured in the log
files ("but still there will be some information related to ERROR
which we wanted the user to see unless we ask them to refer to logs
for that." [Amit-4]).

---

My preferred alternative is #5. ERRORs are logged in the log file, so
there is nothing really for this patch to do in this regard (except
documentation), and there is no risk of missing any information, no
ambiguity of having duplicated errors, and it is the intuitive place
the user would look.

So I felt the current best combination is just this:
a) A tri-state indicating the state of the subscription: e.g.
something like "enabled" ('e') / "disabled" ('d') / "auto-disabled"
('a') [Amit-5]
b) For "auto-disabled", the PG docs would be updated to tell the user to
check the logs to resolve the problem before re-enabling the
subscription
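
A minimal sketch of the tri-state in (a), assuming it is stored as a single character. The enum and helper names below are hypothetical and are not taken from any patch in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical one-character codes for the proposed tri-state. */
typedef enum SubState
{
    SUB_STATE_ENABLED = 'e',
    SUB_STATE_DISABLED = 'd',       /* disabled manually by the user */
    SUB_STATE_AUTO_DISABLED = 'a'   /* disabled by a worker after an error */
} SubState;

/* Workers would only be launched for enabled subscriptions. */
static bool
sub_should_launch_worker(SubState state)
{
    return state == SUB_STATE_ENABLED;
}

/* Hint to show the user; NULL when nothing needs attention. */
static const char *
sub_state_hint(SubState state)
{
    assert(state == SUB_STATE_ENABLED ||
           state == SUB_STATE_DISABLED ||
           state == SUB_STATE_AUTO_DISABLED);

    if (state == SUB_STATE_AUTO_DISABLED)
        return "check the server log for the ERROR, fix it, then re-enable";
    return NULL;
}
```

Keeping "disabled" and "auto-disabled" distinct is what lets the documentation direct the user to the logs only in the second case.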

//////////

IMO the main goal of this patch has not been made clear. Because of
this, I feel that you can't really judge whether this new option is
useful except in hindsight. It seems like whatever you implement can
be made to look good or bad, just by citing different test scenarios.

e.g.

* Is the goal mainly to help automated (TAP) testing? In that case,
then maybe you do want to store the error message somewhere other than
the log files. But still I wonder if results would be unpredictable
anyway - e.g if there are multiple tables all with errors then it
depends on the tablesync order of execution which error you see caused
the auto-disable, right? And if it is not predictable maybe it is less
useful.

* Is the goal to prevent some *unattended* SUBSCRIPTION from going bad
at some point in future and then going into a relaunch loop for
days/weeks and causing 1000's of errors without the user noticing? In
that case, this patch seems to be quite useful, but for this goal
maybe you don't want to be checking the tablesync workers at all, but
should only be checking the apply worker like your original v1 patch
did.

* Is the goal just to be a convenient way to disable the subscription
during the CREATE SUBSCRIPTION phase so that the user can make
corrections in peace without the workers re-launching and making more
error logs? Here the patch is helpful, but only for simple scenarios
like 1 faulty table. Imagine if there are 10 tables (all with PK
violations at DATASYNC copy) then you will encounter them one at a
time and have to re-enable the subscription 10 times, after fixing
each error in turn. So in this scenario the new option might be more
of a hindrance than a help because it would be easier if the user just
did "ALTER SUBSCRIPTION sub DISABLE" manually and fixed all the
problems in one sitting before re-enabling.

* etc

//////////

Finally, here is one last (crazy?) thought-bubble just for
consideration. I might be wrong, but my gut feeling is that the Stats
Collector is intended more for "tracking" and for "metrics" rather
than for holding duplicates of logged error messages. At the same
time, I felt that disabling an entire subscription due to a single
rogue error might be overkill sometimes. But I wonder if there is a
way to combine those two ideas so that the Stats Collector gets some
new counter for tracking the number of worker re-launches that have
occurred, meanwhile there could be a subscription option which gives a
threshold above which you would disable the subscription.
e.g.
"disable_on_error_threshold=0" default, relaunch forever
"disable_on_error_threshold=1" disable upon first error encountered.
(This is how your patch behaves now I think.)
"disable_on_error_threshold=500" disable if the re-launch errors go
unattended and happen 500 times.
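
The threshold idea could be sketched roughly as follows. This is only an illustration: the struct, its fields, and the function are hypothetical, and the sketch does not address whether the counter should reset after a successful apply.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-subscription state; none of these names exist in PostgreSQL. */
typedef struct SubErrorPolicy
{
    int threshold;        /* proposed disable_on_error_threshold; 0 = relaunch forever */
    int relaunch_errors;  /* counter that the stats collector would track */
} SubErrorPolicy;

/* Record one failed launch; return true when the subscription should be disabled. */
static bool
sub_record_error(SubErrorPolicy *p)
{
    assert(p != NULL && p->threshold >= 0);

    p->relaunch_errors++;
    return p->threshold > 0 && p->relaunch_errors >= p->threshold;
}
```

With a threshold of 1 the first error disables the subscription; with 500 the 500th error does; with 0 the function never asks for a disable.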

------
[Mark-1] /messages/by-id/A539C848-670E-454F-B31C-82D3CBE9F5AC@enterprisedb.com
[Mark-2] /messages/by-id/DB35438F-9356-4841-89A0-412709EBD3AB@enterprisedb.com
[Mark-3] /messages/by-id/DE7E13B7-DC76-416A-A98F-3BC3F80E6BE9@enterprisedb.com
[Amit-1] /messages/by-id/CAA4eK1K_JFSFrAkr_fgp3VX6hTSmjK=wNs4Tw8rUWHGp0+Bsaw@mail.gmail.com
[Amit-2] /messages/by-id/CAA4eK1+NoRbYSH1J08zi4OJ_EUMcjmxTwnmwVqZ6e_xzS0D6VA@mail.gmail.com
[Amit-3] /messages/by-id/CAA4eK1Kyx6U9yxC7OXoBD7pHC3bJ4LuNGd=OiABmiW6+qG+vEQ@mail.gmail.com
[Amit-4] /messages/by-id/CAA4eK1Kyx6U9yxC7OXoBD7pHC3bJ4LuNGd=OiABmiW6+qG+vEQ@mail.gmail.com
[Amit-5] /messages/by-id/CAA4eK1Kyx6U9yxC7OXoBD7pHC3bJ4LuNGd=OiABmiW6+qG+vEQ@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

#23Mark Dilger
Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Peter Smith (#22)
Re: Optionally automatically disable logical replication subscriptions on error

On Jun 21, 2021, at 5:57 PM, Peter Smith <smithpb2250@gmail.com> wrote:

#5. Document to refer to the logs. All ERROR details are already in
the logs, and this seems to me the intuitive place to look for them.

My original motivation came from writing TAP tests to check that the permissions systems would properly deny the apply worker when running under a non-superuser role. The idea is that the user with the responsibility for managing subscriptions won't have enough privilege to read the logs. Whatever information that user needs (if any) must be someplace else.

Searching for specific errors becomes difficult programmatically (is
this really a problem other than complex TAP tests?).

I believe there is a problem, because I remain skeptical that these errors will be both existent and rare. Either you've configured your system correctly and you get zero of these, or you've misconfigured it and you get some non-zero number of them. I don't see any reason to assume that number will be small.

The best way to deal with that is to be able to tell the system what to do with them, like "if the error has this error code and the error message matches this regular expression, then do this, else do that." That's why I think allowing triggers to be created on subscriptions makes the most sense (though it is probably the hardest system being proposed so far).
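
The "error code plus regular expression" rule can be illustrated outside the server with POSIX regex. This is not a trigger mechanism and not part of any patch here; the rule struct, the action enum, and the function name are all assumptions made for the sketch.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <string.h>

typedef enum ErrorAction { ACTION_RETRY, ACTION_DISABLE } ErrorAction;

/* Hypothetical user-supplied rule: SQLSTATE plus a message pattern. */
typedef struct ErrorRule
{
    const char *sqlstate;   /* five-character SQLSTATE to match exactly */
    const char *pattern;    /* POSIX extended regex applied to the message */
    ErrorAction action;     /* what to do when both match */
} ErrorRule;

/* Return the action of the first matching rule, or fallback if none match. */
static ErrorAction
choose_action(const ErrorRule *rules, int nrules,
              const char *sqlstate, const char *message,
              ErrorAction fallback)
{
    assert(rules != NULL || nrules == 0);

    for (int i = 0; i < nrules; i++)
    {
        regex_t re;
        bool    matched;

        if (strcmp(rules[i].sqlstate, sqlstate) != 0)
            continue;
        if (regcomp(&re, rules[i].pattern, REG_EXTENDED | REG_NOSUB) != 0)
            continue;           /* ignore rules whose pattern does not compile */
        matched = (regexec(&re, message, 0, NULL, 0) == 0);
        regfree(&re);
        if (matched)
            return rules[i].action;
    }
    return fallback;
}
```

For example, a rule for SQLSTATE 23505 (unique_violation) with the pattern "duplicate key" would select the disable action for a primary-key conflict, while a connection failure would fall through to the retry fallback.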

But here there
is no risk of missing or insufficient information captured in the log
files ("but still there will be some information related to ERROR
which we wanted the user to see unless we ask them to refer to logs
for that." [Amit-4]).

Not only is there a problem if the user doesn't have permission to view the logs, but also, if we automatically disable the subscription until the error is manually cleared, the logs might be rotated out of existence before the user takes any action. In that case, the logs will be entirely missing, and not even the error message will remain. At least with the patch I submitted, the error message will remain, though I take Amit's point that there are deficiencies in handling parallel tablesync workers, etc.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#24Mark Dilger
Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Peter Smith (#22)
Re: Optionally automatically disable logical replication subscriptions on error

On Jun 21, 2021, at 5:57 PM, Peter Smith <smithpb2250@gmail.com> wrote:

* Is the goal mainly to help automated (TAP) testing?

Absolutely, that was my original motivation. But I don't think that is the primary reason the patch would be accepted. There is a cost to having the logical replication workers attempt ad infinitum to apply a transaction that will never apply.

Also, if you are waiting for a subscription to catch up, it is far from obvious that you will wait forever.

In that case,
then maybe you do want to store the error message somewhere other than
the log files. But still I wonder if results would be unpredictable
anyway - e.g if there are multiple tables all with errors then it
depends on the tablesync order of execution which error you see caused
the auto-disable, right? And if it is not predictable maybe it is less
useful.

But if you are writing a TAP test, you should be the one controlling whether that is the case. I don't think it would be unpredictable from the point of view of the test author.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#25Masahiko Sawada
Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#21)
Re: Optionally automatically disable logical replication subscriptions on error

On Mon, Jun 21, 2021 at 7:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jun 21, 2021 at 11:19 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jun 21, 2021 at 10:55 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

I don't mind if you want to store more information, and maybe that needs to be stored somewhere else. Do you believe pg_subscription_rel is a suitable location?

It won't be sufficient to store information in either
pg_subscription_rel or pg_subscription. I think if we want to store
the required information in a catalog then we need to define a new
catalog (pg_subscription_conflicts or something like that) with
information corresponding to each rel in subscription (srsubid oid
(Reference to subscription), srrelid oid (Reference to relation),
<columns for error_info>). OTOH, we can choose to send the error
information to stats collector which will then be available via stat
view and update system catalog to disable the subscription but there
will be a risk that we might send info of failed transaction to stats
collector but then fail to update system catalog to disable the
subscription.

I think we should store the input from the user (like disable_on_error
flag or xid to skip) in the system catalog pg_subscription and send
the error information (subscription_id, rel_id, xid of failed xact,
error_code, error_message, etc.) to the stats collector which can be
used to display such information via a stat view.

The disable_on_error flag handling could be that on error it sends the
required error info to stats collector and then updates the subenabled
in pg_subscription. In rare conditions, where we are able to send the
message but couldn't update the subenabled info in pg_subscription
either due to some error or server restart, the apply worker would
again try to apply the same change and would hit the same error again
which I think should be fine because it will ultimately succeed.

The skip xid handling will also be somewhat similar where on an error,
we will send the error information to stats collector which will be
displayed via stats view. Then the user is expected to ask for skip
xid (Alter Subscription ... SKIP <xid_value>) based on information
displayed via stat view. Now, the apply worker can skip changes from
such a transaction, and then during processing of commit record of the
skipped transaction, it should update xid to invalid value, so that
next time that shouldn't be used. I think it is important to update
xid to an invalid value as part of the skipped transaction because
otherwise, after the restart, we won't be able to decide whether we
still want to skip the xid stored for a subscription.

Sounds reasonable.

The feature that sends the error information to the stats collector is
common to both features and is also useful in itself. As discussed in
that skip transaction patch thread, it would also be good if we write
error information (relation, action, xid, etc) to the server log too.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#26Amit Kapila
Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#22)
Re: Optionally automatically disable logical replication subscriptions on error

On Tue, Jun 22, 2021 at 6:27 AM Peter Smith <smithpb2250@gmail.com> wrote:

#3. There is another suggestion to use the Stats Collector to hold the
error message [Amit-2]. For me, this felt like blurring too much the
distinction between "stats tracking/metrics" and "logs". ERROR logs
must be flushed, whereas for stats (IIUC) there is no guarantee that
everything you need to see would be present. Indeed Amit wrote "But in
this case, if the stats collector missed updating the information, the
user may have to manually update the subscription and let the error
happen again to see it." [Amit-3]. Requesting the user to cause the
same error again just in case it was not captured a first time seems
too strange to me.

I don't think it will often be the case that the stats collector will
miss updating the information. I am not feeling comfortable storing
error information in system catalogs. We have some other views which
capture somewhat similar conflict information
(pg_stat_database_conflicts) or failed transactions information. So, I
thought here we are extending the similar concept by storing some
additional information about errors.

--
With Regards,
Amit Kapila.

#27Mark Dilger
Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Peter Smith (#22)
Re: Optionally automatically disable logical replication subscriptions on error

On Jun 21, 2021, at 5:57 PM, Peter Smith <smithpb2250@gmail.com> wrote:

* Is the goal to prevent some *unattended* SUBSCRIPTION from going bad
at some point in future and then going into a relaunch loop for
days/weeks and causing 1000's of errors without the user noticing. In
that case, this patch seems to be quite useful, but for this goal
maybe you don't want to be checking the tablesync workers at all, but
should only be checking the apply worker like your original v1 patch
did.

Yeah, my motivation was preventing an infinite loop, and providing a clean way for the users to know that replication they are waiting for won't ever complete, rather than having to infer that it will never halt.

* Is the goal just to be a convenient way to disable the subscription
during the CREATE SUBSCRIPTION phase so that the user can make
corrections in peace without the workers re-launching and making more
error logs?

No. This is not and never was my motivation. It's an interesting question, but that idea never crossed my mind. I'm not sure what changes somebody would want to make *after* creating the subscription. Certainly, there may be problems with how they have things set up, but they won't know that until the first error happens.

Here the patch is helpful, but only for simple scenarios
like 1 faulty table. Imagine if there are 10 tables (all with PK
violations at DATASYNC copy) then you will encounter them one at a
time and have to re-enable the subscription 10 times, after fixing
each error in turn.

You are assuming disable_on_error=true. It is false by default. But ok, let's accept that assumption for the sake of argument. Now, will you have to manually go through the process 10 times? I'm not sure. The user might figure out their mistake after seeing the first error.

So in this scenario the new option might be more
of a hindrance than a help because it would be easier if the user just
did "ALTER SUBSCRIPTION sub DISABLE" manually and fixed all the
problems in one sitting before re-enabling.

Yeah, but since the new option is off by default, I don't see any sensible complaint.

* etc

//////////

Finally, here is one last (crazy?) thought-bubble just for
consideration. I might be wrong, but my gut feeling is that the Stats
Collector is intended more for "tracking" and for "metrics" rather
than for holding duplicates of logged error messages. At the same
time, I felt that disabling an entire subscription due to a single
rogue error might be overkill sometimes.

I'm happy to entertain criticism of the particulars of how my patch approaches this problem, but it is already making a distinction between transient errors (resources, network, etc.) vs. ones that are non-transient. Again, I might not have drawn the line in the right place, but the patch is not intended to disable subscriptions in response to transient errors.

But I wonder if there is a
way to combine those two ideas so that the Stats Collector gets some
new counter for tracking the number of worker re-launches that have
occurred, meanwhile there could be a subscription option which gives a
threshold above which you would disable the subscription.
e.g.
"disable_on_error_threshold=0" default, relaunch forever
"disable_on_error_threshold=1" disable upon first error encountered.
(This is how your patch behaves now I think.)
"disable_on_error_threshold=500" disable if the re-launch errors go
unattended and happen 500 times.

That sounds like a misfeature to me. You could have a subscription that works fine for a month, surviving numerous short network outages, but then gets autodisabled after a longer network outage. I'm not sure why anybody would want that. You might argue for exponential backoff, where it never gets autodisabled on transient errors, but retries less frequently. But I don't want to expand the scope of this patch to include that, at least not without a lot more evidence that it is needed.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#28Amit Kapila
Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#21)
Re: Optionally automatically disable logical replication subscriptions on error

On Mon, Jun 21, 2021 at 4:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jun 21, 2021 at 11:19 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

I think we should store the input from the user (like disable_on_error
flag or xid to skip) in the system catalog pg_subscription and send
the error information (subscription_id, rel_id, xid of failed xact,
error_code, error_message, etc.) to the stats collector which can be
used to display such information via a stat view.

The disable_on_error flag handling could be that on error it sends the
required error info to stats collector and then updates the subenabled
in pg_subscription. In rare conditions, where we are able to send the
message but couldn't update the subenabled info in pg_subscription
either due to some error or server restart, the apply worker would
again try to apply the same change and would hit the same error again
which I think should be fine because it will ultimately succeed.

The skip xid handling will also be somewhat similar where on an error,
we will send the error information to stats collector which will be
displayed via stats view. Then the user is expected to ask for skip
xid (Alter Subscription ... SKIP <xid_value>) based on information
displayed via stat view. Now, the apply worker can skip changes from
such a transaction, and then during processing of commit record of the
skipped transaction, it should update xid to invalid value, so that
next time that shouldn't be used. I think it is important to update
xid to an invalid value as part of the skipped transaction because
otherwise, after the restart, we won't be able to decide whether we
still want to skip the xid stored for a subscription.

One minor detail I missed in the above sketch for skipped transaction
feature was that actually we only need replication origin state from
the commit record of the skipped transaction and then I think we need
to start a transaction, update the xid value to invalid, set the
replication origin state and commit that transaction.
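
The skip-xid bookkeeping described above can be modelled in a few lines. This is a toy model under stated assumptions: the types are stand-ins for TransactionId and the replication origin position, the struct and function names are hypothetical, and the single function call stands in for the local transaction that would clear the xid and set the origin state together.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Xid;               /* stand-in for TransactionId */
#define INVALID_XID ((Xid) 0)

/* Hypothetical per-subscription state for the skip feature. */
typedef struct SubSkipState
{
    Xid      skip_xid;    /* set by the user (proposed: ALTER SUBSCRIPTION ... SKIP) */
    uint64_t origin_lsn;  /* replication origin progress */
} SubSkipState;

/* Changes are skipped only while a valid xid is stored and it matches. */
static bool
should_skip_change(const SubSkipState *s, Xid remote_xid)
{
    assert(s != NULL);
    return s->skip_xid != INVALID_XID && s->skip_xid == remote_xid;
}

/*
 * Called for the commit record of a remote transaction. If that transaction
 * was the skipped one, clear the stored xid so it cannot be reused after a
 * restart. The origin position advances in the same step either way.
 */
static void
finish_transaction(SubSkipState *s, Xid remote_xid, uint64_t commit_lsn)
{
    if (should_skip_change(s, remote_xid))
        s->skip_xid = INVALID_XID;
    s->origin_lsn = commit_lsn;
}
```

Clearing the xid and advancing the origin in one step is the point of the detail above: after a restart the worker resumes past the skipped transaction and finds no stale xid to skip.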

--
With Regards,
Amit Kapila.

#29Masahiko Sawada
Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Mark Dilger (#10)
Re: Optionally automatically disable logical replication subscriptions on error

On Mon, Jun 21, 2021 at 11:26 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

On Jun 20, 2021, at 7:17 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I will submit the patch.

Great, thanks!

I've submitted the patches on that thread[1]. There are three patches:
skipping the transaction on the subscriber side, reporting error
details in the errcontext, and reporting the error details to the
stats collector. Feedback is very welcome.

[1]: /messages/by-id/CAD21AoBU4jGEO6AXcykQ9y7tat0RrB5--8ZoJgfcj+LPs7nFZQ@mail.gmail.com

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#30osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Masahiko Sawada (#29)
1 attachment(s)
RE: Optionally automatically disable logical replication subscriptions on error

On Monday, June 28, 2021 1:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Jun 21, 2021 at 11:26 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

On Jun 20, 2021, at 7:17 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I will submit the patch.

Great, thanks!

I've submitted the patches on that thread[1]. There are three patches:
skipping the transaction on the subscriber side, reporting error details in the
errcontext, and reporting the error details to the stats collector. Feedback is
very welcome.

[1]
/messages/by-id/CAD21AoBU4jGEO6AXcykQ9y7tat0RrB5--8ZoJgfcj%2BLPs7nFZQ%40mail.gmail.com

Hi, thanks Sawada-san for continuing to update the skip xid patch in that thread.

This thread has been quiet since the patch submission.
I've rebased the 'disable_on_error' option
so that it can be applied on top of the skip xid patch shared in [1].
I've listed Mark Dilger as the original author in the commit message.

This patch is simply rebased to reactivate this thread,
so there are still pending items to discuss, for example
how we should deal with multiple errors from several table sync workers.

I extracted only the 'disable_on_error' option
because I felt the skip xid patch and the latest error message already
fulfill the motivation of making TAP tests easy to write.

[1]: /messages/by-id/CAD21AoDY-9_x819F_m1_wfCVXXFJrGiSmR2MfC9Nw4nW8Om0qA@mail.gmail.com

Best Regards,
Takamichi Osumi

Attachments:

v3-Optionally-disabling-subscriptions-on-error.patchapplication/octet-stream; name=v3-Optionally-disabling-subscriptions-on-error.patch
#31vignesh C
vignesh C
vignesh21@gmail.com
In reply to: osumi.takamichi@fujitsu.com (#30)
Re: Optionally automatically disable logical replication subscriptions on error

On Tue, Nov 2, 2021 at 4:12 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

On Monday, June 28, 2021 1:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Jun 21, 2021 at 11:26 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

On Jun 20, 2021, at 7:17 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I will submit the patch.

Great, thanks!

I've submitted the patches on that thread[1]. There are three patches:
skipping the transaction on the subscriber side, reporting error details in the
errcontext, and reporting the error details to the stats collector. Feedback is
very welcome.

[1]
/messages/by-id/CAD21AoBU4jGEO6AXcykQ9y7tat0RrB5--8ZoJgfcj%2BLPs7nFZQ%40mail.gmail.com

Hi, thanks Sawada-san for continuing to update the skip xid patch in that thread.

This thread has been quiet since the patch submission.
I've rebased the 'disable_on_error' option
so that it can be applied on top of the skip xid patch shared in [1].
I've listed Mark Dilger as the original author in the commit message.

This patch is simply rebased to reactivate this thread,
so there are still pending items to discuss, for example
how we should deal with multiple errors from several table sync workers.

I extracted only the 'disable_on_error' option
because I felt the skip xid patch and the latest error message already
fulfill the motivation of making TAP tests easy to write.

Thanks for the updated patch. Please create a Commitfest entry for
this. It will help to have a look at CFBot results for the patch, also
if required rebase and post a patch on top of Head.

Regards,
Vignesh

#32osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: vignesh C (#31)
RE: Optionally automatically disable logical replication subscriptions on error

On Monday, November 8, 2021 10:15 PM vignesh C <vignesh21@gmail.com> wrote:

Thanks for the updated patch. Please create a Commitfest entry for this. It will
help to have a look at CFBot results for the patch, also if required rebase and
post a patch on top of Head.

As requested, I created a new entry for this - [1]

FYI: the skip xid patch has been updated to v20 in [2],
but the v3 patch for disable_on_error is not affected by this update
and is still applicable with no regression.

[1]: https://commitfest.postgresql.org/36/3407/
[2]: /messages/by-id/CAD21AoAT42mhcqeB1jPfRL1+EUHbZk8MMY_fBgsyZvJeKNpG+w@mail.gmail.com

Best Regards,
Takamichi Osumi

#33Greg Nancarrow
Greg Nancarrow
gregn4422@gmail.com
In reply to: osumi.takamichi@fujitsu.com (#32)
Re: Optionally automatically disable logical replication subscriptions on error

On Wed, Nov 10, 2021 at 12:26 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

On Monday, November 8, 2021 10:15 PM vignesh C <vignesh21@gmail.com> wrote:

Thanks for the updated patch. Please create a Commitfest entry for this. It will
help to have a look at CFBot results for the patch, also if required rebase and
post a patch on top of Head.

As requested, created a new entry for this - [1]

FYI: the skip xid patch has been updated to v20 in [2]
but the v3 for disable_on_error is not affected by this update
and still applicable with no regression.

[1] - https://commitfest.postgresql.org/36/3407/
[2] - /messages/by-id/CAD21AoAT42mhcqeB1jPfRL1+EUHbZk8MMY_fBgsyZvJeKNpG+w@mail.gmail.com

I had a look at this patch and have a couple of initial review
comments for some issues I spotted:

src/backend/commands/subscriptioncmds.c
(1) bad array entry assignment
The following code block added by the patch assigns
"values[Anum_pg_subscription_subdisableonerr - 1]" twice, resulting in
it being always set to true, rather than the specified option value:

+  if (IsSet(opts.specified_opts, SUBOPT_DISABLE_ON_ERR))
+  {
+    values[Anum_pg_subscription_subdisableonerr - 1]
+       = BoolGetDatum(opts.disableonerr);
+     values[Anum_pg_subscription_subdisableonerr - 1]
+       = true;
+  }

The 2nd line is meant to instead be
"replaces[Anum_pg_subscription_subdisableonerr - 1] = true".
(compare to handling for other similar options)
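
The difference between the two arrays can be shown standalone: one array carries the new column values, and a parallel boolean array marks which columns the update should overwrite. A minimal sketch, using a stand-in Datum type and a hypothetical attribute number rather than PostgreSQL's real headers:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t Datum;            /* stand-in for PostgreSQL's Datum */
#define NATTS 4                     /* hypothetical number of catalog columns */
#define ANUM_SUBDISABLEONERR 3      /* hypothetical 1-based attribute number */

/* Stage an update of the disable-on-error column. */
static void
set_disable_on_error(Datum values[], bool replaces[], bool new_setting)
{
    assert(values != NULL && replaces != NULL);

    /* values[] carries the value the user asked for ... */
    values[ANUM_SUBDISABLEONERR - 1] = (Datum) new_setting;
    /* ... and replaces[] only flags that this column is being changed. */
    replaces[ANUM_SUBDISABLEONERR - 1] = true;
}
```

Writing `true` into `values[]` a second time, as the reviewed block did, would both overwrite the requested setting and leave the column unflagged in `replaces[]`.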

src/backend/replication/logical/worker.c
(2) unreachable code?
In the patch code there seems to be some instances of unreachable code
after re-throwing errors:

e.g.

+ /* If we caught an error above, disable the subscription */
+ if (disable_subscription)
+ {
+   ReThrowError(DisableSubscriptionOnError(cctx));
+   MemoryContextSwitchTo(ecxt);
+ }
+ else
+ {
+   PG_RE_THROW();
+   MemoryContextSwitchTo(ecxt);
+ }
+ if (disable_subscription)
+ {
+   ReThrowError(DisableSubscriptionOnError(cctx));
+   MemoryContextSwitchTo(ecxt);
+ }

I'm guessing it was intended to do the "MemoryContextSwitchTo(ecxt);"
before re-throwing (?), but it's not really clear, as in the 1st and
3rd cases, the DisableSubscriptionOnError() calls anyway immediately
switch the memory context to cctx.

Regards,
Greg Nancarrow
Fujitsu Australia

#34Greg Nancarrow
Greg Nancarrow
gregn4422@gmail.com
In reply to: Greg Nancarrow (#33)
Re: Optionally automatically disable logical replication subscriptions on error

On Wed, Nov 10, 2021 at 3:22 PM Greg Nancarrow <gregn4422@gmail.com> wrote:

I had a look at this patch and have a couple of initial review
comments for some issues I spotted:

Incidentally, I found that the v3 patch only applies after the skip xid v20
patch [1] has been applied.

[1]: /messages/by-id/CAD21AoAT42mhcqeB1jPfRL1+EUHbZk8MMY_fBgsyZvJeKNpG+w@mail.gmail.com

Regards,
Greg Nancarrow
Fujitsu Australia

#35osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Greg Nancarrow (#33)
1 attachment(s)
RE: Optionally automatically disable logical replication subscriptions on error

On Wednesday, November 10, 2021 1:23 PM Greg Nancarrow <gregn4422@gmail.com> wrote:

On Wed, Nov 10, 2021 at 12:26 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

On Monday, November 8, 2021 10:15 PM vignesh C <vignesh21@gmail.com> wrote:

Thanks for the updated patch. Please create a Commitfest entry for
this. It will help to have a look at CFBot results for the patch,
also if required rebase and post a patch on top of Head.

As requested, created a new entry for this - [1]

FYI: the skip xid patch has been updated to v20 in [2] but the v3 for
disable_on_error is not affected by this update and still applicable
with no regression.

[1] - https://commitfest.postgresql.org/36/3407/
[2] - /messages/by-id/CAD21AoAT42mhcqeB1jPfRL1+EUHbZk8MMY_fBgsyZvJeKNpG%2Bw%40mail.gmail.com

I had a look at this patch and have a couple of initial review comments for some
issues I spotted:

Thank you for checking it.

src/backend/commands/subscriptioncmds.c
(1) bad array entry assignment
The following code block added by the patch assigns
"values[Anum_pg_subscription_subdisableonerr - 1]" twice, resulting in it
being always set to true, rather than the specified option value:

+  if (IsSet(opts.specified_opts, SUBOPT_DISABLE_ON_ERR))
+  {
+    values[Anum_pg_subscription_subdisableonerr - 1]
+       = BoolGetDatum(opts.disableonerr);
+     values[Anum_pg_subscription_subdisableonerr - 1]
+       = true;
+  }

The 2nd line is meant to instead be
"replaces[Anum_pg_subscription_subdisableonerr - 1] = true".
(compare to handling for other similar options)

Oops, fixed.

src/backend/replication/logical/worker.c
(2) unreachable code?
In the patch code there seems to be some instances of unreachable code after
re-throwing errors:

e.g.

+ /* If we caught an error above, disable the subscription */
+ if (disable_subscription)
+ {
+   ReThrowError(DisableSubscriptionOnError(cctx));
+   MemoryContextSwitchTo(ecxt);
+ }
+ else
+ {
+   PG_RE_THROW();
+   MemoryContextSwitchTo(ecxt);
+ }
+ if (disable_subscription)
+ {
+   ReThrowError(DisableSubscriptionOnError(cctx));
+   MemoryContextSwitchTo(ecxt);
+ }

I'm guessing it was intended to do the "MemoryContextSwitchTo(ecxt);"
before re-throwing (?), but it's not really clear, as in the 1st and 3rd cases, the
DisableSubscriptionOnError() calls anyway immediately switch the memory
context to cctx.

I think you are right.
I've fixed it along the lines described below.

After an error happens, some additional work may be needed
(e.g. reporting the stats of the table sync/apply worker
with pgstat_report_subworker_error(), or
updating the catalog with DisableSubscriptionOnError()).
For that work, I restore the memory context that was in use
before the error (cctx) and save the error's memory context (ecxt).
Then I do the additional work and switch the memory context back to
ecxt just before the rethrow. As you described,
in contrast to PG_RE_THROW, DisableSubscriptionOnError() changes
the memory context immediately at the top of the function,
so for this case I don't call MemoryContextSwitchTo().
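
The ordering described above (switch to cctx, do the extra work, switch back to ecxt, then rethrow) can be modelled with setjmp/longjmp. This is a toy model, not PostgreSQL code: the MemCtx type, switch_to(), rethrow(), and the counter are stand-ins invented for the sketch, and it collapses the two rethrow paths into one.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdbool.h>

typedef struct MemCtx { const char *name; } MemCtx;   /* stand-in for a memory context */

static MemCtx  apply_ctx = {"apply"};   /* plays the role of cctx */
static MemCtx  error_ctx = {"error"};   /* plays the role of ecxt */
static MemCtx *current_ctx = &apply_ctx;
static jmp_buf outer_handler;           /* where a rethrown error lands */
static int     catalog_updates = 0;     /* counts the simulated "disable" work */

/* Switch the current context and return the previous one. */
static MemCtx *
switch_to(MemCtx *next)
{
    MemCtx *old = current_ctx;

    current_ctx = next;
    return old;
}

/* Stand-in for rethrowing: control never returns to the caller. */
static void
rethrow(void)
{
    longjmp(outer_handler, 1);
}

/* Body of a catch block: entered while the error context is current. */
static void
handle_apply_error(bool disable_on_error)
{
    MemCtx *ecxt;

    assert(current_ctx == &error_ctx);
    ecxt = switch_to(&apply_ctx);   /* restore the pre-error context, keep ecxt */
    if (disable_on_error)
        catalog_updates++;          /* the additional work happens here */
    switch_to(ecxt);                /* must precede the rethrow to have any effect */
    rethrow();
}

/* Simulate one failing apply attempt; returns true if the error propagated. */
static bool
run_worker(bool disable_on_error)
{
    current_ctx = &apply_ctx;
    if (setjmp(outer_handler) == 0)
    {
        current_ctx = &error_ctx;   /* simulate an error having been caught */
        handle_apply_error(disable_on_error);
        return false;               /* not reached: the handler always rethrows */
    }
    return true;
}
```

Any statement placed after `rethrow()` in `handle_apply_error()` would never run, which is the unreachable-code problem raised in the review.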

Another important change in my modification
is the case where LogicalRepApplyLoop fails and
apply_error_callback_arg.command == 0. The original
skip xid patch just calls PG_RE_THROW() here,
but my previous v3 code omitted this macro in this case.
Therefore, I've fixed this part as well.

The C code has been run through pgindent.

Note that this depends on the v20 skip xid patch in [1].

[1]: /messages/by-id/CAD21AoAT42mhcqeB1jPfRL1+EUHbZk8MMY_fBgsyZvJeKNpG+w@mail.gmail.com

Best Regards,
Takamichi Osumi

Attachments:

v4-0001-Optionally-disabling-subscriptions-on-error.patchapplication/octet-stream; name=v4-0001-Optionally-disabling-subscriptions-on-error.patch
#36vignesh C
vignesh C
vignesh21@gmail.com
In reply to: osumi.takamichi@fujitsu.com (#35)
Re: Optionally automatically disable logical replication subscriptions on error

On Thu, Nov 11, 2021 at 2:50 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

On Wednesday, November 10, 2021 1:23 PM Greg Nancarrow <gregn4422@gmail.com> wrote:

On Wed, Nov 10, 2021 at 12:26 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

On Monday, November 8, 2021 10:15 PM vignesh C <vignesh21@gmail.com>

wrote:

Thanks for the updated patch. Please create a Commitfest entry for
this. It will help to have a look at CFBot results for the patch,
also if required rebase and post a patch on top of Head.

As requested, created a new entry for this - [1]

FYI: the skip xid patch has been updated to v20 in [2] but the v3 for
disable_on_error is not affected by this update and still applicable
with no regression.

[1] - https://commitfest.postgresql.org/36/3407/
[2] - /messages/by-id/CAD21AoAT42mhcqeB1jPfRL1+EUHbZk8MMY_fBgsyZvJeKNpG+w@mail.gmail.com

I had a look at this patch and have a couple of initial review comments for some
issues I spotted:

Thank you for checking it.

src/backend/commands/subscriptioncmds.c
(1) bad array entry assignment
The following code block added by the patch assigns
"values[Anum_pg_subscription_subdisableonerr - 1]" twice, resulting in it
being always set to true, rather than the specified option value:

+  if (IsSet(opts.specified_opts, SUBOPT_DISABLE_ON_ERR))  {
+    values[Anum_pg_subscription_subdisableonerr - 1]
+       = BoolGetDatum(opts.disableonerr);
+     values[Anum_pg_subscription_subdisableonerr - 1]
+       = true;
+  }

The 2nd line is meant to instead be
"replaces[Anum_pg_subscription_subdisableonerr - 1] = true".
(compare to handling for other similar options)

Oops, fixed.
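
For readers unfamiliar with the values[]/replaces[] convention behind this fix, here is a minimal stand-alone sketch. It is a simplified model, not the catalog code: toy_modify_row() and the column numbering are hypothetical, and the real tuple-modification routine works on Datums and null flags as well. The point is only that values[] carries the new datum while replaces[] marks which columns are overwritten.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NATTS 3
#define TOY_ATT_DISABLEONERR 3  /* hypothetical 1-based column number */

/*
 * Simplified model of a "modify tuple" helper: copy values[i] into the
 * row only for columns where replaces[i] is true.
 */
static void toy_modify_row(int row[TOY_NATTS],
                           const int values[TOY_NATTS],
                           const bool replaces[TOY_NATTS])
{
    for (int i = 0; i < TOY_NATTS; i++)
    {
        if (replaces[i])
            row[i] = values[i];
    }
}

/* Corrected pattern: one assignment to values[], one to replaces[]. */
static void toy_set_disable_on_error(int row[TOY_NATTS], bool requested)
{
    int  values[TOY_NATTS] = {0};
    bool replaces[TOY_NATTS] = {false};

    values[TOY_ATT_DISABLEONERR - 1] = requested;
    replaces[TOY_ATT_DISABLEONERR - 1] = true;

    toy_modify_row(row, values, replaces);
}
```

With a row of {7, 9, 1}, calling toy_set_disable_on_error(row, false) changes only the third column; the other columns keep their values because their replaces[] entries stay false.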

src/backend/replication/logical/worker.c
(2) unreachable code?
In the patch code there seems to be some instances of unreachable code after
re-throwing errors:

e.g.

+ /* If we caught an error above, disable the subscription */
+ if (disable_subscription)
+ {
+   ReThrowError(DisableSubscriptionOnError(cctx));
+   MemoryContextSwitchTo(ecxt);
+ }
+ else
+ {
+   PG_RE_THROW();
+   MemoryContextSwitchTo(ecxt);
+ }
+ if (disable_subscription)
+ {
+   ReThrowError(DisableSubscriptionOnError(cctx));
+   MemoryContextSwitchTo(ecxt);
+ }

I'm guessing it was intended to do the "MemoryContextSwitchTo(ecxt);"
before re-throwing (?), but it's not really clear, as in the 1st and 3rd cases, the
DisableSubscriptionOnError() calls anyway immediately switch the memory
context to cctx.

You are right I think.
Fixed based on an idea below.

After an error happens, some additional work may be needed
(e.g. reporting the stats of the table sync/apply worker
with pgstat_report_subworker_error(), or
updating the catalog with DisableSubscriptionOnError()).
For that work, I restore the memory context that was in use before the error (cctx)
and save the error's memory context (ecxt). Then I
do the additional work and switch the memory context back to ecxt
just before the rethrow. As you described,
in contrast to PG_RE_THROW, DisableSubscriptionOnError() changes
the memory context immediately at the top of the function,
so in that case I don't call MemoryContextSwitchTo().

Another important modification
concerns the case where LogicalRepApplyLoop fails and
apply_error_callback_arg.command == 0. The original
skip xid patch just calls PG_RE_THROW() there,
but my previous v3 code missed this macro in that case.
I've fixed this part as well.

The C code has been run through pgindent.

Note that this depends on the v20 skip xid patch in [1]

Thanks for the updated patch. A few comments:
1) tab completion should be added for disable_on_error:
/* Complete "CREATE SUBSCRIPTION <name> ... WITH ( <opt>" */
else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
COMPLETE_WITH("binary", "connect", "copy_data", "create_slot",
"enabled", "slot_name", "streaming",
"synchronous_commit", "two_phase");

2) disable_on_error is supported by ALTER SUBSCRIPTION; the same
should be documented:
@@ -871,11 +886,19 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
 {
     supported_opts = (SUBOPT_SLOT_NAME |
                       SUBOPT_SYNCHRONOUS_COMMIT | SUBOPT_BINARY |
-                      SUBOPT_STREAMING);
+                      SUBOPT_STREAMING | SUBOPT_DISABLE_ON_ERR);

     parse_subscription_options(pstate, stmt->options,
                                supported_opts, &opts);

+    if (IsSet(opts.specified_opts, SUBOPT_DISABLE_ON_ERR))
+    {
+        values[Anum_pg_subscription_subdisableonerr - 1]
+            = BoolGetDatum(opts.disableonerr);
+        replaces[Anum_pg_subscription_subdisableonerr - 1]
+            = true;
+    }
+

3) Describe subscriptions (\dRs+) should include displaying of disableonerr:
\dRs+ sub1
                                               List of subscriptions
 Name | Owner   | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo
------+---------+---------+-------------+--------+-----------+------------------+--------------------+---------------------------
 sub1 | vignesh | t       | {pub1}      | f      | f         | d                | off                | dbname=postgres port=5432
(1 row)

4) I felt transicent should be transient, might be a typo:
+          Specifies whether the subscription should be automatically disabled
+          if replicating data from the publisher triggers non-transicent errors
+          such as referential integrity or permissions errors. The default is
+          <literal>false</literal>.
5) The commented use PostgresNode and use TestLib can be removed:
+# Test of logical replication subscription self-disabling feature
+use strict;
+use warnings;
+# use PostgresNode;
+# use TestLib;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More tests => 10;

Regards,
Vignesh

#37Greg Nancarrow
Greg Nancarrow
gregn4422@gmail.com
In reply to: osumi.takamichi@fujitsu.com (#35)
Re: Optionally automatically disable logical replication subscriptions on error

On Thu, Nov 11, 2021 at 8:20 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

C codes are checked by pgindent.

Note that this depends on the v20 skip xid patch in [1]

Some comments on the v4 patch:

(1) Patch subject
I think the patch subject should say "disable" instead of "disabling":
Optionally disable subscriptions on error

doc/src/sgml/ref/create_subscription.sgml
(2) spelling mistake
+ if replicating data from the publisher triggers non-transicent errors

non-transicent -> non-transient

(I notice Vignesh also pointed this out)

src/backend/replication/logical/worker.c
(3) calling geterrcode()
The new IsSubscriptionDisablingError() function calls geterrcode().
According to the function comment for geterrcode(), it is only
intended for use in error callbacks.
Instead of calling geterrcode(), couldn't the ErrorData from PG_CATCH
block be passed to IsSubscriptionDisablingError() instead (from which
it can get the sqlerrcode)?

(4) DisableSubscriptionOnError
DisableSubscriptionOnError() is again calling MemoryContextSwitchTo()
and CopyErrorData().
I think the ErrorData from the PG_CATCH block could simply be passed
to DisableSubscriptionOnError() instead of the memory context, and the
existing MemoryContextSwitchTo() and CopyErrorData() calls could be
removed from it.

AFAICS, applying (3) and (4) above would make the code a lot cleaner.
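
Suggestions (3) and (4) amount to classifying an error record that the caller already holds, instead of reaching back into global error state. A minimal sketch of that shape follows. Everything here is hypothetical: ToyErrorData is a trimmed-down stand-in for the real ErrorData, and the choice of SQLSTATE classes 08 (connection exception) and 53 (insufficient resources) as "not worth disabling for" is an assumption made for illustration, not the patch's actual rule.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical, trimmed-down error record; the real ErrorData has more fields. */
typedef struct ToyErrorData
{
    char        sqlstate[6];    /* five-character SQLSTATE plus terminator */
    const char *message;
} ToyErrorData;

/*
 * Decide from the passed-in record whether the error should disable the
 * subscription. Assumption for illustration only: errors in SQLSTATE
 * class 08 or 53 may clear up on retry, so they do not count.
 */
static bool toy_is_disabling_error(const ToyErrorData *edata)
{
    bool retryable = strncmp(edata->sqlstate, "08", 2) == 0 ||
                     strncmp(edata->sqlstate, "53", 2) == 0;

    return !retryable;
}

/* The catch block hands over the record it already copied. */
static bool toy_should_disable(const ToyErrorData *edata, bool disable_on_error)
{
    return disable_on_error && toy_is_disabling_error(edata);
}
```

Because the record is an explicit argument, neither function needs to copy error data or switch memory contexts itself; that stays the responsibility of the single catch block that calls them.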

Regards,
Greg Nancarrow
Fujitsu Australia

#38osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: vignesh C (#36)
1 attachment(s)
RE: Optionally automatically disable logical replication subscriptions on error

On Friday, November 12, 2021 1:09 PM vignesh C <vignesh21@gmail.com> wrote:

On Thu, Nov 11, 2021 at 2:50 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
Thanks for the updated patch. A few comments:
1) tab completion should be added for disable_on_error:
/* Complete "CREATE SUBSCRIPTION <name> ... WITH ( <opt>" */
else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
COMPLETE_WITH("binary", "connect", "copy_data", "create_slot",
"enabled", "slot_name", "streaming",
"synchronous_commit", "two_phase");

Fixed.

2) disable_on_error is supported by ALTER SUBSCRIPTION; the same should be
documented:
@@ -871,11 +886,19 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
 {
     supported_opts = (SUBOPT_SLOT_NAME |
                       SUBOPT_SYNCHRONOUS_COMMIT | SUBOPT_BINARY |
-                      SUBOPT_STREAMING);
+                      SUBOPT_STREAMING | SUBOPT_DISABLE_ON_ERR);

     parse_subscription_options(pstate, stmt->options,
                                supported_opts, &opts);

+    if (IsSet(opts.specified_opts, SUBOPT_DISABLE_ON_ERR))
+    {
+        values[Anum_pg_subscription_subdisableonerr - 1]
+            = BoolGetDatum(opts.disableonerr);
+        replaces[Anum_pg_subscription_subdisableonerr - 1]
+            = true;
+    }
+

Fixed the documentation. Also, added one test for ALTER SUBSCRIPTION.

3) Describe subscriptions (\dRs+) should include displaying of disableonerr:
\dRs+ sub1
                                               List of subscriptions
 Name | Owner   | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo
------+---------+---------+-------------+--------+-----------+------------------+--------------------+---------------------------
 sub1 | vignesh | t       | {pub1}      | f      | f         | d                | off                | dbname=postgres port=5432
(1 row)

Fixed.

4) I felt transicent should be transient, might be a typo:
+          Specifies whether the subscription should be automatically disabled
+          if replicating data from the publisher triggers non-transicent errors
+          such as referential integrity or permissions errors. The default is
+          <literal>false</literal>.

Fixed.

5) The commented use PostgresNode and use TestLib can be removed:
+# Test of logical replication subscription self-disabling feature
+use strict;
+use warnings;
+# use PostgresNode;
+# use TestLib;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More tests => 10;

Removed.

Also, my colleague Greg provided an off-list patch, and
I've incorporated his suggested modifications into this version.
Accordingly, I've listed him as a co-author.

The C code has been run through pgindent again.

This v5 depends on v23 of the skip xid patch in [1].

[1]: /messages/by-id/CAD21AoA5jupM6O=pYsyfaxQ1aMX-en8=QNgpW6KfXsg7_CS0CQ@mail.gmail.com

Best Regards,
Takamichi Osumi

Attachments:

v5-0001-Optionally-disable-subscriptions-on-error.patch (application/octet-stream)

#39osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Greg Nancarrow (#37)
RE: Optionally automatically disable logical replication subscriptions on error

Thank you for checking the patch !

On Friday, November 12, 2021 1:49 PM Greg Nancarrow <gregn4422@gmail.com> wrote:

On Thu, Nov 11, 2021 at 8:20 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
Some comments on the v4 patch:

(1) Patch subject
I think the patch subject should say "disable" instead of "disabling":
Optionally disable subscriptions on error

Fixed.

doc/src/sgml/ref/create_subscription.sgml
(2) spelling mistake
+          if replicating data from the publisher triggers non-transicent errors

non-transicent -> non-transient

Fixed.

(I notice Vignesh also pointed this out)

src/backend/replication/logical/worker.c
(3) calling geterrcode()
The new IsSubscriptionDisablingError() function calls geterrcode().
According to the function comment for geterrcode(), it is only intended for use
in error callbacks.
Instead of calling geterrcode(), couldn't the ErrorData from PG_CATCH block be
passed to IsSubscriptionDisablingError() instead (from which it can get the
sqlerrcode)?

(4) DisableSubscriptionOnError
DisableSubscriptionOnError() is again calling MemoryContextSwitchTo() and
CopyErrorData().
I think the ErrorData from the PG_CATCH block could simply be passed to
DisableSubscriptionOnError() instead of the memory context, and the existing
MemoryContextSwitchTo() and CopyErrorData() calls could be removed from it.

AFAICS, applying (3) and (4) above would make the code a lot cleaner.

Fixed.

The updated patch is shared in [1].

[1]: /messages/by-id/TYCPR01MB8373771371B31E1E6CC74B0AED999@TYCPR01MB8373.jpnprd01.prod.outlook.com

Best Regards,
Takamichi Osumi

#40Greg Nancarrow
Greg Nancarrow
gregn4422@gmail.com
In reply to: osumi.takamichi@fujitsu.com (#38)
Re: Optionally automatically disable logical replication subscriptions on error

On Tue, Nov 16, 2021 at 6:53 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

This v5 depends on v23 skip xid in [1].

A minor comment:

doc/src/sgml/ref/alter_subscription.sgml
(1) disable_on_err?

+ <literal>disable_on_err</literal>.

This doc update names the new parameter as "disable_on_err" instead of
"disable_on_error".
Also "disable_on_err" appears in a couple of the test case comments.

Regards,
Greg Nancarrow
Fujitsu Australia

#41osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Greg Nancarrow (#40)
1 attachment(s)
RE: Optionally automatically disable logical replication subscriptions on error

On Thursday, November 18, 2021 2:08 PM Greg Nancarrow <gregn4422@gmail.com> wrote:

A minor comment:

Thanks for your comments !

doc/src/sgml/ref/alter_subscription.sgml
(1) disable_on_err?

+ <literal>disable_on_err</literal>.

This doc update names the new parameter as "disable_on_err" instead of
"disable_on_error".
Also "disable_on_err" appears in a couple of the test case comments.

Fixed all 3 places.

At the same time, I renamed one function
from IsSubscriptionDisablingError() to IsTransientError()
so that the name expresses what it checks.
Of course, the meaning of the true/false return value
is reversed by this rename, but
it makes the function more general.
Its comments were updated accordingly.

This version also depends on v23 of the skip xid patch [1].

[1]: /messages/by-id/CAD21AoA5jupM6O=pYsyfaxQ1aMX-en8=QNgpW6KfXsg7_CS0CQ@mail.gmail.com

Best Regards,
Takamichi Osumi

Attachments:

v6-0001-Optionally-disable-subscriptions-on-error.patch (application/octet-stream)

#42vignesh C
vignesh C
vignesh21@gmail.com
In reply to: osumi.takamichi@fujitsu.com (#41)
Re: Optionally automatically disable logical replication subscriptions on error

On Thu, Nov 18, 2021 at 12:52 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

On Thursday, November 18, 2021 2:08 PM Greg Nancarrow <gregn4422@gmail.com> wrote:

A minor comment:

Thanks for your comments !

doc/src/sgml/ref/alter_subscription.sgml
(1) disable_on_err?

+ <literal>disable_on_err</literal>.

This doc update names the new parameter as "disable_on_err" instead of
"disable_on_error".
Also "disable_on_err" appears in a couple of the test case comments.

Fixed all 3 places.

At the same time, I renamed one function
from IsSubscriptionDisablingError() to IsTransientError()
so that the name expresses what it checks.
Of course, the meaning of the true/false return value
is reversed by this rename, but
it makes the function more general.
Its comments were updated accordingly.

This version also depends on the v23 of skip xid [1]

Few comments:
1) Changes to handle pg_dump are missing. It should be done in
dumpSubscription and getSubscriptions

2) "And" is missing
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -201,8 +201,8 @@ ALTER SUBSCRIPTION <replaceable
class="parameter">name</replaceable> RENAME TO <
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
       <literal>synchronous_commit</literal>,
-      <literal>binary</literal>, and
-      <literal>streaming</literal>.
+      <literal>binary</literal>,<literal>streaming</literal>
+      <literal>disable_on_error</literal>.
Should be:
-      <literal>binary</literal>, and
-      <literal>streaming</literal>.
+      <literal>binary</literal>,<literal>streaming</literal>, and
+      <literal>disable_on_error</literal>.
3) Should we change this :
+          Specifies whether the subscription should be automatically disabled
+          if replicating data from the publisher triggers non-transient errors
+          such as referential integrity or permissions errors. The default is
+          <literal>false</literal>.
to:
+          Specifies whether the subscription should be automatically disabled
+          while replicating data from the publisher triggers non-transient errors
+          such as referential integrity, permissions errors, etc. The default is
+          <literal>false</literal>.

Regards,
Vignesh

#43osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: vignesh C (#42)
1 attachment(s)
RE: Optionally automatically disable logical replication subscriptions on error

On Monday, November 22, 2021 3:53 PM vignesh C <vignesh21@gmail.com> wrote:

Few comments:

Thank you so much for your review !

1) Changes to handle pg_dump are missing. It should be done in
dumpSubscription and getSubscriptions

Fixed.

2) "And" is missing
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -201,8 +201,8 @@ ALTER SUBSCRIPTION <replaceable
class="parameter">name</replaceable> RENAME TO <
information.  The parameters that can be altered
are <literal>slot_name</literal>,
<literal>synchronous_commit</literal>,
-      <literal>binary</literal>, and
-      <literal>streaming</literal>.
+      <literal>binary</literal>,<literal>streaming</literal>
+      <literal>disable_on_error</literal>.
Should be:
-      <literal>binary</literal>, and
-      <literal>streaming</literal>.
+      <literal>binary</literal>,<literal>streaming</literal>, and
+      <literal>disable_on_error</literal>.

Fixed.

3) Should we change this :
+          Specifies whether the subscription should be automatically disabled
+          if replicating data from the publisher triggers non-transient errors
+          such as referential integrity or permissions errors. The default is
+          <literal>false</literal>.
to:
+          Specifies whether the subscription should be automatically disabled
+          while replicating data from the publisher triggers non-transient errors
+          such as referential integrity, permissions errors, etc. The default is
+          <literal>false</literal>.

I prefer the previous description. The
"disable_on_error" option takes effect on even a single error.
If we use "while", the nuance would be that
we keep disabling a subscription more than once.
That happens only when a user re-enables
the subscription without resolving the non-transient error,
which sounds a bit unnatural. So I'd like to keep the previous description.
If you are not satisfied with this, kindly let me know.

This v7 uses v26 of the skip xid patch [1].

[1]: /messages/by-id/CAD21AoDNe_O+CPucd_jQPu3gGGaCLNP+J_kSPNecTdAM8HFPww@mail.gmail.com

Best Regards,
Takamichi Osumi

Attachments:

v7-0001-Optionally-disable-subscriptions-on-error.patch (application/octet-stream)

#44vignesh C
vignesh C
vignesh21@gmail.com
In reply to: osumi.takamichi@fujitsu.com (#43)
Re: Optionally automatically disable logical replication subscriptions on error

On Fri, Nov 26, 2021 at 8:06 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

On Monday, November 22, 2021 3:53 PM vignesh C <vignesh21@gmail.com> wrote:

Few comments:

Thank you so much for your review !

1) Changes to handle pg_dump are missing. It should be done in
dumpSubscription and getSubscriptions

Fixed.

2) "And" is missing
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -201,8 +201,8 @@ ALTER SUBSCRIPTION <replaceable
class="parameter">name</replaceable> RENAME TO <
information.  The parameters that can be altered
are <literal>slot_name</literal>,
<literal>synchronous_commit</literal>,
-      <literal>binary</literal>, and
-      <literal>streaming</literal>.
+      <literal>binary</literal>,<literal>streaming</literal>
+      <literal>disable_on_error</literal>.
Should be:
-      <literal>binary</literal>, and
-      <literal>streaming</literal>.
+      <literal>binary</literal>,<literal>streaming</literal>, and
+      <literal>disable_on_error</literal>.

Fixed.

3) Should we change this :
+          Specifies whether the subscription should be automatically disabled
+          if replicating data from the publisher triggers non-transient errors
+          such as referential integrity or permissions errors. The default is
+          <literal>false</literal>.
to:
+          Specifies whether the subscription should be automatically disabled
+          while replicating data from the publisher triggers non-transient errors
+          such as referential integrity, permissions errors, etc. The default is
+          <literal>false</literal>.

I prefer the previous description. The
"disable_on_error" option takes effect on even a single error.
If we use "while", the nuance would be that
we keep disabling a subscription more than once.
That happens only when a user re-enables
the subscription without resolving the non-transient error,
which sounds a bit unnatural. So I'd like to keep the previous description.
If you are not satisfied with this, kindly let me know.

This v7 uses v26 of skip xid patch [1]

Thanks for the updated patch. A few comments:
1) Since this function is used only from 027_disable_on_error and not
used by others, this can be moved to 027_disable_on_error:
+sub wait_for_subscriptions
+{
+       my ($self, $dbname, @subscriptions) = @_;
+
+       # Unique-ify the subscriptions passed by the caller
+       my %unique = map { $_ => 1 } @subscriptions;
+       my @unique = sort keys %unique;
+       my $unique_count = scalar(@unique);
+
+       # Construct a SQL list from the unique subscription names
+       my @escaped = map { s/'/''/g; s/\\/\\\\/g; $_ } @unique;
+       my $sublist = join(', ', map { "'$_'" } @escaped);
+
+       my $polling_sql = qq(
+               SELECT COUNT(1) = $unique_count FROM
+                       (SELECT s.oid
+                               FROM pg_catalog.pg_subscription s
+                               LEFT JOIN pg_catalog.pg_subscription_rel sr
+                               ON sr.srsubid = s.oid
+                               WHERE (sr IS NULL OR sr.srsubstate IN ('s', 'r'))
+                                 AND s.subname IN ($sublist)
+                                 AND s.subenabled IS TRUE
+                        UNION
+                        SELECT s.oid
+                               FROM pg_catalog.pg_subscription s
+                               WHERE s.subname IN ($sublist)
+                                 AND s.subenabled IS FALSE
+                       ) AS synced_or_disabled
+               );
+       return $self->poll_query_until($dbname, $polling_sql);
+}
2) The empty line after comment is not required, it can be removed
+# Create non-unique data in both schemas on the publisher.
+#
+for $schema (@schemas)
+{
3) Similarly it can be changed across the file
+# Wait for the initial subscription synchronizations to finish or fail.
+#
+$node_subscriber->wait_for_subscriptions('postgres', @schemas)
+       or die "Timed out while waiting for subscriber to synchronize data";
+# Enter unique data for both schemas on the publisher.  This should succeed on
+# the publisher node, and not cause any additional problems on the subscriber
+# side either, though disabled subscription "s1" should not replicate anything.
+#
+for $schema (@schemas)
4) Since subid is used only at one place, no need of subid variable,
you could replace subid with subform->oid in LockSharedObject
+       Datum           values[Natts_pg_subscription];
+       HeapTuple       tup;
+       Oid                     subid;
+       Form_pg_subscription subform;
+       subid = subform->oid;
+       LockSharedObject(SubscriptionRelationId, subid, 0, AccessExclusiveLock);
5) "permissions errors" should be "permission errors"
+          Specifies whether the subscription should be automatically disabled
+          if replicating data from the publisher triggers non-transient errors
+          such as referential integrity or permissions errors. The default is
+          <literal>false</literal>.
+         </para>

Regards,
Vignesh

#45Greg Nancarrow
Greg Nancarrow
gregn4422@gmail.com
In reply to: osumi.takamichi@fujitsu.com (#43)
Re: Optionally automatically disable logical replication subscriptions on error

On Sat, Nov 27, 2021 at 1:36 AM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

This v7 uses v26 of skip xid patch [1]

This patch no longer applies on the latest source.
Also, the patch is missing an update to doc/src/sgml/catalogs.sgml,
for the new "subdisableonerr" column of pg_subscription.

Regards,
Greg Nancarrow
Fujitsu Australia

#46osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Greg Nancarrow (#45)
1 attachment(s)
RE: Optionally automatically disable logical replication subscriptions on error

On Tuesday, November 30, 2021 1:10 PM Greg Nancarrow <gregn4422@gmail.com> wrote:

On Sat, Nov 27, 2021 at 1:36 AM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

This v7 uses v26 of skip xid patch [1]

This patch no longer applies on the latest source.
Also, the patch is missing an update to doc/src/sgml/catalogs.sgml, for the
new "subdisableonerr" column of pg_subscription.

Thanks for your review !

Fixed the documentation accordingly. Further,
this comment prompted some more refactoring,
since I had written some of the internal code related to
'disable_on_error' in an inconsistent order.
I fixed this by placing the patch's code
after that of the 'two_phase' subscription option as much as possible.

I also ran both pgindent and pgperltidy.

Now, I'll share v8, which applies to PG
at a commit after 8d74fc9 (pg_stat_subscription_workers).

Best Regards,
Takamichi Osumi

Attachments:

v8-0001-Optionally-disable-subscriptions-on-error.patch (application/octet-stream)

#47osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: vignesh C (#44)
RE: Optionally automatically disable logical replication subscriptions on error

On Monday, November 29, 2021 2:38 PM vignesh C <vignesh21@gmail.com>

Thanks for the updated patch. A few comments:

Thank you for your review !

1) Since this function is used only from 027_disable_on_error and not used by
others, this can be moved to 027_disable_on_error:
+sub wait_for_subscriptions
+{
+       my ($self, $dbname, @subscriptions) = @_;
+
+       # Unique-ify the subscriptions passed by the caller
+       my %unique = map { $_ => 1 } @subscriptions;
+       my @unique = sort keys %unique;
+       my $unique_count = scalar(@unique);
+
+       # Construct a SQL list from the unique subscription names
+       my @escaped = map { s/'/''/g; s/\\/\\\\/g; $_ } @unique;
+       my $sublist = join(', ', map { "'$_'" } @escaped);
+
+       my $polling_sql = qq(
+               SELECT COUNT(1) = $unique_count FROM
+                       (SELECT s.oid
+                               FROM pg_catalog.pg_subscription s
+                               LEFT JOIN pg_catalog.pg_subscription_rel sr
+                               ON sr.srsubid = s.oid
+                               WHERE (sr IS NULL OR sr.srsubstate IN ('s', 'r'))
+                                 AND s.subname IN ($sublist)
+                                 AND s.subenabled IS TRUE
+                        UNION
+                        SELECT s.oid
+                               FROM pg_catalog.pg_subscription s
+                               WHERE s.subname IN ($sublist)
+                                 AND s.subenabled IS FALSE
+                       ) AS synced_or_disabled
+               );
+       return $self->poll_query_until($dbname, $polling_sql);
+}

Fixed.

2) The empty line after comment is not required, it can be removed
+# Create non-unique data in both schemas on the publisher.
+#
+for $schema (@schemas)
+{

Fixed.

3) Similarly it can be changed across the file
+# Wait for the initial subscription synchronizations to finish or fail.
+#
+$node_subscriber->wait_for_subscriptions('postgres', @schemas)
+       or die "Timed out while waiting for subscriber to synchronize data";
+# Enter unique data for both schemas on the publisher.  This should succeed on
+# the publisher node, and not cause any additional problems on the subscriber
+# side either, though disabled subscription "s1" should not replicate anything.
+#
+for $schema (@schemas)

Fixed.

4) Since subid is used only at one place, no need of subid variable, you could
replace subid with subform->oid in LockSharedObject
+       Datum           values[Natts_pg_subscription];
+       HeapTuple       tup;
+       Oid                     subid;
+       Form_pg_subscription subform;
+       subid = subform->oid;
+       LockSharedObject(SubscriptionRelationId, subid, 0, AccessExclusiveLock);

Fixed.

5) "permissions errors" should be "permission errors"
+          Specifies whether the subscription should be automatically disabled
+          if replicating data from the publisher triggers non-transient errors
+          such as referential integrity or permissions errors. The default is
+          <literal>false</literal>.
+         </para>

Fixed.

The new patch v8 is shared in [1].

[1]: /messages/by-id/TYCPR01MB83735AA021E0F614A3AB3221ED679@TYCPR01MB8373.jpnprd01.prod.outlook.com

Best Regards,
Takamichi Osumi

#48vignesh C
vignesh C
vignesh21@gmail.com
In reply to: osumi.takamichi@fujitsu.com (#46)
Re: Optionally automatically disable logical replication subscriptions on error

On Tue, Nov 30, 2021 at 5:34 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

On Tuesday, November 30, 2021 1:10 PM Greg Nancarrow <gregn4422@gmail.com> wrote:

On Sat, Nov 27, 2021 at 1:36 AM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

This v7 uses v26 of skip xid patch [1]

This patch no longer applies on the latest source.
Also, the patch is missing an update to doc/src/sgml/catalogs.sgml, for the
new "subdisableonerr" column of pg_subscription.

Thanks for your review !

Fixed the documentation accordingly. Further,
this comment prompted some more refactoring,
since I had written some of the internal code related to
'disable_on_error' in an inconsistent order.
I fixed this by placing the patch's code
after that of the 'two_phase' subscription option as much as possible.

I also ran both pgindent and pgperltidy.

Now, I'll share v8, which applies to PG
at a commit after 8d74fc9 (pg_stat_subscription_workers).

Thanks for the updated patch, few small comments:
1) This should be changed:
+       <structfield>subdisableonerr</structfield> <type>bool</type>
+      </para>
+      <para>
+       If true, the subscription will be disabled when subscription
+       worker detects an error
+      </para></entry>
+     </row>
to:
+       <structfield>subdisableonerr</structfield> <type>bool</type>
+      </para>
+      <para>
+       If true, the subscription will be disabled when subscription
+       worker detects non-transient errors
+      </para></entry>
+     </row>
2) "Disable On Err" can be changed to "Disable On Error"
+                      ", subtwophasestate AS \"%s\"\n"
+                      ", subdisableonerr AS \"%s\"\n",
+                      gettext_noop("Two phase commit"),
+                      gettext_noop("Disable On Err"));

3) Can add a line in the commit message saying "Bump catalog version."
as the patch involves changing the catalog.

4) This prototype is not required, since the function is called after
the function definition:
static inline void set_apply_error_context_xact(TransactionId xid, TimestampTz ts);
static inline void reset_apply_error_context_info(void);
+static bool IsTransientError(ErrorData *edata);

5) we could use the new style here:
+       ereport(LOG,
+                       (errmsg("logical replication subscription \"%s\" will be disabled due to error: %s",
+                                       MySubscription->name, edata->message)));
change it to:
+       ereport(LOG,
+                       errmsg("logical replication subscription \"%s\" will be disabled due to error: %s",
+                                       MySubscription->name, edata->message));

Similarly it can be changed in the other ereports added.

Regards,
Vignesh

#49osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: vignesh C (#48)
1 attachment(s)
RE: Optionally automatically disable logical replication subscriptions on error

On Wednesday, December 1, 2021 3:02 PM vignesh C <vignesh21@gmail.com> wrote:

On Tue, Nov 30, 2021 at 5:34 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

On Tuesday, November 30, 2021 1:10 PM Greg Nancarrow <gregn4422@gmail.com> wrote:

On Sat, Nov 27, 2021 at 1:36 AM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

This v7 uses v26 of skip xid patch [1]

This patch no longer applies on the latest source.
Also, the patch is missing an update to doc/src/sgml/catalogs.sgml,
for the new "subdisableonerr" column of pg_subscription.

Thanks for your review !

Fixed the documentation accordingly. Further, this comment invoked
some more refactoring of codes since I wrote some internal codes
related to 'disable_on_error' in an inconsistent order.
I fixed this by keeping patch's codes
after that of 'two_phase' subscription option as much as possible.

I also conducted both pgindent and pgperltidy.

Now, I'll share the v8 that uses PG
whose commit id is after 8d74fc9 (pg_stat_subscription_workers).

Thanks for the updated patch. A few small comments:

I appreciate your check.

1) This should be changed:
+       <structfield>subdisableonerr</structfield> <type>bool</type>
+      </para>
+      <para>
+       If true, the subscription will be disabled when subscription
+       worker detects an error
+      </para></entry>
+     </row>
to:
+       <structfield>subdisableonerr</structfield> <type>bool</type>
+      </para>
+      <para>
+       If true, the subscription will be disabled when subscription
+       worker detects non-transient errors
+      </para></entry>
+     </row>

Fixed. Actually, the documentation has no clear definition of what
"non-transient" means, so I added some words to your suggestion
to make it easier for users to understand.

2) "Disable On Err" can be changed to "Disable On Error"
+                                                         ", subtwophasestate AS \"%s\"\n"
+                                                         ", subdisableonerr AS \"%s\"\n",
+                                                         gettext_noop("Two phase commit"),
+                                                         gettext_noop("Disable On Err"));

Fixed.

3) Can add a line in the commit message saying "Bump catalog version."
as the patch involves changing the catalog.

Hmm, let me postpone this fix until the final version.
The catalog version is frequently bumped by other commits,
so including it in the middle of development can cause
conflicts when my patch is applied to PG,
which may discourage other reviewers from reviewing it.

4) This prototype is not required, since the function is called after the function
definition:
static inline void set_apply_error_context_xact(TransactionId xid, TimestampTz ts);
static inline void reset_apply_error_context_info(void);
+static bool IsTransientError(ErrorData *edata);

Fixed.

5) we could use the new style here:
+       ereport(LOG,
+                       (errmsg("logical replication subscription \"%s\" will be disabled due to error: %s",
+                                       MySubscription->name, edata->message)));
change it to:
+       ereport(LOG,
+                       errmsg("logical replication subscription \"%s\" will be disabled due to error: %s",
+                                       MySubscription->name, edata->message));

Similarly it can be changed in the other ereports added.

Removed the unnecessary parentheses.

Best Regards,
Takamichi Osumi

Attachments:

v9-0001-Optionally-disable-subscriptions-on-error.patchapplication/octet-stream; name=v9-0001-Optionally-disable-subscriptions-on-error.patch
#50Amit Kapila
Amit Kapila
amit.kapila16@gmail.com
In reply to: osumi.takamichi@fujitsu.com (#49)
Re: Optionally automatically disable logical replication subscriptions on error

On Wed, Dec 1, 2021 at 5:55 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

On Wednesday, December 1, 2021 3:02 PM vignesh C <vignesh21@gmail.com> wrote:

On Tue, Nov 30, 2021 at 5:34 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

3) Can add a line in the commit message saying "Bump catalog version."
as the patch involves changing the catalog.

Hmm, let me postpone this fix till the final version.
The catalog version gets easily updated by other patch commits
and including it in the middle of development can become
cause of conflicts of my patch when applied to the PG,
which is possible to make other reviewers stop reviewing.

Vignesh seems to be suggesting just changing the commit message, not
the actual code. This is sort of a reminder to the committer to change
the catversion before pushing the patch. So that shouldn't cause any
conflicts while applying your patch.

--
With Regards,
Amit Kapila.

#51osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Amit Kapila (#50)
1 attachment(s)
RE: Optionally automatically disable logical replication subscriptions on error

On Wednesday, December 1, 2021 10:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Dec 1, 2021 at 5:55 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

On Wednesday, December 1, 2021 3:02 PM vignesh C <vignesh21@gmail.com> wrote:

On Tue, Nov 30, 2021 at 5:34 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

3) Can add a line in the commit message saying "Bump catalog version."
as the patch involves changing the catalog.

Hmm, let me postpone this fix till the final version.
The catalog version gets easily updated by other patch commits and
including it in the middle of development can become cause of
conflicts of my patch when applied to the PG, which is possible to
make other reviewers stop reviewing.

Vignesh seems to be suggesting just changing the commit message, not the
actual code. This is sort of a reminder to the committer to change the catversion
before pushing the patch. So that shouldn't cause any conflicts while applying
your patch.

Ah, sorry for my misunderstanding.
Updated the patch to include the notification.

Best Regards,
Takamichi Osumi

Attachments:

v10-0001-Optionally-disable-subscriptions-on-error.patchapplication/octet-stream; name=v10-0001-Optionally-disable-subscriptions-on-error.patch
#52Greg Nancarrow
Greg Nancarrow
gregn4422@gmail.com
In reply to: osumi.takamichi@fujitsu.com (#51)
Re: Optionally automatically disable logical replication subscriptions on error

On Thu, Dec 2, 2021 at 12:05 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

Updated the patch to include the notification.

For the catalogs.sgml update, I was thinking that the following
wording might sound a bit better:

+       If true, the subscription will be disabled when a subscription
+       worker detects non-transient errors (e.g. duplication error)
+       that require user intervention to resolve.

What do you think?

Regards,
Greg Nancarrow
Fujitsu Australia

#53Amit Kapila
Amit Kapila
amit.kapila16@gmail.com
In reply to: osumi.takamichi@fujitsu.com (#51)
Re: Optionally automatically disable logical replication subscriptions on error

On Thu, Dec 2, 2021 at 6:35 AM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

On Wednesday, December 1, 2021 10:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
Updated the patch to include the notification.

The patch disables the subscription for non-transient errors. I am not
sure if we can easily make the call to decide whether any particular
error is transient or not. For example, DISK_FULL or OUT_OF_MEMORY
might not rectify itself. Why not just allow disabling the
subscription on any error? Then the user can check the error
either in the view or in the logs and decide whether to re-enable the
subscription or do something first (like freeing disk space or
fixing the network).

The other problem I see with this transient error stuff is maintaining
the list of error codes that we think are transient. I think we need a
discussion for each of the error_codes we are listing now and whatever
new error_code we add in the future which doesn't seem like a good
idea.

I think the code to deal with apply worker errors and then disable the
subscription has some flaws. Say, if disabling the subscription itself
leads to another error, then I think the original error won't be
reported. Can't we simply emit the error via EmitErrorReport, then
do AbortOutOfAnyTransaction, FlushErrorState, and any other memory
context cleanup if required, and then disable the subscription after
coming out of the catch block?

--
With Regards,
Amit Kapila.
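
[Editor's sketch] The ordering proposed above (report the original error first, clean up, and only then disable the subscription outside the catch block) can be illustrated with a small standalone simulation. This is not PostgreSQL code: setjmp/longjmp stands in for PG_TRY/PG_CATCH, and the Worker struct, raise_error(), apply_change() and run_worker() are hypothetical names invented for this illustration.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for worker state; not a PostgreSQL structure. */
typedef struct Worker
{
    bool in_transaction;    /* is a transaction currently open? */
    bool sub_enabled;       /* is the subscription enabled? */
    bool disable_on_error;  /* the option discussed in this thread */
    char reported[128];     /* last error written to the "log" */
} Worker;

static jmp_buf error_jump;
static const char *pending_error;

/* Simulates throwing an error: remember the message and unwind. */
static void
raise_error(const char *msg)
{
    pending_error = msg;
    longjmp(error_jump, 1);
}

/* Simulates applying one change inside a transaction. */
static void
apply_change(Worker *w, bool should_fail)
{
    w->in_transaction = true;
    if (should_fail)
        raise_error("duplicate key value violates unique constraint");
    w->in_transaction = false;  /* commit */
}

/* Returns true if the change was applied without error. */
static bool
run_worker(Worker *w, bool should_fail)
{
    volatile bool error_recovery_done = false;

    assert(w != NULL);

    if (setjmp(error_jump) == 0)
        apply_change(w, should_fail);
    else
    {
        /* 1. Emit the original error first, so it cannot be lost. */
        snprintf(w->reported, sizeof(w->reported), "%s", pending_error);
        /* 2. Abort any active transaction. */
        w->in_transaction = false;
        /* 3. Reset the error state. */
        pending_error = NULL;
        error_recovery_done = true;
    }

    /* 4. Only after leaving the catch block, disable the subscription. */
    if (error_recovery_done && w->disable_on_error)
        w->sub_enabled = false;

    return !error_recovery_done;
}
```

Because the original message is recorded before the disabling step runs, a second failure while disabling could no longer hide it, which is the point of the suggested ordering.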

#54osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Amit Kapila (#53)
RE: Optionally automatically disable logical replication subscriptions on error

On Thursday, December 2, 2021 1:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Dec 2, 2021 at 6:35 AM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

On Wednesday, December 1, 2021 10:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Updated the patch to include the notification.

The patch disables the subscription for non-transient errors. I am not sure if we
can easily make the call to decide whether any particular error is transient or
not. For example, DISK_FULL or OUT_OF_MEMORY might not rectify itself.
Why not just allow to disable the subscription on any error? And then let the
user check the error either in view or logs and decide whether it would like to
enable the subscription or do something before it (like making space in disk, or
fixing the network).

Agreed. I'll make any error trigger the feature
in the next version.

The other problem I see with this transient error stuff is maintaining the list of
error codes that we think are transient. I think we need a discussion for each of
the error_codes we are listing now and whatever new error_code we add in the
future which doesn't seem like a good idea.

This is also true. The maintenance cost of my current implementation
didn't sound cheap.

I think the code to deal with apply worker errors and then disable the
subscription has some flaws. Say, while disabling the subscription if it leads to
another error then I think the original error won't be reported. Can't we simply
emit the error via EmitErrorReport and then do AbortOutOfAnyTransaction,
FlushErrorState, and any other memory context clean up if required and then
disable the subscription after coming out of catch?

You are right. I'll fix related parts accordingly.

Best Regards,
Takamichi Osumi

#55osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: osumi.takamichi@fujitsu.com (#54)
1 attachment(s)
RE: Optionally automatically disable logical replication subscriptions on error

On Thursday, December 2, 2021 4:41 PM I wrote:

On Thursday, December 2, 2021 1:49 PM Amit Kapila
<amit.kapila16@gmail.com> wrote:

On Thu, Dec 2, 2021 at 6:35 AM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

On Wednesday, December 1, 2021 10:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Updated the patch to include the notification.

The patch disables the subscription for non-transient errors. I am not
sure if we can easily make the call to decide whether any particular
error is transient or not. For example, DISK_FULL or OUT_OF_MEMORY
might not rectify itself. Why not just allow to disable the subscription
on any error? And then
let the user check the error either in view or logs and decide whether
it would like to enable the subscription or do something before it
(like making space in disk, or fixing the network).

Agreed. I'll treat any errors as the trigger of the feature in the next version.

The other problem I see with this transient error stuff is maintaining
the list of error codes that we think are transient. I think we need a
discussion for each of the error_codes we are listing now and whatever
new error_code we add in the future which doesn't seem like a good idea.

This is also true. The maintenance cost of my current implementation didn't
sound cheap.

I think the code to deal with apply worker errors and then disable the
subscription has some flaws. Say, while disabling the subscription if
it leads to another error then I think the original error won't be
reported. Can't we simply emit the error via EmitErrorReport and then
do AbortOutOfAnyTransaction, FlushErrorState, and any other memory
context clean up if required and then disable the subscription after coming
out of catch?

You are right. I'll fix related parts accordingly.

Hi, I've made a new patch v11 that incorporates the suggestions described above.

There are several notes to share regarding the v11 modifications.

1. Modified the commit message a bit.

2. DisableSubscriptionOnError() doesn't return ErrorData anymore,
since the error message is now emitted in the error recovery area
and the function's purpose has become purely to run a transaction to disable
the subscription.

3. In DisableSubscriptionOnError(), v10 rethrew the error if the disable_on_error
flag became false in the interim, but v11 just closes the transaction and
returns.

4. If a table sync worker detects an error during synchronization
and needs to disable the subscription, the worker disables it and just exits via proc_exit.
The processing after disabling the subscription didn't look necessary to me
for a disabled subscription.

5. Only when table synchronization succeeds is it necessary to
allocate the slot name in a long-lived context, after the table synchronization in
ApplyWorkerMain(). Otherwise, we'll see a junk value in syncslotname,
because it is the return value of LogicalRepSyncTableStart().

6. There are 3 places for error recovery in ApplyWorkerMain().
They might look similar, but I didn't make a unified function for them.
They differ slightly from each other, and I felt that
grouping them into one function call would reduce readability.

7. In v11, I covered the case where the apply worker fails with
apply_error_callback_arg.command == 0, as one path to disable the subscription,
in order to cover all errors.

8. I changed one flag name from 'disable_subscription' to 'did_error'
in ApplyWorkerMain().

9. All changes in this version are C code and were checked by pgindent.

Best Regards,
Takamichi Osumi

Attachments:

v11-0001-Optionally-disable-subscriptions-on-error.patchapplication/octet-stream; name=v11-0001-Optionally-disable-subscriptions-on-error.patch
#56Greg Nancarrow
Greg Nancarrow
gregn4422@gmail.com
In reply to: osumi.takamichi@fujitsu.com (#55)
Re: Optionally automatically disable logical replication subscriptions on error

On Sat, Dec 4, 2021 at 12:20 AM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

Hi, I've made a new patch v11 that incorporated suggestions described above.

Some review comments for the v11 patch:

doc/src/sgml/ref/create_subscription.sgml
(1) Possible wording improvement?

BEFORE:
+  Specifies whether the subscription should be automatically disabled
+  if replicating data from the publisher triggers errors. The default
+  is <literal>false</literal>.
AFTER:
+  Specifies whether the subscription should be automatically disabled
+  if any errors are detected by subscription workers during data
+  replication from the publisher. The default is <literal>false</literal>.

src/backend/replication/logical/worker.c
(2) WorkerErrorRecovery comments
Instead of:

+ * As a preparation for disabling the subscription, emit the error,
+ * handle the transaction and clean up the memory context of
+ * error. ErrorContext is reset by FlushErrorState.

why not just say:

+ Worker error recovery processing, in preparation for disabling the
+ subscription.

And then comment the function's code lines:

e.g.

/* Emit the error */
...
/* Abort any active transaction */
...
/* Reset the ErrorContext */
...

(3) DisableSubscriptionOnError return

The "if (!subform->subdisableonerr)" block should probably first call:
heap_freetuple(tup);

(regardless of the fact the only current caller will proc_exit anyway)

(4) did_error flag

I think perhaps the previously-used flag name "disable_subscription"
is better, or maybe "error_recovery_done".
Also, I think it would look better if it was set AFTER
WorkerErrorRecovery() was called.

(5) DisableSubscriptionOnError LOG message

This version of the patch removes the LOG message:

+ ereport(LOG,
+ errmsg("logical replication subscription \"%s\" will be disabled due to error: %s",
+    MySubscription->name, edata->message));

Perhaps a similar error message could be logged prior to EmitErrorReport()?

e.g.
"logical replication subscription \"%s\" will be disabled due to an error"

Regards,
Greg Nancarrow
Fujitsu Australia

#57Mark Dilger
Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Amit Kapila (#53)
Re: Optionally automatically disable logical replication subscriptions on error

On Dec 1, 2021, at 8:48 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

The patch disables the subscription for non-transient errors. I am not
sure if we can easily make the call to decide whether any particular
error is transient or not. For example, DISK_FULL or OUT_OF_MEMORY
might not rectify itself. Why not just allow to disable the
subscription on any error? And then let the user check the error
either in view or logs and decide whether it would like to enable the
subscription or do something before it (like making space in disk, or
fixing the network).

The original idea of the patch, back when I first wrote and proposed it, was to remove the *absurdity* of retrying a transaction which, in the absence of human intervention, was guaranteed to simply fail again ad infinitum. Retrying in the face of resource errors is not *absurd* even though it might fail again ad infinitum. The reason is that there is at least a chance that the situation will clear up without human intervention.

The other problem I see with this transient error stuff is maintaining
the list of error codes that we think are transient. I think we need a
discussion for each of the error_codes we are listing now and whatever
new error_code we add in the future which doesn't seem like a good
idea.

A reasonable rule might be: "the subscription will be disabled if the server can determine that retries cannot possibly succeed without human intervention." We shouldn't need to categorize all error codes perfectly, as long as we're conservative. What I propose is similar to how we determine whether to mark a function leakproof; we don't have to mark all leakproof functions as such, we just can't mark one as such if it is not.

If we're going to debate the error codes, I think we would start with an empty list, and add to the list on sufficient analysis.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
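
[Editor's sketch] The conservative rule proposed above (disable only when an error is known to need human intervention, and otherwise keep retrying) could look roughly like the following. This is a minimal standalone sketch, not the patch's IsTransientError(); the function name retry_is_hopeless() and the two list entries are assumptions chosen purely for illustration. The message above suggests starting from an empty list and adding codes only after analysis.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical list of SQLSTATEs judged to need human intervention.
 * The two entries are illustrative assumptions only; they have not been
 * vetted by anyone in this thread.
 */
static const char *const needs_intervention[] = {
    "23505",    /* unique_violation */
    "42501",    /* insufficient_privilege */
};

/*
 * Return true only when the SQLSTATE is on the list. Unknown codes fall
 * through to false, so the worker keeps retrying by default. As with
 * leakproof marking, leaving a code off the list is always safe.
 */
static bool
retry_is_hopeless(const char *sqlstate)
{
    size_t n = sizeof(needs_intervention) / sizeof(needs_intervention[0]);

    assert(sqlstate != NULL);
    for (size_t i = 0; i < n; i++)
    {
        if (strcmp(needs_intervention[i], sqlstate) == 0)
            return true;
    }
    return false;
}
```

With this shape, resource errors such as disk_full (53100) or a deadlock (40P01) are never on the list unless someone deliberately adds them, so they continue to be retried.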

#58osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Mark Dilger (#57)
RE: Optionally automatically disable logical replication subscriptions on error

On Monday, December 6, 2021 1:38 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:

On Dec 1, 2021, at 8:48 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

The patch disables the subscription for non-transient errors. I am not
sure if we can easily make the call to decide whether any particular
error is transient or not. For example, DISK_FULL or OUT_OF_MEMORY
might not rectify itself. Why not just allow to disable the
subscription on any error? And then let the user check the error
either in view or logs and decide whether it would like to enable the
subscription or do something before it (like making space in disk, or
fixing the network).

The original idea of the patch, back when I first wrote and proposed it, was to
remove the *absurdity* of retrying a transaction which, in the absence of
human intervention, was guaranteed to simply fail again ad infinitum.
Retrying in the face of resource errors is not *absurd* even though it might fail
again ad infinitum. The reason is that there is at least a chance that the
situation will clear up without human intervention.

In my humble opinion, the original purpose of the patch was to partially remedy
the situation in which, when apply fails, the apply worker keeps going
into an infinite error loop.

In that sense, if we keep retrying on such resource errors, we fail to achieve
the purpose in some cases, because some possibility of an infinite loop remains.
Disabling the subscription on any single error excludes this possibility,
since there's no room to continue the infinite loop.

Best Regards,
Takamichi Osumi

#59osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Greg Nancarrow (#56)
1 attachment(s)
RE: Optionally automatically disable logical replication subscriptions on error

On Monday, December 6, 2021 1:16 PM Greg Nancarrow <gregn4422@gmail.com> wrote:

On Sat, Dec 4, 2021 at 12:20 AM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

Hi, I've made a new patch v11 that incorporated suggestions described above.

Some review comments for the v11 patch:

Thank you for your reviews !

doc/src/sgml/ref/create_subscription.sgml
(1) Possible wording improvement?

BEFORE:
+  Specifies whether the subscription should be automatically disabled
+ if replicating data from the publisher triggers errors. The default
+ is <literal>false</literal>.
AFTER:
+  Specifies whether the subscription should be automatically disabled
+ if any errors are detected by subscription workers during data
+ replication from the publisher. The default is <literal>false</literal>.

Fixed.

src/backend/replication/logical/worker.c
(2) WorkerErrorRecovery comments
Instead of:

+ * As a preparation for disabling the subscription, emit the error,
+ * handle the transaction and clean up the memory context of
+ * error. ErrorContext is reset by FlushErrorState.

why not just say:

+ Worker error recovery processing, in preparation for disabling the
+ subscription.

And then comment the function's code lines:

e.g.

/* Emit the error */
...
/* Abort any active transaction */
...
/* Reset the ErrorContext */
...

Agreed. Fixed.

(3) DisableSubscriptionOnError return

The "if (!subform->subdisableonerr)" block should probably first:
heap_freetuple(tup);

(regardless of the fact the only current caller will proc_exit anyway)

Fixed.

(4) did_error flag

I think perhaps the previously-used flag name "disable_subscription"
is better, or maybe "error_recovery_done".
Also, I think it would look better if it was set AFTER
WorkerErrorRecovery() was called.

Adopted error_recovery_done
and moved the places where it is set accordingly.

(5) DisableSubscriptionOnError LOG message

This version of the patch removes the LOG message:

+ ereport(LOG,
+ errmsg("logical replication subscription \"%s\" will be disabled due to error: %s",
+    MySubscription->name, edata->message));

Perhaps a similar error message could be logged prior to EmitErrorReport()?

e.g.
"logical replication subscription \"%s\" will be disabled due to an error"

Added.

I've attached the new version v12.

Best Regards,
Takamichi Osumi

Attachments:

v12-0001-Optionally-disable-subscriptions-on-error.patchapplication/octet-stream; name=v12-0001-Optionally-disable-subscriptions-on-error.patch
#60Amit Kapila
Amit Kapila
amit.kapila16@gmail.com
In reply to: Mark Dilger (#57)
Re: Optionally automatically disable logical replication subscriptions on error

On Mon, Dec 6, 2021 at 10:07 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

On Dec 1, 2021, at 8:48 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

The patch disables the subscription for non-transient errors. I am not
sure if we can easily make the call to decide whether any particular
error is transient or not. For example, DISK_FULL or OUT_OF_MEMORY
might not rectify itself. Why not just allow to disable the
subscription on any error? And then let the user check the error
either in view or logs and decide whether it would like to enable the
subscription or do something before it (like making space in disk, or
fixing the network).

The original idea of the patch, back when I first wrote and proposed it, was to remove the *absurdity* of retrying a transaction which, in the absence of human intervention, was guaranteed to simply fail again ad infinitum. Retrying in the face of resource errors is not *absurd* even though it might fail again ad infinitum. The reason is that there is at least a chance that the situation will clear up without human intervention.

The other problem I see with this transient error stuff is maintaining
the list of error codes that we think are transient. I think we need a
discussion for each of the error_codes we are listing now and whatever
new error_code we add in the future which doesn't seem like a good
idea.

A reasonable rule might be: "the subscription will be disabled if the server can determine that retries cannot possibly succeed without human intervention." We shouldn't need to categorize all error codes perfectly, as long as we're conservative. What I propose is similar to how we determine whether to mark a function leakproof; we don't have to mark all leakproof functions as such, we just can't mark one as such if it is not.

If we're going to debate the error codes, I think we would start with an empty list, and add to the list on sufficient analysis.

Yeah, an empty list is sort of what I thought was a good starting
point. I feel we should learn from real-world use cases to see if
people really want to continue retrying even after using this option.

--
With Regards,
Amit Kapila.

#61Mark Dilger
Mark Dilger
mark.dilger@enterprisedb.com
In reply to: osumi.takamichi@fujitsu.com (#58)
Re: Optionally automatically disable logical replication subscriptions on error

On Dec 5, 2021, at 10:56 PM, osumi.takamichi@fujitsu.com wrote:

In my humble opinion, the original purpose of the patch was to partially remedy
the situation in which, when apply fails, the apply worker keeps going
into an infinite error loop.

I agree.

In that sense, if we keep retrying on such resource errors, we fail to achieve
the purpose in some cases, because some possibility of an infinite loop remains.
Disabling the subscription on any single error excludes this possibility,
since there's no room to continue the infinite loop.

I don't think there is any right answer here. It's a question of policy preferences.

My concern about disabling a subscription in response to *any* error is that people may find the feature does more harm than good. Disabling the subscription in response to an occasional deadlock against other database users, or occasional resource pressure, might annoy people and lead to the feature simply not being used.

I am happy to defer to your policy preference. Thanks for your work on the patch!


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#62Greg Nancarrow
Greg Nancarrow
gregn4422@gmail.com
In reply to: Mark Dilger (#61)
Re: Optionally automatically disable logical replication subscriptions on error

On Tue, Dec 7, 2021 at 3:06 AM Mark Dilger <mark.dilger@enterprisedb.com> wrote:

My concern about disabling a subscription in response to *any* error is that people may find the feature does more harm than good. Disabling the subscription in response to an occasional deadlock against other database users, or occasional resource pressure, might annoy people and lead to the feature simply not being used.

I can understand this point of view.
It kind of suggests to me the possibility of something like a
configurable timeout (e.g. disable the subscription if the same error
has occurred for more than X minutes) or, similarly, perhaps if some
threshold has been reached (e.g. same error has occurred more than X
times), but I think that this was previously suggested by Peter Smith
and the idea wasn't looked upon all that favorably?

Regards,
Greg Nancarrow
Fujitsu Australia

#63Amit Kapila
Amit Kapila
amit.kapila16@gmail.com
In reply to: Greg Nancarrow (#62)
Re: Optionally automatically disable logical replication subscriptions on error

On Tue, Dec 7, 2021 at 5:52 AM Greg Nancarrow <gregn4422@gmail.com> wrote:

On Tue, Dec 7, 2021 at 3:06 AM Mark Dilger <mark.dilger@enterprisedb.com> wrote:

My concern about disabling a subscription in response to *any* error is that people may find the feature does more harm than good. Disabling the subscription in response to an occasional deadlock against other database users, or occasional resource pressure, might annoy people and lead to the feature simply not being used.

I can understand this point of view.
It kind of suggests to me the possibility of something like a
configurable timeout (e.g. disable the subscription if the same error
has occurred for more than X minutes) or, similarly, perhaps if some
threshold has been reached (e.g. same error has occurred more than X
times), but I think that this was previously suggested by Peter Smith
and the idea wasn't looked upon all that favorably?

I think if we are really worried about transient errors then probably
the idea "disable only if the same error has occurred more than X
times" seems preferable as compared to taking a decision on which
error_codes fall in the transient error category.

--
With Regards,
Amit Kapila.
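
[Editor's sketch] The alternative mentioned above, "disable only if the same error has occurred more than X times", amounts to tracking a streak of identical errors. The following is a minimal standalone sketch of that bookkeeping only. It is not part of the patch (later messages in the thread set this idea aside), and ErrorStreak, streak_init(), streak_record_error() and streak_record_success() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SQLSTATE_LEN 6  /* five-character SQLSTATE plus terminator */

/* Hypothetical per-subscription bookkeeping; not a PostgreSQL structure. */
typedef struct ErrorStreak
{
    char last_sqlstate[SQLSTATE_LEN];   /* code of the most recent error */
    int  count;                         /* consecutive occurrences of it */
    int  threshold;                     /* the "X" in "more than X times" */
} ErrorStreak;

static void
streak_init(ErrorStreak *s, int threshold)
{
    assert(threshold > 0);
    s->last_sqlstate[0] = '\0';
    s->count = 0;
    s->threshold = threshold;
}

/* Record one failure; return true once the subscription should be disabled. */
static bool
streak_record_error(ErrorStreak *s, const char *sqlstate)
{
    if (strncmp(s->last_sqlstate, sqlstate, SQLSTATE_LEN) == 0)
        s->count++;
    else
    {
        /* A different error starts a new streak. */
        strncpy(s->last_sqlstate, sqlstate, SQLSTATE_LEN - 1);
        s->last_sqlstate[SQLSTATE_LEN - 1] = '\0';
        s->count = 1;
    }
    return s->count > s->threshold;
}

/* A successfully applied transaction clears the streak. */
static void
streak_record_success(ErrorStreak *s)
{
    s->last_sqlstate[0] = '\0';
    s->count = 0;
}
```

With a threshold of 2, a repeated unique violation would disable the subscription on its third consecutive occurrence, while an occasional deadlock followed by a successful retry never would.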

#64Mark Dilger
Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Amit Kapila (#63)
Re: Optionally automatically disable logical replication subscriptions on error

On Dec 8, 2021, at 5:10 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think if we are really worried about transient errors then probably
the idea "disable only if the same error has occurred more than X
times" seems preferable as compared to taking a decision on which
error_codes fall in the transient error category.

No need. We can revisit this design decision in a later release cycle if the current patch's design proves problematic in the field.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#65Amit Kapila
Amit Kapila
amit.kapila16@gmail.com
In reply to: Mark Dilger (#64)
Re: Optionally automatically disable logical replication subscriptions on error

On Wed, Dec 8, 2021 at 9:22 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:

On Dec 8, 2021, at 5:10 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think if we are really worried about transient errors then probably
the idea "disable only if the same error has occurred more than X
times" seems preferable as compared to taking a decision on which
error_codes fall in the transient error category.

No need. We can revisit this design decision in a later release cycle if the current patch's design proves problematic in the field.

So, do you agree that we can disable the subscription on any error if
this parameter is set?

--
With Regards,
Amit Kapila.

#66Mark Dilger
Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Amit Kapila (#65)
Re: Optionally automatically disable logical replication subscriptions on error

On Dec 8, 2021, at 8:09 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

So, do you agree that we can disable the subscription on any error if
this parameter is set?

Yes, I think that is fine. We can commit it that way, and revisit the issue for v16 if it becomes a problem in practice.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#67vignesh C
vignesh C
vignesh21@gmail.com
In reply to: osumi.takamichi@fujitsu.com (#59)
Re: Optionally automatically disable logical replication subscriptions on error

On Mon, Dec 6, 2021 at 4:22 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

On Monday, December 6, 2021 1:16 PM Greg Nancarrow <gregn4422@gmail.com> wrote:

On Sat, Dec 4, 2021 at 12:20 AM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

Hi, I've made a new patch v11 that incorporated suggestions described above.

Some review comments for the v11 patch:

Thank you for your reviews !

doc/src/sgml/ref/create_subscription.sgml
(1) Possible wording improvement?

BEFORE:
+  Specifies whether the subscription should be automatically disabled
+ if replicating data from the publisher triggers errors. The default
+ is <literal>false</literal>.
AFTER:
+  Specifies whether the subscription should be automatically disabled
+ if any errors are detected by subscription workers during data
+ replication from the publisher. The default is <literal>false</literal>.

Fixed.

src/backend/replication/logical/worker.c
(2) WorkerErrorRecovery comments
Instead of:

+ * As a preparation for disabling the subscription, emit the error,
+ * handle the transaction and clean up the memory context of
+ * error. ErrorContext is reset by FlushErrorState.

why not just say:

+ Worker error recovery processing, in preparation for disabling the
+ subscription.

And then comment the function's code lines:

e.g.

/* Emit the error */
...
/* Abort any active transaction */
...
/* Reset the ErrorContext */
...

Agreed. Fixed.

(3) DisableSubscriptionOnError return

The "if (!subform->subdisableonerr)" block should probably first:
heap_freetuple(tup);

(regardless of the fact the only current caller will proc_exit anyway)

Fixed.

(4) did_error flag

I think perhaps the previously-used flag name "disable_subscription"
is better, or maybe "error_recovery_done".
Also, I think it would look better if it was set AFTER
WorkerErrorRecovery() was called.

Adopted error_recovery_done
and changed its places accordingly.

(5) DisableSubscriptionOnError LOG message

This version of the patch removes the LOG message:

+ ereport(LOG,
+ errmsg("logical replication subscription \"%s\" will be disabled due to error: %s",
+    MySubscription->name, edata->message));

Perhaps a similar error message could be logged prior to EmitErrorReport()?

e.g.
"logical replication subscription \"%s\" will be disabled due to an error"

Added.

I've attached the new version v12.

Thanks for the updated patch, few comments:
1) This is not required as it is not used in the caller.
+++ b/src/backend/replication/logical/launcher.c
@@ -132,6 +132,7 @@ get_subscription_list(void)
                sub->dbid = subform->subdbid;
                sub->owner = subform->subowner;
                sub->enabled = subform->subenabled;
+               sub->disableonerr = subform->subdisableonerr;
                sub->name = pstrdup(NameStr(subform->subname));
                /* We don't fill fields we are not interested in. */
2) Should this be changed:
+       <structfield>subdisableonerr</structfield> <type>bool</type>
+      </para>
+      <para>
+       If true, the subscription will be disabled when subscription
+       worker detects any errors
+      </para></entry>
+     </row>
To:
+       <structfield>subdisableonerr</structfield> <type>bool</type>
+      </para>
+      <para>
+       If true, the subscription will be disabled when subscription's
+       worker detects any errors
+      </para></entry>
+     </row>
3) The last line can be slightly adjusted to keep within 80 chars:
+          Specifies whether the subscription should be automatically disabled
+          if any errors are detected by subscription workers during data
+          replication from the publisher. The default is <literal>false</literal>.
4) Similarly this too can be handled:
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1259,7 +1259,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 -- All columns of pg_subscription except subconninfo are publicly readable.
 REVOKE ALL ON pg_subscription FROM public;
 GRANT SELECT (oid, subdbid, subname, subowner, subenabled, subbinary,
-              substream, subtwophasestate, subslotname, subsynccommit, subpublications)
+              substream, subtwophasestate, subdisableonerr, subslotname, subsynccommit, subpublications)
     ON pg_subscription TO public;
5) Since the code for disabling the subscription is common to the if and
else branches, can we move it below them:
+                       if (MySubscription->disableonerr)
+                       {
+                               WorkerErrorRecovery();
+                               error_recovery_done = true;
+                       }
+                       else
+                       {
+                               /*
+                                * Some work in error recovery work is done. Switch to the old
+                                * memory context and rethrow.
+                                */
+                               MemoryContextSwitchTo(ecxt);
+                               PG_RE_THROW();
+                       }
+               }
+               else
+               {
+                       /*
+                        * Don't miss any error, even when it's not reported to stats
+                        * collector.
+                        */
+                       if (MySubscription->disableonerr)
+                       {
+                               WorkerErrorRecovery();
+                               error_recovery_done = true;
+                       }
+                       else
+                               /* Simply rethrow because of no recovery work */
+                               PG_RE_THROW();
+               }
6) Can we move LockSharedObject below the if condition?
+       subform = (Form_pg_subscription) GETSTRUCT(tup);
+       LockSharedObject(SubscriptionRelationId, subform->oid, 0, AccessExclusiveLock);
+
+       /*
+        * We would not be here unless this subscription's disableonerr field was
+        * true when our worker began applying changes, but check whether that
+        * field has changed in the interim.
+        */
+       if (!subform->subdisableonerr)
+       {
+               /*
+                * Disabling the subscription has been done already. No need of
+                * additional work.
+                */
+               heap_freetuple(tup);
+               table_close(rel, RowExclusiveLock);
+               CommitTransactionCommand();
+               return;
+       }
+

Regards,
Vignesh

#68osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: vignesh C (#67)
1 attachment(s)
RE: Optionally automatically disable logical replication subscriptions on error

On Monday, December 13, 2021 6:57 PM vignesh C <vignesh21@gmail.com> wrote:

On Mon, Dec 6, 2021 at 4:22 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

I've attached the new version v12.

I appreciate your review.

Thanks for the updated patch, few comments:
1) This is not required as it is not used in the caller.
+++ b/src/backend/replication/logical/launcher.c
@@ -132,6 +132,7 @@ get_subscription_list(void)
sub->dbid = subform->subdbid;
sub->owner = subform->subowner;
sub->enabled = subform->subenabled;
+               sub->disableonerr = subform->subdisableonerr;
sub->name = pstrdup(NameStr(subform->subname));
/* We don't fill fields we are not interested in. */

Okay.
The comment on get_subscription_list() says that we collect and fill
only the fields related to worker start/stop, so this field isn't
needed there. Fixed.

2) Should this be changed:
+       <structfield>subdisableonerr</structfield> <type>bool</type>
+      </para>
+      <para>
+       If true, the subscription will be disabled when subscription
+       worker detects any errors
+      </para></entry>
+     </row>
To:
+       <structfield>subdisableonerr</structfield> <type>bool</type>
+      </para>
+      <para>
+       If true, the subscription will be disabled when subscription's
+       worker detects any errors
+      </para></entry>
+     </row>

I felt either was fine, so I changed it as suggested.

3) The last line can be slightly adjusted to keep within 80 chars:
+          Specifies whether the subscription should be automatically disabled
+          if any errors are detected by subscription workers during data
+          replication from the publisher. The default is
<literal>false</literal>.

Fixed.

4) Similarly this too can be handled:
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1259,7 +1259,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 -- All columns of pg_subscription except subconninfo are publicly readable.
 REVOKE ALL ON pg_subscription FROM public;
 GRANT SELECT (oid, subdbid, subname, subowner, subenabled, subbinary,
-              substream, subtwophasestate, subslotname, subsynccommit, subpublications)
+              substream, subtwophasestate, subdisableonerr, subslotname, subsynccommit, subpublications)
     ON pg_subscription TO public;

I split the line into two to make each line less than 80 chars.

5) Since the code for disabling the subscription is common to the if and
else branches, can we move it below them:
+                       if (MySubscription->disableonerr)
+                       {
+                               WorkerErrorRecovery();
+                               error_recovery_done = true;
+                       }
+                       else
+                       {
+                               /*
+                                * Some work in error recovery work is done. Switch to the old
+                                * memory context and rethrow.
+                                */
+                               MemoryContextSwitchTo(ecxt);
+                               PG_RE_THROW();
+                       }
+               }
+               else
+               {
+                       /*
+                        * Don't miss any error, even when it's not reported to stats
+                        * collector.
+                        */
+                       if (MySubscription->disableonerr)
+                       {
+                               WorkerErrorRecovery();
+                               error_recovery_done = true;
+                       }
+                       else
+                               /* Simply rethrow because of no recovery work */
+                               PG_RE_THROW();
+               }

I moved the common code below those condition branches.

6) Can we move LockSharedObject below the if condition.
+       subform = (Form_pg_subscription) GETSTRUCT(tup);
+       LockSharedObject(SubscriptionRelationId, subform->oid, 0, AccessExclusiveLock);
+
+       /*
+        * We would not be here unless this subscription's disableonerr field was
+        * true when our worker began applying changes, but check whether that
+        * field has changed in the interim.
+        */
+       if (!subform->subdisableonerr)
+       {
+               /*
+                * Disabling the subscription has been done already. No need of
+                * additional work.
+                */
+               heap_freetuple(tup);
+               table_close(rel, RowExclusiveLock);
+               CommitTransactionCommand();
+               return;
+       }
+

Fixed.

Besides all of those changes, I've removed the obsolete
comment of DisableSubscriptionOnError in v12.

Best Regards,
Takamichi Osumi

Attachments:

v13-0001-Optionally-disable-subscriptions-on-error.patch (application/octet-stream)

#69Greg Nancarrow
Greg Nancarrow
gregn4422@gmail.com
In reply to: osumi.takamichi@fujitsu.com (#68)
Re: Optionally automatically disable logical replication subscriptions on error

On Tue, Dec 14, 2021 at 4:34 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

Besides all of those changes, I've removed the obsolete
comment of DisableSubscriptionOnError in v12.

I have a few minor comments, otherwise the patch LGTM at this point:

doc/src/sgml/catalogs.sgml
(1)
Current comment says:

+       If true, the subscription will be disabled when subscription's
+       worker detects any errors

However, in create_subscription.sgml, it says "disabled if any errors
are detected by subscription workers ..."

For consistency, I think it should be:

+       If true, the subscription will be disabled when subscription
+       workers detect any errors

src/bin/psql/describe.c
(2)
I think that:

+ gettext_noop("Disable On Error"));

should be:

+ gettext_noop("Disable on error"));

for consistency with the uppercase/lowercase usage on other similar entries?
(e.g. "Two phase commit")

src/include/catalog/pg_subscription.h
(3)

+  bool subdisableonerr; /* True if apply errors should disable the
+                                       * subscription upon error */

The comment should just say "True if occurrence of apply errors should
disable the subscription"

Regards,
Greg Nancarrow
Fujitsu Australia

#70osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Greg Nancarrow (#69)
1 attachment(s)
RE: Optionally automatically disable logical replication subscriptions on error

On Thursday, December 16, 2021 2:32 PM Greg Nancarrow <gregn4422@gmail.com> wrote:

On Tue, Dec 14, 2021 at 4:34 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

Besides all of those changes, I've removed the obsolete comment of
DisableSubscriptionOnError in v12.

I have a few minor comments, otherwise the patch LGTM at this point:

Thank you for your review !

doc/src/sgml/catalogs.sgml
(1)
Current comment says:

+       If true, the subscription will be disabled when subscription's
+       worker detects any errors

However, in create_subscription.sgml, it says "disabled if any errors are
detected by subscription workers ..."

For consistency, I think it should be:

+       If true, the subscription will be disabled when subscription
+       workers detect any errors

Okay. Fixed.

src/bin/psql/describe.c
(2)
I think that:

+ gettext_noop("Disable On Error"));

should be:

+ gettext_noop("Disable on error"));

for consistency with the uppercase/lowercase usage on other similar entries?
(e.g. "Two phase commit")

Agreed. Fixed.

src/include/catalog/pg_subscription.h
(3)

+  bool subdisableonerr; /* True if apply errors should disable the
+                                       * subscription upon error */

The comment should just say "True if occurrence of apply errors should disable
the subscription"

Fixed.

Attached the updated patch v14.

Best Regards,
Takamichi Osumi

Attachments:

v14-0001-Optionally-disable-subscriptions-on-error.patch (application/octet-stream)

#71osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: osumi.takamichi@fujitsu.com (#70)
RE: Optionally automatically disable logical replication subscriptions on error

On Thursday, December 16, 2021 9:51 PM I wrote:

Attached the updated patch v14.

FYI, I've tested the disable_on_error flag with pg_upgrade. I prepared
PG14 and HEAD with the disable_on_error patch applied.
Then I set up a logical replication pair, publisher and subscriber, on PG14
and ran pg_upgrade on the publisher and the subscriber individually.

After the upgrade, I confirmed on the subscriber that disable_on_error is false
in both pg_subscription and \dRs+, as expected.

Best Regards,
Takamichi Osumi

#72osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: osumi.takamichi@fujitsu.com (#71)
RE: Optionally automatically disable logical replication subscriptions on error

On Tuesday, December 21, 2021 11:18 PM I wrote:

On Thursday, December 16, 2021 9:51 PM I wrote:

Attached the updated patch v14.

FYI, I've conducted a test of disable_on_error flag using pg_upgrade. I
prepared PG14 and HEAD applied with disable_on_error patch.
Then, I setup a logical replication pair of the publisher and the subscriber by 14
and executed pg_upgrade for both the publisher and the subscriber
individually.

After the upgrade, on the subscriber, I've confirmed the disable_on_error is
false via both pg_subscription and \dRs+, as expected.

Additionally, I ran the new TAP test, 027_disable_on_error.pl,
100 times sequentially in a tight loop.
There were no failures, which suggests that
the test has no timing issues.

Best Regards,
Takamichi Osumi

#73wangw.fnst@fujitsu.com
wangw.fnst@fujitsu.com
wangw.fnst@fujitsu.com
In reply to: osumi.takamichi@fujitsu.com (#70)
RE: Optionally automatically disable logical replication subscriptions on error

On Thursday, December 16, 2021 8:51 PM osumi.takamichi@fujitsu.com <osumi.takamichi@fujitsu.com> wrote:

Attached the updated patch v14.

A comment on the timing of a log message:
After the log message [1] was printed, I altered the subscription's option
(DISABLE_ON_ERROR) from true to false before DisableSubscriptionOnError was
invoked to disable the subscription. The subscription was not disabled.
[1] "LOG: logical replication subscription "sub1" will be disabled due to an error"

I found this log is printed in function WorkerErrorRecovery:
+	ereport(LOG,
+			errmsg("logical replication subscription \"%s\" will be disabled due to an error",
+				   MySubscription->name));
The message is printed here, but DisableSubscriptionOnError then checks the
subscription's disableonerr field. If that field is found to have changed from
true to false, the subscription is not disabled.

In that case, the log says the subscription will be disabled, but it is not
actually disabled.
I think this is a little confusing to users, so how about moving this message
after the check mentioned above in DisableSubscriptionOnError?
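
To make the suggested ordering concrete, here is a minimal, self-contained sketch in which the message is produced only after the flag has been re-checked. It is an illustration under stated assumptions, not the patch's code: SubscriptionRow and disable_subscription_on_error are hypothetical stand-ins for the catalog tuple and DisableSubscriptionOnError, and the log line is written into a caller-supplied buffer instead of going through ereport().

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the pg_subscription row of one subscription. */
typedef struct SubscriptionRow
{
    const char *name;
    bool        enabled;
    bool        disableonerr;
} SubscriptionRow;

/*
 * Disable the subscription if its disableonerr flag is still set.
 * The log text is written to logbuf only when the subscription really is
 * going to be disabled; otherwise logbuf is left empty.
 * Returns true if the subscription was disabled.
 */
static bool
disable_subscription_on_error(SubscriptionRow *row, char *logbuf,
                              size_t logbuf_size)
{
    logbuf[0] = '\0';

    /* The flag may have been changed since the worker started. */
    if (!row->disableonerr)
        return false;

    /* Only now is it accurate to announce the disabling. */
    snprintf(logbuf, logbuf_size,
             "logical replication subscription \"%s\" will be disabled due to an error",
             row->name);
    row->enabled = false;
    return true;
}
```

With this ordering, a concurrent change of the flag to false leaves both the log and the subscription untouched.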

Regards,
Wang wei

#74osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: wangw.fnst@fujitsu.com (#73)
1 attachment(s)
RE: Optionally automatically disable logical replication subscriptions on error

On Tuesday, December 28, 2021 11:53 AM Wang, Wei/王 威 <wangw.fnst@fujitsu.com> wrote:

On Thursday, December 16, 2021 8:51 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

Attached the updated patch v14.

A comment to the timing of printing a log:

Thank you for your review !

After the log[1] was printed, I altered subscription's option
(DISABLE_ON_ERROR) from true to false before invoking
DisableSubscriptionOnError to disable subscription. Subscription was not
disabled.
[1] "LOG: logical replication subscription "sub1" will be disabled due to an
error"

I found this log is printed in function WorkerErrorRecovery:
+	ereport(LOG,
+			errmsg("logical replication subscription \"%s\" will
be disabled due to an error",
+				   MySubscription->name));
This log is printed here, but in DisableSubscriptionOnError, there is a check to
confirm subscription's disableonerr field. If disableonerr is found changed from
true to false in DisableSubscriptionOnError, subscription will not be disabled.

In this case, "disable subscription" is printed, but subscription will not be
disabled actually.
I think it is a little confused to user, so what about moving this message after
the check which is mentioned above in DisableSubscriptionOnError?

Makes sense. I moved the log message after
the check of whether the subscription needs to be disabled.

I've also scrutinized and refactored the new TAP test.
I simplified wait_for_subscriptions()
by removing unnecessary code,
such as an escaped variable and one part of the WHERE clause.
I also replaced two calls of wait_for_subscriptions()
with poll_query_until() on the subscriber, in order to
make the tests better suited to their purpose.
After this refinement, I again ran the test in a tight loop
to check for timing issues and found no problems.

Best Regards,
Takamichi Osumi

Attachments:

v15-0001-Optionally-disable-subscriptions-on-error.patch (application/octet-stream)

#75tanghy.fnst@fujitsu.com
tanghy.fnst@fujitsu.com
tanghy.fnst@fujitsu.com
In reply to: osumi.takamichi@fujitsu.com (#74)
RE: Optionally automatically disable logical replication subscriptions on error

On Wednesday, January 5, 2022 8:53 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

On Tuesday, December 28, 2021 11:53 AM Wang, Wei/王 威
<wangw.fnst@fujitsu.com> wrote:

On Thursday, December 16, 2021 8:51 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

Attached the updated patch v14.

A comment to the timing of printing a log:

Thank you for your review !

After the log[1] was printed, I altered subscription's option
(DISABLE_ON_ERROR) from true to false before invoking
DisableSubscriptionOnError to disable subscription. Subscription was not
disabled.
[1] "LOG: logical replication subscription "sub1" will be disabled due to an
error"

I found this log is printed in function WorkerErrorRecovery:
+	ereport(LOG,
+			errmsg("logical replication subscription \"%s\" will
be disabled due to an error",
+				   MySubscription->name));
This log is printed here, but in DisableSubscriptionOnError, there is a check to
confirm subscription's disableonerr field. If disableonerr is found changed from
true to false in DisableSubscriptionOnError, subscription will not be disabled.

In this case, "disable subscription" is printed, but subscription will not be
disabled actually.
I think it is a little confused to user, so what about moving this message after
the check which is mentioned above in DisableSubscriptionOnError?

Makes sense. I moved the log print after
the check of the necessity to disable the subscription.

Also, I've scrutinized and refined the new TAP test as well for refactoring.
As a result, I fixed wait_for_subscriptions()
so that some extra codes that can be simplified,
such as escaped variable and one part of WHERE clause, are removed.
Other change I did is to replace two calls of wait_for_subscriptions()
with polling_query_until() for the subscriber, in order to
make the tests better and more suitable for the test purposes.
Again, for this refinement, I've conducted a tight loop test
to check no timing issue and found no problem.

Thanks for updating the patch. Here are some comments:

1)
+	/*
+	 * We would not be here unless this subscription's disableonerr field was
+	 * true when our worker began applying changes, but check whether that
+	 * field has changed in the interim.
+	 */
+	if (!subform->subdisableonerr)
+	{
+		/*
+		 * Disabling the subscription has been done already. No need of
+		 * additional work.
+		 */
+		heap_freetuple(tup);
+		table_close(rel, RowExclusiveLock);
+		CommitTransactionCommand();
+		return;
+	}

I don't understand what "Disabling the subscription has been done already"
means; I think we only get here when subdisableonerr has been changed in the
interim. Should we modify this comment? Or remove it, since the preceding
comment already explains this?

2)
+	/* Set the subscription to disabled, and note the reason. */
+	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(false);
+	replaces[Anum_pg_subscription_subenabled - 1] = true;

I didn't see the code corresponding to "note the reason". Should we modify the
comment?

3)
+ bool disableonerr; /* Whether errors automatically disable */

This comment is hard to understand. Maybe it can be changed to:

Indicates if the subscription should be automatically disabled when subscription
workers detect any errors.

Regards,
Tang

#76osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: tanghy.fnst@fujitsu.com (#75)
1 attachment(s)
RE: Optionally automatically disable logical replication subscriptions on error

On Thursday, January 6, 2022 12:17 PM Tang, Haiying/唐 海英 <tanghy.fnst@fujitsu.com> wrote:

Thanks for updating the patch. Here are some comments:

Thank you for your review !

1)
+	/*
+	 * We would not be here unless this subscription's disableonerr field was
+	 * true when our worker began applying changes, but check whether that
+	 * field has changed in the interim.
+	 */
+	if (!subform->subdisableonerr)
+	{
+		/*
+		 * Disabling the subscription has been done already. No need of
+		 * additional work.
+		 */
+		heap_freetuple(tup);
+		table_close(rel, RowExclusiveLock);
+		CommitTransactionCommand();
+		return;
+	}

I don't understand what does "Disabling the subscription has been done
already"
mean, I think we only run here when subdisableonerr is changed in the interim.
Should we modify this comment? Or remove it because there are already some
explanations before.

Removed. The description you pointed out was redundant.

2)
+	/* Set the subscription to disabled, and note the reason. */
+	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(false);
+	replaces[Anum_pg_subscription_subenabled - 1] = true;

I didn't see the code corresponding to "note the reason". Should we modify the
comment?

Fixed the comment by removing that part.
We get here when an error has occurred, and the reason is already printed
in the log, so there is no need to note it again.

3)
+ bool disableonerr; /* Whether errors automatically disable */

This comment is hard to understand. Maybe it can be changed to:

Indicates if the subscription should be automatically disabled when
subscription workers detect any errors.

Agreed. Fixed.

Kindly have a look at the attached v16.

Best Regards,
Takamichi Osumi

Attachments:

v16-0001-Optionally-disable-subscriptions-on-error.patch (application/octet-stream)

#77Amit Kapila
Amit Kapila
amit.kapila16@gmail.com
In reply to: osumi.takamichi@fujitsu.com (#76)
Re: Optionally automatically disable logical replication subscriptions on error

On Thu, Jan 6, 2022 at 11:23 AM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

Kindly have a look at the attached v16.

Few comments:
=============
1.
@@ -3594,13 +3698,29 @@ ApplyWorkerMain(Datum main_arg)
    apply_error_callback_arg.command,
    apply_error_callback_arg.remote_xid,
    errdata->message);
- MemoryContextSwitchTo(ecxt);
+
+ if (!MySubscription->disableonerr)
+ {
+ /*
+ * Some work in error recovery work is done. Switch to the old
+ * memory context and rethrow.
+ */
+ MemoryContextSwitchTo(ecxt);
+ PG_RE_THROW();
+ }
  }
+ else if (!MySubscription->disableonerr)
+ PG_RE_THROW();

- PG_RE_THROW();

Can't we combine these two different checks for
'MySubscription->disableonerr' if you do it as a separate if check
after sending the stats message?

2. Can we move the code related to the tablesync worker and its error
handling (the code inside if (am_tablesync_worker())) to a separate
function, say LogicalRepHandleTableSync() or something like that?

3. Similarly, we can move apply-loop related code ("Run the main
loop.") to a separate function say LogicalRepHandleApplyMessages().

If we do (2) and (3), I think the code in ApplyWorkerMain will look
better. What do you think?

--
With Regards,
Amit Kapila.

#78osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Amit Kapila (#77)
1 attachment(s)
RE: Optionally automatically disable logical replication subscriptions on error

On Monday, February 14, 2022 8:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jan 6, 2022 at 11:23 AM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

Kindly have a look at the attached v16.

Few comments:

Hi, thank you for checking the patch !

=============
1.
@@ -3594,13 +3698,29 @@ ApplyWorkerMain(Datum main_arg)
apply_error_callback_arg.command,
apply_error_callback_arg.remote_xid,
errdata->message);
- MemoryContextSwitchTo(ecxt);
+
+ if (!MySubscription->disableonerr)
+ {
+ /*
+ * Some work in error recovery work is done. Switch to the old
+ * memory context and rethrow.
+ */
+ MemoryContextSwitchTo(ecxt);
+ PG_RE_THROW();
+ }
}
+ else if (!MySubscription->disableonerr) PG_RE_THROW();

- PG_RE_THROW();

Can't we combine these two different checks for
'MySubscription->disableonerr' if you do it as a separate if check after sending
the stats message?

No, we can't. The second check of MySubscription->disableonerr is for the case
where apply_error_callback_arg.command equals 0. We disable the subscription
on any error, so in that case too we need to rethrow the error
if the disableonerr flag is not set to true.

So the check can't simply be moved to after sending
the stats message. At the same time, if we moved
the disableonerr check outside of the apply_error_callback_arg.command condition
branch, we would need to write another call of pgstat_report_subworker_error with the
same arguments that we have now. That wouldn't be preferable either.
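
The branch structure described above can be modeled in a small standalone sketch. It is only an illustration under assumptions: handle_apply_error, have_command and stats_reports are hypothetical names, the enum values stand in for PG_RE_THROW() and for error recovery plus disabling, and the counter stands in for the call to pgstat_report_subworker_error.

```c
#include <stdbool.h>

/* Possible outcomes of the simulated PG_CATCH() handling. */
typedef enum ErrorAction
{
    ACTION_RETHROW,     /* stands in for PG_RE_THROW() */
    ACTION_DISABLE      /* stands in for error recovery plus disabling */
} ErrorAction;

/* Counts simulated calls to pgstat_report_subworker_error(). */
static int stats_reports = 0;

/*
 * have_command models "apply_error_callback_arg.command != 0";
 * disableonerr models MySubscription->disableonerr.
 */
static ErrorAction
handle_apply_error(bool have_command, bool disableonerr)
{
    if (have_command)
    {
        /* The error is reported to the stats collector only in this case. */
        stats_reports++;

        if (!disableonerr)
            return ACTION_RETHROW;  /* after switching back to the old context */
    }
    else if (!disableonerr)
        return ACTION_RETHROW;      /* nothing was reported; simply rethrow */

    /* Common tail: reached on any error when disableonerr is set. */
    return ACTION_DISABLE;
}
```

In this model the stats report happens exactly once per error that has a command, and the disabling path is reached for every error when the flag is set, which is the behavior the two separate checks are meant to preserve.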

2. Can we move the code related to the tablesync worker and its error handling
(the code inside if (am_tablesync_worker())) to a separate function, say
LogicalRepHandleTableSync() or something like that?

3. Similarly, we can move apply-loop related code ("Run the main
loop.") to a separate function say LogicalRepHandleApplyMessages().

If we do (2) and (3), I think the code in ApplyWorkerMain will look better. What
do you think?

I agree with (2) and (3), since those contribute to better readability.

Attached a new patch v17 that addresses those refactorings.

Best Regards,
Takamichi Osumi

Attachments:

v17-0001-Optionally-disable-subscriptions-on-error.patch (application/octet-stream)

#79osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: osumi.takamichi@fujitsu.com (#78)
1 attachment(s)
RE: Optionally automatically disable logical replication subscriptions on error

On Tuesday, February 15, 2022 2:19 PM I wrote

On Monday, February 14, 2022 8:58 PM Amit Kapila

2. Can we move the code related to the tablesync worker and its error
handling (the code inside if (am_tablesync_worker())) to a separate
function, say LogicalRepHandleTableSync() or something like that?

3. Similarly, we can move apply-loop related code ("Run the main
loop.") to a separate function say LogicalRepHandleApplyMessages().

If we do (2) and (3), I think the code in ApplyWorkerMain will look
better. What do you think?

I agree with (2) and (3), since those contribute to better readability.

Attached a new patch v17 that addresses those refactorings.

Hi, I noticed that a new TAP test was added in src/test/subscription/,
so I needed to increment the number of the test in this patch.

I also made minor fixes to comments and a function name.
Kindly have a look at the attached v18.

Best Regards,
Takamichi Osumi

Attachments:

v18-0001-Optionally-disable-subscriptions-on-error.patchapplication/octet-stream; name=v18-0001-Optionally-disable-subscriptions-on-error.patch
#80Peter Smith
Peter Smith
smithpb2250@gmail.com
In reply to: osumi.takamichi@fujitsu.com (#79)
1 attachment(s)
Re: Optionally automatically disable logical replication subscriptions on error

Hi. Below are my code review comments for v18.

==========

1. Commit Message - wording

BEFORE
To partially remedy the situation, adding a new subscription_parameter
named 'disable_on_error'.

AFTER
To partially remedy the situation, this patch adds a new
subscription_parameter named 'disable_on_error'.

~~~

2. Commit message - wording

BEFORE
Require to bump catalog version.

AFTER
A catalog version bump is required.

~~~

3. doc/src/sgml/ref/alter_subscription.sgml - whitespace

@@ -201,8 +201,8 @@ ALTER SUBSCRIPTION <replaceable
class="parameter">name</replaceable> RENAME TO <
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
       <literal>synchronous_commit</literal>,
-      <literal>binary</literal>, and
-      <literal>streaming</literal>.
+      <literal>binary</literal>,<literal>streaming</literal>, and
+      <literal>disable_on_error</literal>.
      </para>

There is a missing space before <literal>streaming</literal>.

~~~

4. src/backend/replication/logical/worker.c - WorkerErrorRecovery

@@ -2802,6 +2803,89 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
}

 /*
+ * Worker error recovery processing, in preparation for disabling the
+ * subscription.
+ */
+static void
+WorkerErrorRecovery(void)

I was wondering about the need for this to be a separate function? It
is only called immediately before calling 'DisableSubscriptionOnError'
so would it maybe be better just to put this code inside
DisableSubscriptionOnError with the appropriate comments?

~~~

5. src/backend/replication/logical/worker.c - DisableSubscriptionOnError

+ /*
+ * We would not be here unless this subscription's disableonerr field was
+ * true when our worker began applying changes, but check whether that
+ * field has changed in the interim.
+ */

Apparently, this function might just do nothing if it detects some
situation where the flag was changed somehow, but I’m not 100% sure
that the callers are properly catering for when nothing happens.

IMO it would be better if this function would return true/false to
mean "did disable subscription happen or not?" because that will give
the calling code the chance to check the function return and do the
right thing - e.g. if the caller first thought it should be disabled
but then it turned out it did NOT disable...

~~~

6. src/backend/replication/logical/worker.c - LogicalRepHandleTableSync name

+/*
+ * Execute the initial sync with error handling. Disable the subscription,
+ * if it's required.
+ */
+static void
+LogicalRepHandleTableSync(XLogRecPtr *origin_startpos,
+   char **myslotname, MemoryContext cctx)

I felt that it is a bit overkill to put a "LogicalRep" prefix here
because it is a static function.

IMO this function should be renamed as 'SyncTableStartWrapper' because
that describes better what it is doing.

~~~

7. src/backend/replication/logical/worker.c - LogicalRepHandleTableSync Assert

Even though we can know this to be true because of where it is called
from, I think the readability of the function will be improved if you
add an assertion at the top:

Assert(am_tablesync_worker());

And then, because the function is clearly for Tablesync worker only
there is no need to keep mentioning that in the subsequent comments...

e.g.1
/* This is table synchronization worker, call initial sync. */
AFTER:
/* Call initial sync. */

e.g.2
/*
* Report the table sync error. There is no corresponding message type
* for table synchronization.
*/
AFTER
/*
* Report the error. There is no corresponding message type for table
* synchronization.
*/

~~~

8. src/backend/replication/logical/worker.c -
LogicalRepHandleTableSync unnecessarily complex

+static void
+LogicalRepHandleTableSync(XLogRecPtr *origin_startpos,
+   char **myslotname, MemoryContext cctx)
+{
+ char    *syncslotname;
+ bool error_recovery_done = false;

IMO this logic is way more complex than it needed to be. IIUC that
'error_recovery_done' and various conditions can be removed, and the
whole thing be simplified quite a lot.

I re-wrote this function as a POC. Please see the attached file [2].
All the tests are still passing OK.

(Perhaps the scenario for my comment #5 above still needs to be addressed?)

~~~

9. src/backend/replication/logical/worker.c - LogicalRepHandleApplyMessages name

+/*
+ * Run the apply loop with error handling. Disable the subscription,
+ * if necessary.
+ */
+static void
+LogicalRepHandleApplyMessages(XLogRecPtr origin_startpos,
+   MemoryContext cctx)

I felt that it is a bit overkill to put a "LogicalRep" prefix here
because it is a static function.

IMO this function should be renamed as 'ApplyLoopWrapper' because that
describes better what it is doing.

~~~

10. src/backend/replication/logical/worker.c -
LogicalRepHandleApplyMessages unnecessarily complex

+static void
+LogicalRepHandleApplyMessages(XLogRecPtr origin_startpos,
+   MemoryContext cctx)
+{
+ bool error_recovery_done = false;

IMO this logic is way more complex than it needed to be. IIUC that
'error_recovery_done' and various conditions can be removed, and the
whole thing be simplified quite a lot.

I re-wrote this function as a POC. Please see the attached file [2].
All the tests are still passing OK.

(Perhaps the scenario for my comment #5 above still needs to be addressed?)

~~~

11. src/bin/pg_dump/pg_dump.c - dumpSubscription

@@ -4441,6 +4451,9 @@ dumpSubscription(Archive *fout, const
SubscriptionInfo *subinfo)
if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
appendPQExpBufferStr(query, ", two_phase = on");

+ if (strcmp(subinfo->subdisableonerr, "f") != 0)
+ appendPQExpBufferStr(query, ", disable_on_error = on");
+

I felt saying disable_on_error is "true" would look more natural than
saying it is "on".

~~~

12. src/bin/psql/describe.c - describeSubscriptions typo

@@ -6096,11 +6096,13 @@ describeSubscriptions(const char *pattern, bool verbose)
gettext_noop("Binary"),
gettext_noop("Streaming"));

- /* Two_phase is only supported in v15 and higher */
+ /* Two_phase and disable_on_error is only supported in v15 and higher */

Typo

"is only" --> "are only"

~~~

13. src/include/catalog/pg_subscription.h - comments

@@ -103,6 +106,9 @@ typedef struct Subscription
  * binary format */
  bool stream; /* Allow streaming in-progress transactions. */
  char twophasestate; /* Allow streaming two-phase transactions */
+ bool disableonerr; /* Indicates if the subscription should be
+ * automatically disabled when subscription
+ * workers detect any errors. */

It's not usual to have a full stop here.
Maybe not needed to repeat the word "subscription".
IMO, generally, it all can be simplified a bit.

BEFORE
Indicates if the subscription should be automatically disabled when
subscription workers detect any errors.

AFTER
Indicates if the subscription should be automatically disabled if a
worker error occurs

~~~

14. src/test/regress/sql/subscription.sql - missing test case.

The "conflicting options" error from the below code is not currently
being tested.

@@ -249,6 +253,15 @@ parse_subscription_options(ParseState *pstate,
List *stmt_options,
  opts->specified_opts |= SUBOPT_TWOPHASE_COMMIT;
  opts->twophase = defGetBoolean(defel);
  }
+ else if (IsSet(supported_opts, SUBOPT_DISABLE_ON_ERR) &&
+ strcmp(defel->defname, "disable_on_error") == 0)
+ {
+ if (IsSet(opts->specified_opts, SUBOPT_DISABLE_ON_ERR))
+ errorConflictingDefElem(defel, pstate);
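
To make the untested path concrete, here is a minimal, self-contained sketch of the same duplicate-detection idea: each option sets one bit in a mask, and seeing a bit that is already set is the "conflicting options" case. The names OPT_TWOPHASE, OPT_DISABLE_ON_ERR, IS_SET and parse_option_names are hypothetical simplifications, not the patch's code, and the sketch returns false where the real code raises an error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical option bits, loosely modeled on the SUBOPT_* flags. */
#define OPT_TWOPHASE        0x01u
#define OPT_DISABLE_ON_ERR  0x02u

#define IS_SET(val, bits)   (((val) & (bits)) == (bits))

/*
 * Scan option names and record each one in *specified.  Returns false as
 * soon as an option appears twice (the "conflicting options" case) or is
 * not recognized; the real code raises an error instead of returning.
 */
static bool
parse_option_names(const char *const *names, size_t n, unsigned *specified)
{
    assert(names != NULL && specified != NULL);
    *specified = 0;

    for (size_t i = 0; i < n; i++)
    {
        unsigned    bit;

        if (strcmp(names[i], "two_phase") == 0)
            bit = OPT_TWOPHASE;
        else if (strcmp(names[i], "disable_on_error") == 0)
            bit = OPT_DISABLE_ON_ERR;
        else
            return false;       /* unrecognized option */

        if (IS_SET(*specified, bit))
            return false;       /* same option given twice */
        *specified |= bit;
    }
    return true;
}
```

A regression test for this path would presumably pass disable_on_error twice in one WITH (...) list and expect the statement to be rejected.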

~~~

15. src/test/subscription/t/028_disable_on_error.pl - 028 clash

Just a heads-up that this 028 is going to clash with the Row-Filter
patch 028 which has been announced to be pushed soon, so be prepared
to change this number again shortly :)

~~~

16. src/test/subscription/t/028_disable_on_error.pl - done_testing

AFAIK there is a new style now for the TAP tests, which uses
"done_testing();" instead of saying up-front how many tests there are.
See here [1].

~~~

17. src/test/subscription/t/028_disable_on_error.pl - more comments

+# Create an additional unique index in schema s1 on the subscriber only.  When
+# we create subscriptions, below, this should cause subscription "s1" on the
+# subscriber to fail during initial synchronization and to get automatically
+# disabled.

I felt it could be made a bit more obvious upfront in a comment that 2
pairs of pub/sub will be created, and their names will be the same as the
schemas:
e.g.
Publisher "s1" --> Subscriber "s1"
Publisher "s2" --> Subscriber "s2"

~~~

18. src/test/subscription/t/028_disable_on_error.pl - ALTER tests?

The tests here are only using the hardwired 'disable_on_error' options
set at CREATE SUBSCRIPTION time. There are no TAP tests for changing
the disable_on_error using ALTER SUBSCRIPTION.

Should there be?

------
[1]: https://github.com/postgres/postgres/commit/549ec201d6132b7c7ee11ee90a4e02119259ba5b
[2]: worker.c.peter.txt is the same as your v18 worker.c, but I re-wrote the
functions LogicalRepHandleTableSync and LogicalRepHandleApplyMessages as a POC.

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

worker.c.peter.txttext/plain; charset=US-ASCII; name=worker.c.peter.txt
#81osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Peter Smith (#80)
1 attachment(s)
RE: Optionally automatically disable logical replication subscriptions on error

On Friday, February 18, 2022 3:27 PM Peter Smith <smithpb2250@gmail.com> wrote:

Hi. Below are my code review comments for v18.

Thank you for your review !

==========

1. Commit Message - wording

BEFORE
To partially remedy the situation, adding a new subscription_parameter named
'disable_on_error'.

AFTER
To partially remedy the situation, this patch adds a new
subscription_parameter named 'disable_on_error'.

Fixed.

~~~

2. Commit message - wording

BEFORE
Require to bump catalog version.

AFTER
A catalog version bump is required.

Fixed.

~~~

3. doc/src/sgml/ref/alter_subscription.sgml - whitespace

@@ -201,8 +201,8 @@ ALTER SUBSCRIPTION <replaceable
class="parameter">name</replaceable> RENAME TO <
information.  The parameters that can be altered
are <literal>slot_name</literal>,
<literal>synchronous_commit</literal>,
-      <literal>binary</literal>, and
-      <literal>streaming</literal>.
+      <literal>binary</literal>,<literal>streaming</literal>, and
+      <literal>disable_on_error</literal>.
</para>

There is a missing space before <literal>streaming</literal>.

Fixed.

~~~

4. src/backend/replication/logical/worker.c - WorkerErrorRecovery

@@ -2802,6 +2803,89 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
}

/*
+ * Worker error recovery processing, in preparation for disabling the
+ * subscription.
+ */
+static void
+WorkerErrorRecovery(void)

I was wondering about the need for this to be a separate function? It is only
called immediately before calling 'DisableSubscriptionOnError'
so would it maybe be better just to put this code inside
DisableSubscriptionOnError with the appropriate comments?

I preferred to have a function specific to error handling,
because on the caller side it makes it apparent that error
recovery is done when we catch an error. But the function name
"DisableSubscriptionOnError" already implies that we do something
on error, so it's okay to have the error recovery processing
in that function.

So, I removed the function and fixed some related comments.

~~~

5. src/backend/replication/logical/worker.c - DisableSubscriptionOnError

+ /*
+ * We would not be here unless this subscription's disableonerr field was
+ * true when our worker began applying changes, but check whether that
+ * field has changed in the interim.
+ */

Apparently, this function might just do nothing if it detects some situation
where the flag was changed somehow, but I'm not 100% sure that the callers
are properly catering for when nothing happens.

IMO it would be better if this function would return true/false to mean "did
disable subscription happen or not?" because that will give the calling code the
chance to check the function return and do the right thing - e.g. if the caller first
thought it should be disabled but then it turned out it did NOT disable...

I don't think we need to do anything more.
After this function, the table sync worker and the apply worker
just exit. IMO, we don't need to do additional work on the
caller side for an already-disabled subscription.
It should be sufficient to fulfill the purpose of
DisableSubscriptionOnError, or to confirm that it has already been fulfilled.

~~~

6. src/backend/replication/logical/worker.c - LogicalRepHandleTableSync
name

+/*
+ * Execute the initial sync with error handling. Disable the subscription,
+ * if it's required.
+ */
+static void
+LogicalRepHandleTableSync(XLogRecPtr *origin_startpos,
+   char **myslotname, MemoryContext cctx)

I felt that it is a bit overkill to put a "LogicalRep" prefix here because it is a static
function.

IMO this function should be renamed as 'SyncTableStartWrapper' because that
describes better what it is doing.

Makes sense. Fixed.

~~~

7. src/backend/replication/logical/worker.c - LogicalRepHandleTableSync
Assert

Even though we can know this to be true because of where it is called from, I
think the readability of the function will be improved if you add an assertion at
the top:

Assert(am_tablesync_worker());

Fixed.

And then, because the function is clearly for Tablesync worker only there is no
need to keep mentioning that in the subsequent comments...

e.g.1
/* This is table synchronization worker, call initial sync. */
AFTER:
/* Call initial sync. */

Fixed.

e.g.2
/*
* Report the table sync error. There is no corresponding message type
* for table synchronization.
*/
AFTER
/*
* Report the error. There is no corresponding message type for table
* synchronization.
*/

Agreed. Fixed

~~~

8. src/backend/replication/logical/worker.c - LogicalRepHandleTableSync
unnecessarily complex

+static void
+LogicalRepHandleTableSync(XLogRecPtr *origin_startpos,
+   char **myslotname, MemoryContext cctx)
+{
+ char    *syncslotname;
+ bool error_recovery_done = false;

IMO this logic is way more complex than it needed to be. IIUC that
'error_recovery_done' and various conditions can be removed, and the whole
thing be simplified quite a lot.

I re-wrote this function as a POC. Please see the attached file [2].
All the tests are still passing OK.

(Perhaps the scenario for my comment #5 above still needs to be addressed?)

Removed the 'error_recovery_done' flag and fixed.

~~~

9. src/backend/replication/logical/worker.c -
LogicalRepHandleApplyMessages name

+/*
+ * Run the apply loop with error handling. Disable the subscription,
+ * if necessary.
+ */
+static void
+LogicalRepHandleApplyMessages(XLogRecPtr origin_startpos,
+   MemoryContext cctx)

I felt that it is a bit overkill to put a "LogicalRep" prefix here because it is a static
function.

IMO this function should be renamed as 'ApplyLoopWrapper' because that
describes better what it is doing.

Fixed.

~~~

10. src/backend/replication/logical/worker.c -
LogicalRepHandleApplyMessages unnecessarily complex

+static void
+LogicalRepHandleApplyMessages(XLogRecPtr origin_startpos,
+   MemoryContext cctx)
+{
+ bool error_recovery_done = false;

IMO this logic is way more complex than it needed to be. IIUC that
'error_recovery_done' and various conditions can be removed, and the whole
thing be simplified quite a lot.

I re-wrote this function as a POC. Please see the attached file [2].
All the tests are still passing OK.

(Perhaps the scenario for my comment #5 above still needs to be addressed?)

Fixed.

~~~

11. src/bin/pg_dump/pg_dump.c - dumpSubscription

@@ -4441,6 +4451,9 @@ dumpSubscription(Archive *fout, const
SubscriptionInfo *subinfo)
if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
appendPQExpBufferStr(query, ", two_phase = on");

+ if (strcmp(subinfo->subdisableonerr, "f") != 0)
+ appendPQExpBufferStr(query, ", disable_on_error = on");
+

I felt saying disable_on_err is "true" would look more natural than saying it is
"on".

Fixed.

~~~

12. src/bin/psql/describe.c - describeSubscriptions typo

@@ -6096,11 +6096,13 @@ describeSubscriptions(const char *pattern, bool verbose)
gettext_noop("Binary"),
gettext_noop("Streaming"));

- /* Two_phase is only supported in v15 and higher */
+ /* Two_phase and disable_on_error is only supported in v15 and higher */

Typo

"is only" --> "are only"

Fixed.

~~~

13. src/include/catalog/pg_subscription.h - comments

@@ -103,6 +106,9 @@ typedef struct Subscription
* binary format */
bool stream; /* Allow streaming in-progress transactions. */
char twophasestate; /* Allow streaming two-phase transactions */
+ bool disableonerr; /* Indicates if the subscription should be
+ * automatically disabled when subscription
+ * workers detect any errors. */

It's not usual to have a full stop here.
Maybe not needed to repeat the word "subscription".
IMO, generally, it all can be simplified a bit.

BEFORE
Indicates if the subscription should be automatically disabled when
subscription workers detect any errors.

AFTER
Indicates if the subscription should be automatically disabled if a worker error
occurs

Fixed.

~~~

14. src/test/regress/sql/subscription.sql - missing test case.

The "conflicting options" error from the below code is not currently being
tested.

@@ -249,6 +253,15 @@ parse_subscription_options(ParseState *pstate,
List *stmt_options,
opts->specified_opts |= SUBOPT_TWOPHASE_COMMIT;
opts->twophase = defGetBoolean(defel);
}
+ else if (IsSet(supported_opts, SUBOPT_DISABLE_ON_ERR) &&
+ strcmp(defel->defname, "disable_on_error") == 0)
+ {
+ if (IsSet(opts->specified_opts, SUBOPT_DISABLE_ON_ERR))
+ errorConflictingDefElem(defel, pstate);

We don't have this test for the other options either,
so this patch should stay aligned with them.

~~~

15. src/test/subscription/t/028_disable_on_error.pl - 028 clash

Just a heads-up that this 028 is going to clash with the Row-Filter patch 028
which has been announced to be pushed soon, so be prepared to change this
number again shortly :)

Thank you for letting me know.

~~~

16. src/test/subscription/t/028_disable_on_error.pl - done_testing

AFAIK there is a new style now for the TAP tests, which uses "done_testing();"
instead of saying up-front how many tests there are.
See here [1].

Fixed.

~~~

17. src/test/subscription/t/028_disable_on_error.pl - more comments

+# Create an additional unique index in schema s1 on the subscriber only.  When
+# we create subscriptions, below, this should cause subscription "s1" on the
+# subscriber to fail during initial synchronization and to get automatically
+# disabled.

I felt it could be made a bit more obvious upfront in a comment that 2 pairs of
pub/sub will be created, and their names will be the same as the
schemas:
e.g.
Publisher "s1" --> Subscriber "s1"
Publisher "s2" --> Subscriber "s2"

Comments are fixed.

~~~

18. src/test/subscription/t/028_disable_on_error.pl - ALTER tests?

The tests here are only using the hardwired 'disable_on_error' options set at
CREATE SUBSCRIPTION time. There are no TAP tests for changing the
disable_on_error using ALTER SUBSCRIPTION.

Should there be?

I don't think so. Toggling the 'disable_on_error' flag is already tested
in the subscription.sql file, and both new paths that disable on error
(table sync worker and apply worker) are already covered.

FYI: I skipped one change from worker.c.peter.txt
about the "enabled" flag, which is independent of the
disable_on_error option.

Kindly have a look at the attached v19.

Best Regards,
Takamichi Osumi

Attachments:

v19-0001-Optionally-disable-subscriptions-on-error.patchapplication/octet-stream; name=v19-0001-Optionally-disable-subscriptions-on-error.patch
#82Peter Smith
Peter Smith
smithpb2250@gmail.com
In reply to: osumi.takamichi@fujitsu.com (#81)
1 attachment(s)
Re: Optionally automatically disable logical replication subscriptions on error

Thanks for addressing my previous comments. Now I have looked at v19.

On Mon, Feb 21, 2022 at 11:25 AM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

On Friday, February 18, 2022 3:27 PM Peter Smith <smithpb2250@gmail.com> wrote:

Hi. Below are my code review comments for v18.

Thank you for your review !

...

5. src/backend/replication/logical/worker.c - DisableSubscriptionOnError

+ /*
+ * We would not be here unless this subscription's disableonerr field was
+ * true when our worker began applying changes, but check whether that
+ * field has changed in the interim.
+ */

Apparently, this function might just do nothing if it detects some situation
where the flag was changed somehow, but I'm not 100% sure that the callers
are properly catering for when nothing happens.

IMO it would be better if this function would return true/false to mean "did
disable subscription happen or not?" because that will give the calling code the
chance to check the function return and do the right thing - e.g. if the caller first
thought it should be disabled but then it turned out it did NOT disable...

I don't think we need to do something more.
After this function, table sync worker and the apply worker
just exit. IMO, we don't need to do additional work for
already-disabled subscription on the caller sides.
It should be sufficient to fulfill the purpose of
DisableSubscriptionOnError or confirm it has been fulfilled.

Hmmm - yeah, the workers may just exit soon afterwards anyhow, as you
say, so everything comes out in the wash. Still, I felt that when
DisableSubscriptionOnError turns out to do nothing, it would be better
to exit via the existing logic. And that is easy to do if the function
returns true/false.

For example, changes like below seemed neater code to me. YMMV.

BEFORE (SyncTableStartWrapper):
if (MySubscription->disableonerr)
{
    DisableSubscriptionOnError();
    proc_exit(0);
}
AFTER
if (MySubscription->disableonerr && DisableSubscriptionOnError())
    proc_exit(0);

BEFORE (ApplyLoopWrapper)
if (MySubscription->disableonerr)
{
    /* Disable the subscription */
    DisableSubscriptionOnError();
    return;
}
AFTER
if (MySubscription->disableonerr && DisableSubscriptionOnError())
    return;
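
As a rough illustration of the AFTER shape, the following self-contained sketch models the two outcomes a caller can see once the function reports whether it actually disabled the subscription. CatalogRow, ErrorOutcome, disable_subscription_on_error and handle_worker_error are hypothetical names invented for this sketch; the real function updates pg_subscription inside a transaction, which is not modeled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a pg_subscription row. */
typedef struct CatalogRow
{
    bool        enabled;
    bool        disableonerr;
} CatalogRow;

typedef enum ErrorOutcome
{
    OUTCOME_EXIT_CLEANLY,       /* subscription disabled, worker exits */
    OUTCOME_RETHROW             /* fall back to the existing error path */
} ErrorOutcome;

/*
 * Re-check the flag in the catalog and disable the subscription only if
 * it is still set.  Returns true when the subscription was disabled.
 */
static bool
disable_subscription_on_error(CatalogRow *row)
{
    assert(row != NULL);
    if (!row->disableonerr)
        return false;           /* flag was cleared in the interim */
    row->enabled = false;
    return true;
}

/* Caller shape from the AFTER snippets: one condition, two outcomes. */
static ErrorOutcome
handle_worker_error(bool cached_disableonerr, CatalogRow *row)
{
    if (cached_disableonerr && disable_subscription_on_error(row))
        return OUTCOME_EXIT_CLEANLY;
    return OUTCOME_RETHROW;
}
```

The point of the bool return is the middle case: the worker's cached flag says "disable", the catalog says otherwise, and the caller falls through to whatever it did before the patch.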

~~~

Here are a couple more comments:

1. src/backend/replication/logical/worker.c -
DisableSubscriptionOnError, Refactor error handling

(this comment assumes the above gets changed too)

+static void
+DisableSubscriptionOnError(void)
+{
+ Relation rel;
+ bool nulls[Natts_pg_subscription];
+ bool replaces[Natts_pg_subscription];
+ Datum values[Natts_pg_subscription];
+ HeapTuple tup;
+ Form_pg_subscription subform;
+
+ /* Emit the error */
+ EmitErrorReport();
+ /* Abort any active transaction */
+ AbortOutOfAnyTransaction();
+ /* Reset the ErrorContext */
+ FlushErrorState();
+
+ /* Disable the subscription in a fresh transaction */
+ StartTransactionCommand();

If this DisableSubscriptionOnError function decides later that
the 'disableonerr' flag is actually false (i.e. it's NOT going to
disable the subscription after all), then IMO it makes more sense for
the error logging in that case to go through the normal error
processing mechanism, as it does now.

In other words, I thought perhaps the EmitErrorReport/FlushErrorState
code etc. could be moved BELOW the if (!subform->subdisableonerr)
bail-out code?

Please see what you think of my attached POC [1]. It seems neater to
me, and tests are all OK. Maybe I am mistaken...

~~~

2. Commit message - wording

Logical replication apply workers for a subscription can easily get
stuck in an infinite loop of attempting to apply a change,
triggering an error (such as a constraint violation), exiting with
an error written to the subscription worker log, and restarting.

SUGGESTION
"exiting with an error written" --> "exiting with the error written"

------
[1]: peter-v19-poc.diff - POC just to try some of my suggestions above
to make sure all tests still pass OK.

Kind Regards,
Peter Smith.
Fujitsu Australia.

Attachments:

peter-v19-poc.diff.txttext/plain; charset=US-ASCII; name=peter-v19-poc.diff.txt
#83osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Peter Smith (#82)
1 attachment(s)
RE: Optionally automatically disable logical replication subscriptions on error

On Monday, February 21, 2022 2:56 PM Peter Smith <smithpb2250@gmail.com> wrote:

Thanks for addressing my previous comments. Now I have looked at v19.

On Mon, Feb 21, 2022 at 11:25 AM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

On Friday, February 18, 2022 3:27 PM Peter Smith

<smithpb2250@gmail.com> wrote:

Hi. Below are my code review comments for v18.

Thank you for your review !

...

5. src/backend/replication/logical/worker.c -
DisableSubscriptionOnError

+ /*
+ * We would not be here unless this subscription's disableonerr field was
+ * true when our worker began applying changes, but check whether that
+ * field has changed in the interim.
+ */

Apparently, this function might just do nothing if it detects some
situation where the flag was changed somehow, but I'm not 100% sure
that the callers are properly catering for when nothing happens.

IMO it would be better if this function would return true/false to
mean "did disable subscription happen or not?" because that will
give the calling code the chance to check the function return and do
the right thing - e.g. if the caller first thought it should be disabled but then

it turned out it did NOT disable...

I don't think we need to do something more.
After this function, table sync worker and the apply worker just exit.
IMO, we don't need to do additional work for already-disabled
subscription on the caller sides.
It should be sufficient to fulfill the purpose of
DisableSubscriptionOnError or confirm it has been fulfilled.

Hmmm - Yeah, it may be the workers might just exit soon after anyhow as you
say so everything comes out in the wash, but still, I felt for this case when
DisableSubscriptionOnError turned out to do nothing it would be better to exit
via the existing logic. And that is easy to do if the function returns true/false.

For example, changes like below seemed neater code to me. YMMV.

BEFORE (SyncTableStartWrapper):
if (MySubscription->disableonerr)
{
DisableSubscriptionOnError();
proc_exit(0);
}
AFTER
if (MySubscription->disableonerr && DisableSubscriptionOnError())
proc_exit(0);

BEFORE (ApplyLoopWrapper)
if (MySubscription->disableonerr)
{
/* Disable the subscription */
DisableSubscriptionOnError();
return;
}
AFTER
if (MySubscription->disableonerr && DisableSubscriptionOnError())
    return;

Okay, this return value does improve readability.
Fixed.

~~~

Here are a couple more comments:

1. src/backend/replication/logical/worker.c - DisableSubscriptionOnError,
Refactor error handling

(this comment assumes the above gets changed too)

I think those are independent.

+static void
+DisableSubscriptionOnError(void)
+{
+ Relation rel;
+ bool nulls[Natts_pg_subscription];
+ bool replaces[Natts_pg_subscription];
+ Datum values[Natts_pg_subscription];
+ HeapTuple tup;
+ Form_pg_subscription subform;
+
+ /* Emit the error */
+ EmitErrorReport();
+ /* Abort any active transaction */
+ AbortOutOfAnyTransaction();
+ /* Reset the ErrorContext */
+ FlushErrorState();
+
+ /* Disable the subscription in a fresh transaction */
+ StartTransactionCommand();

If this DisableSubscriptionOnError function decides later that the
'disableonerr' flag is actually false (i.e. it's NOT going to disable the subscription
after all), then IMO it makes more sense for the error logging in that case to go
through the normal error processing mechanism, as it does now.

In other words, I thought perhaps the EmitErrorReport/FlushErrorState code etc.
could be moved BELOW the if (!subform->subdisableonerr) bail-out code?

Please see what you think in my attached POC [1]. It seems neater to me, and
tests are all OK. Maybe I am mistaken...

I had a concern that reordering the code this way would have a negative
impact if another error is raised during the call to DisableSubscriptionOnError.

Using a debugger, I raised an error inside this function before the original error
was emitted. As a result, the original error that sent the apply worker down the
DisableSubscriptionOnError path (in my test, a duplication error) vanished.
In this sense, v19 looks safer, and the current order, which handles error recovery
first, looks better to me.

FYI, after the second (debugger-induced) error,
the next apply worker was launched, quickly hit the same type of error,
went down the same path, and disabled the subscription with the error logged.
But it isn't advisable to leave that possibility open.
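
The ordering argument can be modeled outside the server with a small, self-contained sketch. It uses setjmp/longjmp as a stand-in for the error machinery, and every name in it (log_buf, emit, recheck_catalog, original_survives) is hypothetical. It reproduces only the observation above, that an error raised before the original one is emitted leaves no trace of the original; it does not model how the server actually handles nested errors.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

static char log_buf[256];       /* stands in for the server log */
static jmp_buf handler;         /* stands in for the outer error handler */

/* Append one message to the simulated log. */
static void
emit(const char *msg)
{
    size_t      used = strlen(log_buf);

    snprintf(log_buf + used, sizeof(log_buf) - used, "%s\n", msg);
}

/* Hypothetical catalog check that may itself fail with a new error. */
static void
recheck_catalog(bool fail)
{
    if (fail)
        longjmp(handler, 1);
}

/*
 * Handle 'original' in one of the two orders discussed.  Returns true if
 * the original error message ended up in the simulated log.
 */
static bool
original_survives(const char *original, bool emit_first, bool recheck_fails)
{
    assert(original != NULL);
    log_buf[0] = '\0';

    if (setjmp(handler) == 0)
    {
        if (emit_first)
            emit(original);     /* v19 order: report, then do more work */
        recheck_catalog(recheck_fails);
        if (!emit_first)
            emit(original);     /* POC order: report after the bail-out check */
    }
    else
        emit("second error");   /* only the new error is still known here */

    return strstr(log_buf, original) != NULL;
}
```

With emit_first set, the original message is in the log no matter what happens afterwards; with the reversed order it survives only when the recheck itself does not fail.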

~~~

2. Commit message - wording

Logical replication apply workers for a subscription can easily get stuck in an
infinite loop of attempting to apply a change, triggering an error (such as a
constraint violation), exiting with an error written to the subscription worker log,
and restarting.

SUGGESTION
"exiting with an error written" --> "exiting with the error written"

Fixed.

------
[1] peter-v19-poc.diff - POC just to try some of my suggestions above to make
sure all tests still pass ok.

Thanks! I included you as a co-author, because
you shared meaningful patches with me.

Best Regards,
Takamichi Osumi

Attachments:

v20-0001-Optionally-disable-subscriptions-on-error.patchapplication/octet-stream; name=v20-0001-Optionally-disable-subscriptions-on-error.patch
#84Peter Smith
Peter Smith
smithpb2250@gmail.com
In reply to: osumi.takamichi@fujitsu.com (#83)
Re: Optionally automatically disable logical replication subscriptions on error

On Mon, Feb 21, 2022 at 11:44 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

On Monday, February 21, 2022 2:56 PM Peter Smith <smithpb2250@gmail.com> wrote:

Thanks for addressing my previous comments. Now I have looked at v19.

On Mon, Feb 21, 2022 at 11:25 AM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

On Friday, February 18, 2022 3:27 PM Peter Smith

<smithpb2250@gmail.com> wrote:

Hi. Below are my code review comments for v18.

Thank you for your review !

...

5. src/backend/replication/logical/worker.c -
DisableSubscriptionOnError

+ /*
+ * We would not be here unless this subscription's disableonerr field was
+ * true when our worker began applying changes, but check whether that
+ * field has changed in the interim.
+ */

Apparently, this function might just do nothing if it detects some
situation where the flag was changed somehow, but I'm not 100% sure
that the callers are properly catering for when nothing happens.

IMO it would be better if this function would return true/false to
mean "did disable subscription happen or not?" because that will
give the calling code the chance to check the function return and do
the right thing - e.g. if the caller first thought it should be disabled but then

it turned out it did NOT disable...

I don't think we need to do something more.
After this function, table sync worker and the apply worker just exit.
IMO, we don't need to do additional work for already-disabled
subscription on the caller sides.
It should be sufficient to fulfill the purpose of
DisableSubscriptionOnError or confirm it has been fulfilled.

Hmmm - Yeah, it may be the workers might just exit soon after anyhow as you
say so everything comes out in the wash, but still, I felt for this case when
DisableSubscriptionOnError turned out to do nothing it would be better to exit
via the existing logic. And that is easy to do if the function returns true/false.

For example, changes like below seemed neater code to me. YMMV.

BEFORE (SyncTableStartWrapper):
if (MySubscription->disableonerr)
{
DisableSubscriptionOnError();
proc_exit(0);
}
AFTER
if (MySubscription->disableonerr && DisableSubscriptionOnError())
proc_exit(0);

BEFORE (ApplyLoopWrapper)
if (MySubscription->disableonerr)
{
/* Disable the subscription */
DisableSubscriptionOnError();
return;
}
AFTER
if (MySubscription->disableonerr && DisableSubscriptionOnError())
return;

Okay, so this return value works for better readability.
Fixed.

~~~

Here are a couple more comments:

1. src/backend/replication/logical/worker.c - DisableSubscriptionOnError,
Refactor error handling

(this comment assumes the above gets changed too)

I think those are independent.

OK. I was only curious if the change #5 above might cause the error to
be logged 2x, if the DisableSubscriptionOnError returns false.
- firstly, when it logs errors within the function
- secondly, by normal error mechanism when the caller re-throws it.

But, if you are sure that won't happen then it is good news.

+static void
+DisableSubscriptionOnError(void)
+{
+ Relation rel;
+ bool nulls[Natts_pg_subscription];
+ bool replaces[Natts_pg_subscription];
+ Datum values[Natts_pg_subscription];
+ HeapTuple tup;
+ Form_pg_subscription subform;
+
+ /* Emit the error */
+ EmitErrorReport();
+ /* Abort any active transaction */
+ AbortOutOfAnyTransaction();
+ /* Reset the ErrorContext */
+ FlushErrorState();
+
+ /* Disable the subscription in a fresh transaction */
+ StartTransactionCommand();

If this DisableSubscriptionOnError function decides later that actually the
'disableonerr' flag is false (i.e. it's NOT going to disable the subscription
after all) then IMO it makes more sense that the error logging for that case
should just do whatever it is doing now by the normal error processing mechanism.

In other words, I thought perhaps the EmitErrorReport/FlushErrorState code
etc. could be moved to be BELOW the if (!subform->subdisableonerr)
bail-out code?

Please see what you think in my attached POC [1]. It seems neater to me, and
tests are all OK. Maybe I am mistaken...

I had a concern that changing the order of this code would have a negative
impact if another, new error is raised during the call of DisableSubscriptionOnError.

With the debugger, I raised an error in this function before the original error was emitted.
As a result, the original error that sent the apply worker into the path of
DisableSubscriptionOnError (in my test, a duplication error) vanished.
In this sense, v19 looks safer, and the current order, which handles error recovery first,
looks better to me.

FYI, after the 2nd debugger error,
the next new apply worker quickly hit the same type of error,
went into the same path, and disabled the subscription with the log.
But it is not advisable to leave that possibility open.

OK - thanks for checking it.

Will it be better to put some comments about that? Something like --

BEFORE
/* Emit the error */
EmitErrorReport();
/* Abort any active transaction */
AbortOutOfAnyTransaction();
/* Reset the ErrorContext */
FlushErrorState();

/* Disable the subscription in a fresh transaction */
StartTransactionCommand();

AFTER
/* Disable the subscription in a fresh transaction */
AbortOutOfAnyTransaction();
StartTransactionCommand();

/*
* Log the error that caused DisableSubscriptionOnError to be called.
* We do this immediately so that it won't be lost if some other internal
* error occurs in the following code.
*/
EmitErrorReport();
FlushErrorState();

...

------
[1] peter-v19-poc.diff - POC just to try some of my suggestions above to make
sure all tests still pass ok.

Thanks ! I included you as co-author, because
you shared meaningful patches for me.

Thanks!

------
Kind Regards,
Peter Smith.
Fujitsu Australia

#85osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Peter Smith (#84)
1 attachment(s)
RE: Optionally automatically disable logical replication subscriptions on error

On Tuesday, February 22, 2022 7:53 AM Peter Smith <smithpb2250@gmail.com> wrote:

On Mon, Feb 21, 2022 at 11:44 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

On Monday, February 21, 2022 2:56 PM Peter Smith <smithpb2250@gmail.com> wrote:

Thanks for addressing my previous comments. Now I have looked at v19.

On Mon, Feb 21, 2022 at 11:25 AM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

On Friday, February 18, 2022 3:27 PM Peter Smith <smithpb2250@gmail.com> wrote:

Hi. Below are my code review comments for v18.

Thank you for your review !

...

5. src/backend/replication/logical/worker.c -
DisableSubscriptionOnError

+ /*
+ * We would not be here unless this subscription's disableonerr field was
+ * true when our worker began applying changes, but check whether that
+ * field has changed in the interim.
+ */

Apparently, this function might just do nothing if it detects
some situation where the flag was changed somehow, but I'm not
100% sure that the callers are properly catering for when nothing
happens.

IMO it would be better if this function would return true/false
to mean "did disable subscription happen or not?" because that
will give the calling code the chance to check the function
return and do the right thing - e.g. if the caller first thought
it should be disabled but then it turned out it did NOT disable...

I don't think we need to do something more.
After this function, table sync worker and the apply worker just exit.
IMO, we don't need to do additional work for already-disabled
subscription on the caller sides.
It should be sufficient to fulfill the purpose of
DisableSubscriptionOnError or confirm it has been fulfilled.

Hmmm - Yeah, it may be the workers might just exit soon after
anyhow as you say so everything comes out in the wash, but still, I
felt for this case when DisableSubscriptionOnError turned out to do
nothing it would be better to exit via the existing logic. And that
is easy to do if the function returns true/false.

For example, changes like below seemed neater code to me. YMMV.

BEFORE (SyncTableStartWrapper):
if (MySubscription->disableonerr)
{
DisableSubscriptionOnError();
proc_exit(0);
}
AFTER
if (MySubscription->disableonerr && DisableSubscriptionOnError())
proc_exit(0);

BEFORE (ApplyLoopWrapper)
if (MySubscription->disableonerr)
{
/* Disable the subscription */
DisableSubscriptionOnError();
return;
}
AFTER
if (MySubscription->disableonerr && DisableSubscriptionOnError())
return;

Okay, so this return value works for better readability.
Fixed.

~~~

Here are a couple more comments:

1. src/backend/replication/logical/worker.c -
DisableSubscriptionOnError, Refactor error handling

(this comment assumes the above gets changed too)

I think those are independent.

OK. I was only curious if the change #5 above might cause the error to be logged
2x, if the DisableSubscriptionOnError returns false.
- firstly, when it logs errors within the function
- secondly, by normal error mechanism when the caller re-throws it.

But, if you are sure that won't happen then it is good news.

I didn't feel this would become a substantial issue.

When we alter the subscription with disable_on_error = false
after we go into DisableSubscriptionOnError,
we don't disable the subscription in that function.
That means new apply workers are launched repeatedly after that
until we resolve the cause of the error or set disable_on_error = true again.

So, if we confirm that disable_on_error = false in DisableSubscriptionOnError,
it's highly likely that we'll get the same original error again from new apply workers.

This leads to another question: should we suppress the 2nd error (if there is one),
even though it's highly likely that we'll get the same error again from new apply
workers created repeatedly? I wasn't sure that the cleaner log is worth the implementation complexity.

So, kindly let me keep the current code as is.
If someone wants to share his/her opinion on this, please let me know.

+static void
+DisableSubscriptionOnError(void)
+{
+ Relation rel;
+ bool nulls[Natts_pg_subscription];
+ bool replaces[Natts_pg_subscription];
+ Datum values[Natts_pg_subscription];
+ HeapTuple tup;
+ Form_pg_subscription subform;
+
+ /* Emit the error */
+ EmitErrorReport();
+ /* Abort any active transaction */
+ AbortOutOfAnyTransaction();
+ /* Reset the ErrorContext */
+ FlushErrorState();
+
+ /* Disable the subscription in a fresh transaction */
+ StartTransactionCommand();

If this DisableSubscriptionOnError function decides later that
actually the 'disableonerr' flag is false (i.e. it's NOT going to
disable the subscription after all) then IMO it makes more sense that
the error logging for that case should just do whatever it is doing
now by the normal error processing mechanism.

In other words, I thought perhaps the EmitErrorReport/FlushErrorState
code etc. could be moved to be BELOW the if (!subform->subdisableonerr)
bail-out code?

Please see what you think in my attached POC [1]. It seems neater to
me, and tests are all OK. Maybe I am mistaken...

I had a concern that changing the order of this code would have a negative
impact if another, new error is raised during the call of DisableSubscriptionOnError.

With the debugger, I raised an error in this function before the original error was emitted.
As a result, the original error that sent the apply worker into the path of
DisableSubscriptionOnError (in my test, a duplication error) vanished.
In this sense, v19 looks safer, and the current order, which handles error recovery first,
looks better to me.

FYI, after the 2nd debugger error,
the next new apply worker quickly hit the same type of error,
went into the same path, and disabled the subscription with the log.
But it is not advisable to leave that possibility open.

OK - thanks for checking it.

Will it be better to put some comments about that? Something like --

BEFORE
/* Emit the error */
EmitErrorReport();
/* Abort any active transaction */
AbortOutOfAnyTransaction();
/* Reset the ErrorContext */
FlushErrorState();

/* Disable the subscription in a fresh transaction */
StartTransactionCommand();

AFTER
/* Disable the subscription in a fresh transaction */
AbortOutOfAnyTransaction();
StartTransactionCommand();

/*
* Log the error that caused DisableSubscriptionOnError to be called.
* We do this immediately so that it won't be lost if some other internal
* error occurs in the following code.
*/
EmitErrorReport();
FlushErrorState();

I appreciate your suggestion. Yet, I'd like to keep the current order of my patch.
The comment on FlushErrorState says that we are not out of the error subsystem until
we call it, so starting a new transaction before that didn't sound like a good idea.
But I've fixed the comments around this. The indentation of the new comments
was checked with pgindent.
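
The ordering concern discussed above can be sketched with a small stand-alone mock. This is only an illustration under stated assumptions: every name here (pending_error, mock_emit_error_report, mock_catalog_update, and so on) is hypothetical and merely imitates the shape of EmitErrorReport/FlushErrorState, not the real PostgreSQL error subsystem. It shows why reporting the original error before doing further work keeps that error in the log even if a second error is raised.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the error currently held by the error subsystem. */
static const char *pending_error = NULL;

/* Hypothetical stand-in for the server log: last line written and line count. */
static char last_logged[128];
static int  log_lines = 0;

/* Mock of "emit the pending error to the log". */
static void mock_emit_error_report(void)
{
    if (pending_error != NULL)
    {
        snprintf(last_logged, sizeof last_logged, "%s", pending_error);
        log_lines++;
    }
}

/* Mock of "leave the error subsystem": the pending error is forgotten. */
static void mock_flush_error_state(void)
{
    pending_error = NULL;
}

/*
 * Mock of the later catalog work. When inject_failure is true it models a new
 * internal error, which overwrites whatever error was still pending.
 */
static bool mock_catalog_update(bool inject_failure)
{
    if (inject_failure)
    {
        pending_error = "internal error while disabling";
        return false;
    }
    return true;
}

/* Order kept in the patch: report and flush first, then do the risky work. */
static bool disable_report_first(bool inject_failure)
{
    mock_emit_error_report();
    mock_flush_error_state();
    return mock_catalog_update(inject_failure);
}

/* Order tried in the POC: do the risky work first, report afterwards. */
static bool disable_report_last(bool inject_failure)
{
    if (!mock_catalog_update(inject_failure))
        return false;           /* the original error was never logged */
    mock_emit_error_report();
    mock_flush_error_state();
    return true;
}

/* Reset the mock state with a fresh pending error. */
static void reset_mock(const char *err)
{
    pending_error = err;
    last_logged[0] = '\0';
    log_lines = 0;
}
```

With an injected second failure, the report-first variant still leaves the original error in the mock log, while the report-last variant logs nothing, which matches the debugger experiment described above.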

Best Regards,
Takamichi Osumi

Attachments:

v21-0001-Optionally-disable-subscriptions-on-error.patchapplication/octet-stream; name=v21-0001-Optionally-disable-subscriptions-on-error.patch
#86Peter Smith
Peter Smith
smithpb2250@gmail.com
In reply to: osumi.takamichi@fujitsu.com (#85)
Re: Optionally automatically disable logical replication subscriptions on error

On Tue, Feb 22, 2022 at 3:11 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

On Tuesday, February 22, 2022 7:53 AM Peter Smith <smithpb2250@gmail.com> wrote:

On Mon, Feb 21, 2022 at 11:44 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

On Monday, February 21, 2022 2:56 PM Peter Smith <smithpb2250@gmail.com> wrote:

Thanks for addressing my previous comments. Now I have looked at v19.

On Mon, Feb 21, 2022 at 11:25 AM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

On Friday, February 18, 2022 3:27 PM Peter Smith <smithpb2250@gmail.com> wrote:

Hi. Below are my code review comments for v18.

Thank you for your review !

...

5. src/backend/replication/logical/worker.c -
DisableSubscriptionOnError

+ /*
+ * We would not be here unless this subscription's disableonerr field was
+ * true when our worker began applying changes, but check whether that
+ * field has changed in the interim.
+ */

Apparently, this function might just do nothing if it detects
some situation where the flag was changed somehow, but I'm not
100% sure that the callers are properly catering for when nothing
happens.

IMO it would be better if this function would return true/false
to mean "did disable subscription happen or not?" because that
will give the calling code the chance to check the function
return and do the right thing - e.g. if the caller first thought
it should be disabled but then it turned out it did NOT disable...

I don't think we need to do something more.
After this function, table sync worker and the apply worker just exit.
IMO, we don't need to do additional work for already-disabled
subscription on the caller sides.
It should be sufficient to fulfill the purpose of
DisableSubscriptionOnError or confirm it has been fulfilled.

Hmmm - Yeah, it may be the workers might just exit soon after
anyhow as you say so everything comes out in the wash, but still, I
felt for this case when DisableSubscriptionOnError turned out to do
nothing it would be better to exit via the existing logic. And that
is easy to do if the function returns true/false.

For example, changes like below seemed neater code to me. YMMV.

BEFORE (SyncTableStartWrapper):
if (MySubscription->disableonerr)
{
DisableSubscriptionOnError();
proc_exit(0);
}
AFTER
if (MySubscription->disableonerr && DisableSubscriptionOnError())
proc_exit(0);

BEFORE (ApplyLoopWrapper)
if (MySubscription->disableonerr)
{
/* Disable the subscription */
DisableSubscriptionOnError();
return;
}
AFTER
if (MySubscription->disableonerr && DisableSubscriptionOnError())
return;

Okay, so this return value works for better readability.
Fixed.

~~~

Here are a couple more comments:

1. src/backend/replication/logical/worker.c -
DisableSubscriptionOnError, Refactor error handling

(this comment assumes the above gets changed too)

I think those are independent.

OK. I was only curious if the change #5 above might cause the error to be logged
2x, if the DisableSubscriptionOnError returns false.
- firstly, when it logs errors within the function
- secondly, by normal error mechanism when the caller re-throws it.

But, if you are sure that won't happen then it is good news.

I didn't feel this would become a substantial issue.

When we alter the subscription with disable_on_error = false
after we go into DisableSubscriptionOnError,
we don't disable the subscription in that function.
That means new apply workers are launched repeatedly after that
until we resolve the cause of the error or set disable_on_error = true again.

So, if we confirm that disable_on_error = false in DisableSubscriptionOnError,
it's highly likely that we'll get the same original error again from new apply workers.

This leads to another question: should we suppress the 2nd error (if there is one),
even though it's highly likely that we'll get the same error again from new apply
workers created repeatedly? I wasn't sure that the cleaner log is worth the implementation complexity.

So, kindly let me keep the current code as is.
If someone wants to share his/her opinion on this, please let me know.

OK, but is it really correct that this scenario can happen "When we
alter subscription with disable_on_error = false after we go into the
DisableSubscriptionOnError"? Actually, I thought this window may be
much bigger than that - e.g. maybe we changed the option to false at
*any* time after the worker was originally started and the original
option values were read by the GetSubscription function (and that might be
hours/days/weeks ago since it started).

+static void
+DisableSubscriptionOnError(void)
+{
+ Relation rel;
+ bool nulls[Natts_pg_subscription];
+ bool replaces[Natts_pg_subscription];
+ Datum values[Natts_pg_subscription];
+ HeapTuple tup;
+ Form_pg_subscription subform;
+
+ /* Emit the error */
+ EmitErrorReport();
+ /* Abort any active transaction */
+ AbortOutOfAnyTransaction();
+ /* Reset the ErrorContext */
+ FlushErrorState();
+
+ /* Disable the subscription in a fresh transaction */
+ StartTransactionCommand();

If this DisableSubscriptionOnError function decides later that
actually the 'disableonerr' flag is false (i.e. it's NOT going to
disable the subscription after all) then IMO it makes more sense that
the error logging for that case should just do whatever it is doing
now by the normal error processing mechanism.

In other words, I thought perhaps the EmitErrorReport/FlushErrorState
code etc. could be moved to be BELOW the if (!subform->subdisableonerr)
bail-out code?

Please see what you think in my attached POC [1]. It seems neater to
me, and tests are all OK. Maybe I am mistaken...

I had a concern that changing the order of this code would have a negative
impact if another, new error is raised during the call of DisableSubscriptionOnError.

With the debugger, I raised an error in this function before the original error was emitted.
As a result, the original error that sent the apply worker into the path of
DisableSubscriptionOnError (in my test, a duplication error) vanished.
In this sense, v19 looks safer, and the current order, which handles error recovery first,
looks better to me.

FYI, after the 2nd debugger error,
the next new apply worker quickly hit the same type of error,
went into the same path, and disabled the subscription with the log.
But it is not advisable to leave that possibility open.

OK - thanks for checking it.

Will it be better to put some comments about that? Something like --

BEFORE
/* Emit the error */
EmitErrorReport();
/* Abort any active transaction */
AbortOutOfAnyTransaction();
/* Reset the ErrorContext */
FlushErrorState();

/* Disable the subscription in a fresh transaction */
StartTransactionCommand();

AFTER
/* Disable the subscription in a fresh transaction */
AbortOutOfAnyTransaction();
StartTransactionCommand();

/*
* Log the error that caused DisableSubscriptionOnError to be called.
* We do this immediately so that it won't be lost if some other internal
* error occurs in the following code.
*/
EmitErrorReport();
FlushErrorState();

I appreciate your suggestion. Yet, I'd like to keep the current order of my patch.
The comment on FlushErrorState says that we are not out of the error subsystem until
we call it, so starting a new transaction before that didn't sound like a good idea.
But I've fixed the comments around this. The indentation of the new comments
was checked with pgindent.

OK.

======

Here are a couple more review comments for v21.

~~~

1. worker.c - comment

+ subform = (Form_pg_subscription) GETSTRUCT(tup);
+
+ /*
+ * We would not be here unless this subscription's disableonerr field was
+ * true, but check whether that field has changed in the interim.
+ */
+ if (!subform->subdisableonerr)
+ {
+ heap_freetuple(tup);
+ table_close(rel, RowExclusiveLock);
+ CommitTransactionCommand();
+ return false;
+ }

I felt that comment belongs above the subform assignment because that
is the only reason we are getting the subform again.

~~

2. worker.c - subform->oid

+ /* Notify the subscription will be no longer valid */
+ ereport(LOG,
+ errmsg("logical replication subscription \"%s\" will be disabled due to an error",
+    MySubscription->name));
+
+ LockSharedObject(SubscriptionRelationId, subform->oid, 0, AccessExclusiveLock);

Can't we just use MySubscription->oid here? We really only needed that
subform to get new option values.

~~

3. worker.c - whitespace

Your pgindent run has also changed some whitespace for parts of worker.c
that are completely unrelated to this patch. You might want to revert
those changes.

------
Kind Regards,
Peter Smith.
Fujitsu Australia

#87tanghy.fnst@fujitsu.com
tanghy.fnst@fujitsu.com
tanghy.fnst@fujitsu.com
In reply to: osumi.takamichi@fujitsu.com (#85)
RE: Optionally automatically disable logical replication subscriptions on error

Hi Osumi-san,

I have a comment on v21 patch.

I wonder if we really need subscription s2 in 028_disable_on_error.pl. I think
for subscription s2, we only tested some normal cases (which could be tested with s1),
and didn't test any error case, which means it wouldn't be automatically disabled.
Is there any reason for creating subscription s2?

Regards,
Tang

#88osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: tanghy.fnst@fujitsu.com (#87)
RE: Optionally automatically disable logical replication subscriptions on error

On Wednesday, February 23, 2022 6:52 PM Tang, Haiying/唐 海英 <tanghy.fnst@fujitsu.com> wrote:

I have a comment on v21 patch.

I wonder if we really need subscription s2 in 028_disable_on_error.pl. I think for
subscription s2, we only tested some normal cases (which could be tested with
s1), and didn't test any error case, which means it wouldn't be automatically
disabled.
Is there any reason for creating subscription s2?

Hi, thank you for your review !

It was there to check that disabling one subscription
has no impact/influence on the other subscription.

*But*, when I look at the past tests added for other options (e.g. streaming,
two_phase), we don't have this kind of test that I have for the disable_on_error patch.
Therefore, I'd like to fix the test as you suggested in my next version.

Best Regards,
Takamichi Osumi

#89Masahiko Sawada
Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Smith (#86)
Re: Optionally automatically disable logical replication subscriptions on error

On Tue, Feb 22, 2022 at 3:03 PM Peter Smith <smithpb2250@gmail.com> wrote:

~~~

1. worker.c - comment

+ subform = (Form_pg_subscription) GETSTRUCT(tup);
+
+ /*
+ * We would not be here unless this subscription's disableonerr field was
+ * true, but check whether that field has changed in the interim.
+ */
+ if (!subform->subdisableonerr)
+ {
+ heap_freetuple(tup);
+ table_close(rel, RowExclusiveLock);
+ CommitTransactionCommand();
+ return false;
+ }

I felt that comment belongs above the subform assignment because that
is the only reason we are getting the subform again.

IIUC if we return false here, the same error will be emitted twice.
And I'm not sure this check is really necessary. It would work only
when the subdisableonerr is set to false concurrently, but doesn't
work for the opposite cases. I think we can check
MySubscription->disableonerr and then just update the tuple.
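
The double-emission concern can be illustrated with a tiny mock. It is a sketch under assumptions, not the patch's code: all names are hypothetical, and it assumes that when the helper returns false the caller falls through to a normal error path that logs the error again, as described above.

```c
#include <assert.h>
#include <stdbool.h>

/* Number of times the same error reached the (mock) server log. */
static int emit_count = 0;

/* Mock of "write the current error to the log". */
static void mock_emit_error_report(void)
{
    emit_count++;
}

/*
 * Mock of a helper shaped like the v21 function: it logs the error first and
 * only afterwards re-checks the flag stored in the catalog.
 * Returns true when the subscription was (notionally) disabled.
 */
static bool mock_disable_subscription_on_error(bool catalog_flag)
{
    mock_emit_error_report();
    if (!catalog_flag)
        return false;           /* flag was cleared concurrently: bail out */
    return true;
}

/*
 * Mock of the caller's error handling. cached_flag models the worker's
 * in-memory copy of the option, catalog_flag the value re-read from the
 * catalog inside the helper.
 */
static void mock_worker_error_path(bool cached_flag, bool catalog_flag)
{
    if (cached_flag && mock_disable_subscription_on_error(catalog_flag))
        return;

    /* Models the re-throw: the normal error path logs the error. */
    mock_emit_error_report();
}
```

In this model the error is logged once when the subscription really is disabled, once when the option was never set, and twice only in the case where the cached flag is true but the re-read flag is false.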

Here are some comments:

Why do we need SyncTableStartWrapper() and ApplyLoopWrapper()?

---
+       /*
+        * Log the error that caused DisableSubscriptionOnError to be called. We
+        * do this immediately so that it won't be lost if some other internal
+        * error occurs in the following code.
+        */
+       EmitErrorReport();
+       AbortOutOfAnyTransaction();
+       FlushErrorState();

Do we need to hold interrupts during cleanup here?

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#90Amit Kapila
Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#89)
Re: Optionally automatically disable logical replication subscriptions on error

On Thu, Feb 24, 2022 at 1:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Here are some comments:

Why do we need SyncTableStartWrapper() and ApplyLoopWrapper()?

I have given this comment to move the related code to separate
functions to slightly simplify ApplyWorkerMain() code but if you don't
like we can move it back. I am not sure I like the new function names
in the patch though.

---
+       /*
+        * Log the error that caused DisableSubscriptionOnError to be called. We
+        * do this immediately so that it won't be lost if some other internal
+        * error occurs in the following code.
+        */
+       EmitErrorReport();
+       AbortOutOfAnyTransaction();
+       FlushErrorState();

Do we need to hold interrupts during cleanup here?

I think so. We do prevent interrupts via
HOLD_INTERRUPTS/RESUME_INTERRUPTS during cleanup.
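
As a rough illustration of what holding interrupts during cleanup means, here is a stand-alone mock of a holdoff counter. It is a simplified sketch: the MOCK_ macros and mock_ functions are hypothetical and only imitate the general idea behind HOLD_INTERRUPTS/RESUME_INTERRUPTS, namely that a pending interrupt is not serviced while the counter is non-zero.

```c
#include <assert.h>
#include <stdbool.h>

/* Mock holdoff counter: interrupts are deferred while this is non-zero. */
static int  holdoff_count = 0;
static bool interrupt_pending = false;
static int  interrupts_serviced = 0;

#define MOCK_HOLD_INTERRUPTS()   (holdoff_count++)
#define MOCK_RESUME_INTERRUPTS() \
    do { assert(holdoff_count > 0); holdoff_count--; } while (0)

/* Service a pending interrupt only when nothing is holding interrupts off. */
static void mock_check_for_interrupts(void)
{
    if (interrupt_pending && holdoff_count == 0)
    {
        interrupt_pending = false;
        interrupts_serviced++;
    }
}

/*
 * Mock cleanup section. An interrupt "arrives" in the middle of it; because
 * interrupts are held, it stays pending until after the section ends.
 * Returns how many interrupts were serviced while cleanup was running.
 */
static int mock_error_cleanup(void)
{
    int serviced_before = interrupts_serviced;
    int serviced_during;

    MOCK_HOLD_INTERRUPTS();
    interrupt_pending = true;       /* interrupt arrives during cleanup */
    mock_check_for_interrupts();    /* deferred: holdoff_count is 1 */
    /* emitting the error, aborting and flushing would happen here */
    serviced_during = interrupts_serviced - serviced_before;
    MOCK_RESUME_INTERRUPTS();

    return serviced_during;
}
```

After the cleanup section returns, the next interrupt check services the deferred interrupt, so nothing is lost; it is only delayed past the critical cleanup steps.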

--
With Regards,
Amit Kapila.

#91Masahiko Sawada
Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#90)
Re: Optionally automatically disable logical replication subscriptions on error

On Thu, Feb 24, 2022 at 8:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Feb 24, 2022 at 1:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Here are some comments:

Why do we need SyncTableStartWrapper() and ApplyLoopWrapper()?

I have given this comment to move the related code to separate
functions to slightly simplify ApplyWorkerMain() code but if you don't
like we can move it back. I am not sure I like the new function names
in the patch though.

Okay, I'm fine with moving this code but perhaps we can find a better
function name as "Wrapper" seems slightly odd to me. For example,
start_table_sync_start() and start_apply_changes() or something (it
seems we use the snake case for static functions in worker.c).

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#92Amit Kapila
Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#91)
Re: Optionally automatically disable logical replication subscriptions on error

On Thu, Feb 24, 2022 at 6:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Feb 24, 2022 at 8:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Feb 24, 2022 at 1:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Here are some comments:

Why do we need SyncTableStartWrapper() and ApplyLoopWrapper()?

I have given this comment to move the related code to separate
functions to slightly simplify ApplyWorkerMain() code but if you don't
like we can move it back. I am not sure I like the new function names
in the patch though.

Okay, I'm fine with moving this code but perhaps we can find a better
function name as "Wrapper" seems slightly odd to me.

Agreed.

For example,
start_table_sync_start() and start_apply_changes() or something (it
seems we use the snake case for static functions in worker.c).

I am fine with something on these lines, how about start_table_sync()
and start_apply() respectively?

--
With Regards,
Amit Kapila.

#93osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Amit Kapila (#92)
1 attachment(s)
RE: Optionally automatically disable logical replication subscriptions on error

On Friday, February 25, 2022 12:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Feb 24, 2022 at 6:30 PM Masahiko Sawada
<sawada.mshk@gmail.com> wrote:

On Thu, Feb 24, 2022 at 8:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Feb 24, 2022 at 1:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Here are some comments:

Why do we need SyncTableStartWrapper() and ApplyLoopWrapper()?

I have given this comment to move the related code to separate
functions to slightly simplify ApplyWorkerMain() code but if you
don't like we can move it back. I am not sure I like the new
function names in the patch though.

Okay, I'm fine with moving this code but perhaps we can find a better
function name as "Wrapper" seems slightly odd to me.

Agreed.

For example,
start_table_sync_start() and start_apply_changes() or something (it
seems we use the snake case for static functions in worker.c).

I am fine with something on these lines, how about start_table_sync() and
start_apply() respectively?

Adopted. (If we come up with better names, we can change those then)

Kindly have a look at the attached v22.
It also incorporates improvements to the TAP test,
a refinement of the DisableSubscriptionOnError function, and so on.

Best Regards,
Takamichi Osumi

Attachments:

v22-0001-Optionally-disable-subscriptions-on-error.patchapplication/octet-stream; name=v22-0001-Optionally-disable-subscriptions-on-error.patch
#94osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Amit Kapila (#90)
RE: Optionally automatically disable logical replication subscriptions on error

On Thursday, February 24, 2022 8:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Feb 24, 2022 at 1:20 PM Masahiko Sawada
<sawada.mshk@gmail.com> wrote:

+       /*
+        * Log the error that caused DisableSubscriptionOnError to be called. We
+        * do this immediately so that it won't be lost if some other internal
+        * error occurs in the following code.
+        */
+       EmitErrorReport();
+       AbortOutOfAnyTransaction();
+       FlushErrorState();

Do we need to hold interrupts during cleanup here?

I think so. We do prevent interrupts via
HOLD_INTERRUPTS/RESUME_INTERRUPTS during cleanup.

Fixed.

Kindly have a look at v22 shared in [1].

[1]: /messages/by-id/TYCPR01MB8373D9B26F988307B0D3FE20ED3E9@TYCPR01MB8373.jpnprd01.prod.outlook.com

Best Regards,
Takamichi Osumi

#95osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Masahiko Sawada (#89)
RE: Optionally automatically disable logical replication subscriptions on error

On Thursday, February 24, 2022 4:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Feb 22, 2022 at 3:03 PM Peter Smith <smithpb2250@gmail.com> wrote:

~~~

1. worker.c - comment

+ subform = (Form_pg_subscription) GETSTRUCT(tup);
+
+ /*
+ * We would not be here unless this subscription's disableonerr field was
+ * true, but check whether that field has changed in the interim.
+ */
+ if (!subform->subdisableonerr)
+ {
+ heap_freetuple(tup);
+ table_close(rel, RowExclusiveLock);
+ CommitTransactionCommand();
+ return false;
+ }

I felt that comment belongs above the subform assignment because that
is the only reason we are getting the subform again.

IIUC if we return false here, the same error will be emitted twice.
And I'm not sure this check is really necessary. It would work only when the
subdisableonerr is set to false concurrently, but doesn't work for the opposite
cases. I think we can check
MySubscription->disableonerr and then just update the tuple.

Addressed. I followed your advice and deleted the check.

Kindly have a look at v22 shared in [1].

[1]: /messages/by-id/TYCPR01MB8373D9B26F988307B0D3FE20ED3E9@TYCPR01MB8373.jpnprd01.prod.outlook.com

Best Regards,
Takamichi Osumi

#96osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: tanghy.fnst@fujitsu.com (#87)
RE: Optionally automatically disable logical replication subscriptions on error

On Wednesday, February 23, 2022 6:52 PM Tang, Haiying/唐 海英 <tanghy.fnst@fujitsu.com> wrote:

I have a comment on v21 patch.

I wonder if we really need subscription s2 in 028_disable_on_error.pl. I think for
subscription s2, we only tested some normal cases (which could be tested with
s1), and didn't test any error case, which means it wouldn't be automatically
disabled.
Is there any reason for creating subscription s2?

Removed the subscription s2.

This has reduced the amount of code in the TAP tests.
Kindly have a look at the v22 shared in [1].

[1]: /messages/by-id/TYCPR01MB8373D9B26F988307B0D3FE20ED3E9@TYCPR01MB8373.jpnprd01.prod.outlook.com

Best Regards,
Takamichi Osumi

#97osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Peter Smith (#86)
RE: Optionally automatically disable logical replication subscriptions on error

On Tuesday, February 22, 2022 3:03 PM Peter Smith <smithpb2250@gmail.com> wrote:

Here are a couple more review comments for v21.

~~~

1. worker.c - comment

+ subform = (Form_pg_subscription) GETSTRUCT(tup);
+
+ /*
+ * We would not be here unless this subscription's disableonerr field was
+ * true, but check whether that field has changed in the interim.
+ */
+ if (!subform->subdisableonerr)
+ {
+ heap_freetuple(tup);
+ table_close(rel, RowExclusiveLock);
+ CommitTransactionCommand();
+ return false;
+ }

I felt that comment belongs above the subform assignment because that is the
only reason we are getting the subform again.

This part has been removed along with the modification
that we just disable the subscription in the main processing
when we get an error.

~~

2. worker.c - subform->oid

+ /* Notify the subscription will be no longer valid */
+ ereport(LOG,
+ errmsg("logical replication subscription \"%s\" will be disabled due to an error",
+    MySubscription->name));
+
+ LockSharedObject(SubscriptionRelationId, subform->oid, 0, AccessExclusiveLock);

Can't we just use MySubscription->oid here? We really only needed that
subform to get new option values.

Fixed.

~~

3. worker.c - whitespace

Your pgindent run has also changed some whitespace for parts of worker.c that
are completely unrelated to this patch. You might want to revert those changes.

Fixed.

Kindly have a look at v22 that took in all your comments.
It's shared in [1].

[1]: /messages/by-id/TYCPR01MB8373D9B26F988307B0D3FE20ED3E9@TYCPR01MB8373.jpnprd01.prod.outlook.com

Best Regards,
Takamichi Osumi

#98Peter Smith
Peter Smith
smithpb2250@gmail.com
In reply to: osumi.takamichi@fujitsu.com (#93)
Re: Optionally automatically disable logical replication subscriptions on error

Please see below my review comments for v22.

======

1. Commit message

"table sync worker" -> "tablesync worker"

~~~

2. doc/src/sgml/catalogs.sgml

+      <para>
+       If true, the subscription will be disabled when subscription
+       workers detect any errors
+      </para></entry>

It felt a bit strange to say "subscription" 2x in the sentence, but I
am not sure how to improve it. Maybe like below?

BEFORE
If true, the subscription will be disabled when subscription workers
detect any errors

SUGGESTED
If true, the subscription will be disabled if one of its workers
detects an error

~~~

3. src/backend/replication/logical/worker.c - DisableSubscriptionOnError

@@ -2802,6 +2803,69 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
}

 /*
+ * Disable the current subscription, after error recovery processing.
+ */
+static void
+DisableSubscriptionOnError(void)

I thought the "after error recovery processing" part was a bit generic
and did not really say what it was doing.

BEFORE
Disable the current subscription, after error recovery processing.
SUGGESTED
Disable the current subscription, after logging the error that caused
this function to be called.

~~~

4. src/backend/replication/logical/worker.c - start_apply

+ if (MySubscription->disableonerr)
+ {
+ DisableSubscriptionOnError();
+ return;
+ }
+
+ MemoryContextSwitchTo(ecxt);
+ PG_RE_THROW();
+ }
+ PG_END_TRY();

The current code looks correct, but I felt it is a bit tricky to
easily see the execution path after the return.

Since it will effectively just exit anyhow, I think it will be simpler
to do that explicitly right here instead of the 'return'. This
will also make the code consistent with the same 'disableonerr' logic
in start_table_sync.

SUGGESTION
if (MySubscription->disableonerr)
{
DisableSubscriptionOnError();
proc_exit(0);
}

~~~

5. src/bin/pg_dump/pg_dump.c

@@ -4463,6 +4473,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
appendPQExpBufferStr(query, ", two_phase = on");

+ if (strcmp(subinfo->subdisableonerr, "f") != 0)
+ appendPQExpBufferStr(query, ", disable_on_error = true");
+

Although the code is correct, I think it would be more natural to set
this option as true when the user wants it true. e.g. check for "t"
same as 'subbinary' is doing. This way, even if there was some
unknown/corrupted value the code would do nothing, which is the
behaviour you want...

SUGGESTION
if (strcmp(subinfo->subdisableonerr, "t") == 0)
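
To illustrate the difference between the two checks, here is a minimal standalone sketch. It is not pg_dump code: the helper names dump_option_unless_f and dump_option_if_t are hypothetical, and the only assumption is that the catalog value arrives as a short string such as "t" or "f".

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper mirroring the v22 check: the option is emitted for
 * any value other than "f", including unexpected ones.
 */
bool
dump_option_unless_f(const char *val)
{
	assert(val != NULL);
	return strcmp(val, "f") != 0;
}

/*
 * Hypothetical helper mirroring the suggested check: the option is emitted
 * only when the value is exactly "t".
 */
bool
dump_option_if_t(const char *val)
{
	assert(val != NULL);
	return strcmp(val, "t") == 0;
}
```

For "t" and "f" both helpers agree. For any other string (say "x") only the first one returns true, which is the case the suggestion is meant to avoid.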

~~~

6. src/include/catalog/pg_subscription.h

@@ -67,6 +67,9 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW

char subtwophasestate; /* Stream two-phase transactions */

+ bool subdisableonerr; /* True if occurrence of apply errors
+ * should disable the subscription */

The comment seems not quite right because it's not just about apply
errors. E.g. I think any error in tablesync will cause disablement
too.

BEFORE
True if occurrence of apply errors should disable the subscription
SUGGESTED
True if a worker error should cause the subscription to be disabled

~~~

7. src/test/regress/sql/subscription.sql - whitespace

+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, disable_on_error = false);
+
+\dRs+
+
+ALTER SUBSCRIPTION regress_testsub SET (disable_on_error = true);
+
+\dRs+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
+

I think there should be a blank line after that last \dRs+ just like the
other one, because it belongs logically with the code above it, not
with the ALTER slot_name.

~~~

8. src/test/subscription/t/028_disable_on_error.pl - filename

The 028 number needs to be bumped because there is already a TAP test
called 028 now

~~~

9. src/test/subscription/t/028_disable_on_error.pl - missing test

There was no test case for the last combination where the user corrects
the apply worker problem: E.g. After a previous error/disable of the
subscriber, remove the index, publish the inserts again, and check
they get applied properly.

------
Kind Regards,
Peter Smith.
Fujitsu Australia

#99osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: osumi.takamichi@fujitsu.com (#93)
1 attachment(s)
RE: Optionally automatically disable logical replication subscriptions on error

On Friday, February 25, 2022 9:45 PM osumi.takamichi@fujitsu.com <osumi.takamichi@fujitsu.com> wrote:

Kindly have a look at the attached v22.
It has incorporated other improvements to the TAP test, refinement of the
DisableSubscriptionOnError function and so on.

The recent commit (7a85073) has changed the subscription workers'
error handling. So, I rebased my disable_on_error patch first
for anyone who is interested in reviewing it.

I'll incorporate incoming comments for v22 in my next version.

Best Regards,
Takamichi Osumi

Attachments:

v23-0001-Optionally-disable-subscriptions-on-error.patchapplication/octet-stream; name=v23-0001-Optionally-disable-subscriptions-on-error.patch
#100osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Peter Smith (#98)
1 attachment(s)
RE: Optionally automatically disable logical replication subscriptions on error

On Tuesday, March 1, 2022 9:49 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please see below my review comments for v22.

======

1. Commit message

"table sync worker" -> "tablesync worker"

Fixed.

~~~

2. doc/src/sgml/catalogs.sgml

+      <para>
+       If true, the subscription will be disabled when subscription
+       workers detect any errors
+      </para></entry>

It felt a bit strange to say "subscription" 2x in the sentence, but I am not sure
how to improve it. Maybe like below?

BEFORE
If true, the subscription will be disabled when subscription workers detect any
errors

SUGGESTED
If true, the subscription will be disabled if one of its workers detects an error

Fixed.

~~~

3. src/backend/replication/logical/worker.c - DisableSubscriptionOnError

@@ -2802,6 +2803,69 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
}

/*
+ * Disable the current subscription, after error recovery processing.
+ */
+static void
+DisableSubscriptionOnError(void)

I thought the "after error recovery processing" part was a bit generic and did not
really say what it was doing.

BEFORE
Disable the current subscription, after error recovery processing.
SUGGESTED
Disable the current subscription, after logging the error that caused this
function to be called.

Fixed.

~~~

4. src/backend/replication/logical/worker.c - start_apply

+ if (MySubscription->disableonerr)
+ {
+ DisableSubscriptionOnError();
+ return;
+ }
+
+ MemoryContextSwitchTo(ecxt);
+ PG_RE_THROW();
+ }
+ PG_END_TRY();

The current code looks correct, but I felt it is a bit tricky to easily see the
execution path after the return.

Since it will effectively just exit anyhow, I think it will be simpler to do that
explicitly right here instead of the 'return'. This will also make the code
consistent with the same 'disableonerr' logic in start_table_sync.

SUGGESTION
if (MySubscription->disableonerr)
{
DisableSubscriptionOnError();
proc_exit(0);
}

Fixed.

~~~

5. src/bin/pg_dump/pg_dump.c

@@ -4463,6 +4473,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
appendPQExpBufferStr(query, ", two_phase = on");

+ if (strcmp(subinfo->subdisableonerr, "f") != 0)
+ appendPQExpBufferStr(query, ", disable_on_error = true");
+

Although the code is correct, I think it would be more natural to set this option
as true when the user wants it true. e.g. check for "t"
same as 'subbinary' is doing. This way, even if there was some
unknown/corrupted value the code would do nothing, which is the behaviour
you want...

SUGGESTION
if (strcmp(subinfo->subdisableonerr, "t") == 0)

Fixed.

~~~

6. src/include/catalog/pg_subscription.h

@@ -67,6 +67,9 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW

char subtwophasestate; /* Stream two-phase transactions */

+ bool subdisableonerr; /* True if occurrence of apply errors
+ * should disable the subscription */

The comment seems not quite right because it's not just about apply errors. E.g.
I think any error in tablesync will cause disablement too.

BEFORE
True if occurrence of apply errors should disable the subscription
SUGGESTED
True if a worker error should cause the subscription to be disabled

Fixed.

~~~

7. src/test/regress/sql/subscription.sql - whitespace

+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, disable_on_error = false);
+
+\dRs+
+
+ALTER SUBSCRIPTION regress_testsub SET (disable_on_error = true);
+
+\dRs+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
+

I think there should be a blank line after that last \dRs+ just like the other one,
because it belongs logically with the code above it, not with the ALTER
slot_name.

Fixed.

~~~

8. src/test/subscription/t/028_disable_on_error.pl - filename

The 028 number needs to be bumped because there is already a TAP test
called 028 now

This is already done in v22, so I've skipped this.

~~~

9. src/test/subscription/t/028_disable_on_error.pl - missing test

There was no test case for the last combination where the user corrects the
apply worker problem: E.g. After a previous error/disable of the subscriber,
remove the index, publish the inserts again, and check they get applied
properly.

Fixed.

Attached the updated version v24.

Best Regards,
Takamichi Osumi

Attachments:

v24-0001-Optionally-disable-subscriptions-on-error.patchapplication/octet-stream; name=v24-0001-Optionally-disable-subscriptions-on-error.patch
#101Peter Smith
Peter Smith
smithpb2250@gmail.com
In reply to: osumi.takamichi@fujitsu.com (#100)
Re: Optionally automatically disable logical replication subscriptions on error

Please see below my review comments for v24.

======

1. src/backend/replication/logical/worker.c - start_table_sync

+ /* Report the worker failed during table synchronization */
+ pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_worker());

(This review comment is just FYI in case you did not do this deliberately)

FYI, you didn't really need to call am_tablesync_worker() here because
it is already asserted for the sync phase that it MUST be a tablesync
worker.

OTOH, IMO it documents the purpose of the parm so if it was deliberate
then that is OK too.

~~~

2. src/backend/replication/logical/worker.c - start_table_sync

+ PG_CATCH();
+ {
+ /*
+ * Abort the current transaction so that we send the stats message in
+ * an idle state.
+ */
+ AbortOutOfAnyTransaction();
+
+ /* Report the worker failed during table synchronization */
+ pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_worker());
+

[Maybe you will say that this review comment is unrelated to
disable_on_err, but since this is a totally new/refactored function
then I think maybe there is no problem to make this change at the same
time. Anyway there is no function change, it is just rearranging some
comments.]

I felt the separation of those 2 statements and comments makes that
code less clean than it could/should be. IMO they should be grouped
together.

SUGGESTED
/*
* Report the worker failed during table synchronization. Abort the
* current transaction so that the stats message is sent in an idle
* state.
*/
AbortOutOfAnyTransaction();
pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_worker());

~~~

3. src/backend/replication/logical/worker.c - start_apply

+ /*
+ * Abort the current transaction so that we send the stats message in
+ * an idle state.
+ */
+ AbortOutOfAnyTransaction();
+
+ /* Report the worker failed during the application of the change */
+ pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_worker());

Same comment as #2 above, but this code fragment is in start_apply function.

~~~

4. src/test/subscription/t/029_disable_on_error.pl - comment

+# Drop the unique index on the sub and re-enabled the subscription.
+# Then, confirm that we have finished the apply.

SUGGESTED (tweak the comment wording)
# Drop the unique index on the sub and re-enable the subscription.
# Then, confirm that the previously failing insert was applied OK.

------
Kind Regards,
Peter Smith.
Fujitsu Australia

#102osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Peter Smith (#101)
1 attachment(s)
RE: Optionally automatically disable logical replication subscriptions on error

On Wednesday, March 2, 2022 9:34 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please see below my review comments for v24.

Thank you for checking my patch!

======

1. src/backend/replication/logical/worker.c - start_table_sync

+ /* Report the worker failed during table synchronization */
+ pgstat_report_subscription_error(MySubscription->oid,
+ !am_tablesync_worker());

(This review comment is just FYI in case you did not do this deliberately)

FYI, you didn't really need to call am_tablesync_worker() here because it is
already asserted for the sync phase that it MUST be a tablesync worker.

OTOH, IMO it documents the purpose of the parm so if it was deliberate then
that is OK too.

Fixed.

~~~

2. src/backend/replication/logical/worker.c - start_table_sync

+ PG_CATCH();
+ {
+ /*
+ * Abort the current transaction so that we send the stats message in
+ * an idle state.
+ */
+ AbortOutOfAnyTransaction();
+
+ /* Report the worker failed during table synchronization */
+ pgstat_report_subscription_error(MySubscription->oid,
+ !am_tablesync_worker());
+

[Maybe you will say that this review comment is unrelated to disable_on_err,
but since this is a totally new/refactored function then I think maybe there is no
problem to make this change at the same time. Anyway there is no function
change, it is just rearranging some comments.]

I felt the separation of those 2 statements and comments makes that code less
clean than it could/should be. IMO they should be grouped together.

SUGGESTED
/*
* Report the worker failed during table synchronization. Abort the
* current transaction so that the stats message is sent in an idle
* state.
*/
AbortOutOfAnyTransaction();
pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_worker());

I think this is OK. Thank you for the suggestion. Fixed.

~~~

3. src/backend/replication/logical/worker.c - start_apply

+ /*
+ * Abort the current transaction so that we send the stats message in
+ * an idle state.
+ */
+ AbortOutOfAnyTransaction();
+
+ /* Report the worker failed during the application of the change */
+ pgstat_report_subscription_error(MySubscription->oid,
+ !am_tablesync_worker());

Same comment as #2 above, but this code fragment is in start_apply function.

Fixed.

~~~

4. src/test/subscription/t/029_disable_on_error.pl - comment

+# Drop the unique index on the sub and re-enabled the subscription.
+# Then, confirm that we have finished the apply.

SUGGESTED (tweak the comment wording)
# Drop the unique index on the sub and re-enable the subscription.
# Then, confirm that the previously failing insert was applied OK.

Fixed.

Best Regards,
Takamichi Osumi

Attachments:

v25-0001-Optionally-disable-subscriptions-on-error.patchapplication/octet-stream; name=v25-0001-Optionally-disable-subscriptions-on-error.patch
#103Masahiko Sawada
Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Smith (#101)
Re: Optionally automatically disable logical replication subscriptions on error

On Wed, Mar 2, 2022 at 9:34 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please see below my review comments for v24.

======

1. src/backend/replication/logical/worker.c - start_table_sync

+ /* Report the worker failed during table synchronization */
+ pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_worker());

(This review comment is just FYI in case you did not do this deliberately)

FYI, you didn't really need to call am_tablesync_worker() here because
it is already asserted for the sync phase that it MUST be a tablesync
worker.

OTOH, IMO it documents the purpose of the parm so if it was deliberate
then that is OK too.

~~~

2. src/backend/replication/logical/worker.c - start_table_sync

+ PG_CATCH();
+ {
+ /*
+ * Abort the current transaction so that we send the stats message in
+ * an idle state.
+ */
+ AbortOutOfAnyTransaction();
+
+ /* Report the worker failed during table synchronization */
+ pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_worker());
+

[Maybe you will say that this review comment is unrelated to
disable_on_err, but since this is a totally new/refactored function
then I think maybe there is no problem to make this change at the same
time. Anyway there is no function change, it is just rearranging some
comments.]

I felt the separation of those 2 statements and comments makes that
code less clean than it could/should be. IMO they should be grouped
together.

SUGGESTED
/*
* Report the worker failed during table synchronization. Abort the
* current transaction so that the stats message is sent in an idle
* state.
*/
AbortOutOfAnyTransaction();
pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_worker());

After more thoughts, should we do both AbortOutOfAnyTransaction() and
error message handling while holding interrupts? That is,

HOLD_INTERRUPTS();
EmitErrorReport();
FlushErrorState();
AbortOutOfAnyTransaction();
RESUME_INTERRUPTS();
pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_worker());

I think it's better that we do the cleanup first and then do other work
such as sending the message to the stats collector and updating the
catalog.

Here are some comments on v24 patch:

+        /* Look up our subscription in the catalogs */
+        tup = SearchSysCacheCopy2(SUBSCRIPTIONNAME, MyDatabaseId,
+                                  CStringGetDatum(MySubscription->name));

s/catalogs/catalog/

Why don't we use SUBSCRIPTIONOID with MySubscription->oid?

---
+        if (!HeapTupleIsValid(tup))
+                ereport(ERROR,
+                                errcode(ERRCODE_UNDEFINED_OBJECT),
+                                errmsg("subscription \"%s\" does not exist",
+                                           MySubscription->name));

I think we should use elog() here rather than ereport() since it's a
should-not-happen error.

---
+        /* Notify the subscription will be no longer valid */

I'd suggest rephrasing it to like "Notify the subscription will be
disabled". (the subscription is still valid actually, but just
disabled).

---
+        /* Notify the subscription will be no longer valid */
+        ereport(LOG,
+                        errmsg("logical replication subscription \"%s\" will be disabled due to an error",
+                                   MySubscription->name));
+

I think we can report the log at the end of this function rather than
during the transaction.

---
+my $cmd = qq(
+CREATE TABLE tbl (i INT);
+ALTER TABLE tbl REPLICA IDENTITY FULL;
+CREATE INDEX tbl_idx ON tbl(i));

I think we don't need to set REPLICA IDENTITY FULL on this table since
there is no update/delete.

Do we need tbl_idx?

---
+$cmd = qq(
+SELECT COUNT(1) = 1 FROM pg_catalog.pg_subscription_rel sr
+WHERE sr.srsubstate IN ('s', 'r'));
+$node_subscriber->poll_query_until('postgres', $cmd);

It seems better to add a condition of srrelid just in case.

---
+$cmd = qq(
+SELECT count(1) = 1 FROM pg_catalog.pg_subscription s
+WHERE s.subname = 'sub' AND s.subenabled IS FALSE);
+$node_subscriber->poll_query_until('postgres', $cmd)
+  or die "Timed out while waiting for subscriber to be disabled";

I think that it's more natural to directly check the subscription's
subenabled. For example:

SELECT subenabled = false FROM pg_subscription WHERE subname = 'sub';

---
+$cmd = q(ALTER SUBSCRIPTION sub ENABLE);
+$node_subscriber->safe_psql('postgres', $cmd);
+$cmd = q(SELECT COUNT(1) = 3 FROM tbl WHERE i = 3);
+$node_subscriber->poll_query_until('postgres', $cmd)
+  or die "Timed out while waiting for applying";

I think it's better to wait for the subscriber to catch up and check
the query result instead of using poll_query_until() so that we can
check the query result in the case where the test fails.

---
+$cmd = qq(DROP INDEX tbl_unique);
+$node_subscriber->safe_psql('postgres', $cmd);

In the newly added tap tests, all queries are consistently assigned to
$cmd and executed even when the query is used only once. It seems a
different style from the one in other tap tests. Is there any reason
why we do this style for all queries in the tap tests?

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#104osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Masahiko Sawada (#103)
1 attachment(s)
RE: Optionally automatically disable logical replication subscriptions on error

On Wednesday, March 2, 2022 12:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

After more thoughts, should we do both AbortOutOfAnyTransaction() and error
message handling while holding interrupts? That is,

HOLD_INTERRUPTS();
EmitErrorReport();
FlushErrorState();
AbortOutOfAnyTransaction();
RESUME_INTERRUPTS();
pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_worker());

I think it's better that we do the cleanup first and then do other work such as
sending the message to the stats collector and updating the catalog.

I agree. Fixed. Along with this change, I corrected the header comment of
DisableSubscriptionOnError, too.

Here are some comments on v24 patch:

+        /* Look up our subscription in the catalogs */
+        tup = SearchSysCacheCopy2(SUBSCRIPTIONNAME, MyDatabaseId,
+                                  CStringGetDatum(MySubscription->name));

s/catalogs/catalog/

Why don't we use SUBSCRIPTIONOID with MySubscription->oid?

Changed.

---
+        if (!HeapTupleIsValid(tup))
+                ereport(ERROR,
+                                errcode(ERRCODE_UNDEFINED_OBJECT),
+                                errmsg("subscription \"%s\" does not exist",
+                                           MySubscription->name));

I think we should use elog() here rather than ereport() since it's a
should-not-happen error.

Fixed

---
+        /* Notify the subscription will be no longer valid */

I'd suggest rephrasing it to like "Notify the subscription will be disabled". (the
subscription is still valid actually, but just disabled).

Fixed. I've also put this sentence in the past tense, because of the code
move described below.

---
+        /* Notify the subscription will be no longer valid */
+        ereport(LOG,
+                        errmsg("logical replication subscription \"%s\" will be disabled due to an error",
+                                   MySubscription->name));
+

I think we can report the log at the end of this function rather than during the
transaction.

Fixed. In this case, I needed to adjust the comment to indicate the processing
to disable the sub has *completed* as well.

---
+my $cmd = qq(
+CREATE TABLE tbl (i INT);
+ALTER TABLE tbl REPLICA IDENTITY FULL;
+CREATE INDEX tbl_idx ON tbl(i));

I think we don't need to set REPLICA IDENTITY FULL on this table since there is
no update/delete.

Do we need tbl_idx?

Removed both the replica identity and tbl_idx.

---
+$cmd = qq(
+SELECT COUNT(1) = 1 FROM pg_catalog.pg_subscription_rel sr
+WHERE sr.srsubstate IN ('s', 'r'));
+$node_subscriber->poll_query_until('postgres', $cmd);

It seems better to add a condition of srrelid just in case.

Makes sense. Fixed.

---
+$cmd = qq(
+SELECT count(1) = 1 FROM pg_catalog.pg_subscription s
+WHERE s.subname = 'sub' AND s.subenabled IS FALSE);
+$node_subscriber->poll_query_until('postgres', $cmd)
+  or die "Timed out while waiting for subscriber to be disabled";

I think that it's more natural to directly check the subscription's subenabled.
For example:

SELECT subenabled = false FROM pg_subscription WHERE subname = 'sub';

Fixed. I also modified the similar code for the tablesync error.

---
+$cmd = q(ALTER SUBSCRIPTION sub ENABLE);
+$node_subscriber->safe_psql('postgres', $cmd);
+$cmd = q(SELECT COUNT(1) = 3 FROM tbl WHERE i = 3);
+$node_subscriber->poll_query_until('postgres', $cmd)
+  or die "Timed out while waiting for applying";

I think it's better to wait for the subscriber to catch up and check the query
result instead of using poll_query_until() so that we can check the query result
in the case where the test fails.

I modified the code to wait for the subscriber to catch up and removed
poll_query_until. Also, considering the test-failure case you mentioned,
it is better to check that the returned count equals 3, so the test now
does that, following your advice.

---
+$cmd = qq(DROP INDEX tbl_unique);
+$node_subscriber->safe_psql('postgres', $cmd);

In the newly added tap tests, all queries are consistently assigned to $cmd and
executed even when the query is used only once. It seems a different style from
the one in other tap tests. Is there any reason why we do this style for all queries
in the tap tests?

Fixed. I removed the 'cmd' variable itself.

Attached an updated patch v26.

Best Regards,
Takamichi Osumi

Attachments:

v26-0001-Optionally-disable-subscriptions-on-error.patchapplication/octet-stream; name=v26-0001-Optionally-disable-subscriptions-on-error.patch
#105Masahiko Sawada
Masahiko Sawada
sawada.mshk@gmail.com
In reply to: osumi.takamichi@fujitsu.com (#104)
1 attachment(s)
Re: Optionally automatically disable logical replication subscriptions on error

On Wed, Mar 2, 2022 at 6:38 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

On Wednesday, March 2, 2022 12:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

After more thoughts, should we do both AbortOutOfAnyTransaction() and error
message handling while holding interrupts? That is,

HOLD_INTERRUPTS();
EmitErrorReport();
FlushErrorState();
AbortOutOfAnyTransaction();
RESUME_INTERRUPTS();
pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_worker());

I think it's better that we do the cleanup first and then do other work such as
sending the message to the stats collector and updating the catalog.

I agree. Fixed. Along with this change, I corrected the header comment of
DisableSubscriptionOnError, too.

Here are some comments on v24 patch:

+        /* Look up our subscription in the catalogs */
+        tup = SearchSysCacheCopy2(SUBSCRIPTIONNAME, MyDatabaseId,
+                                  CStringGetDatum(MySubscription->name));

s/catalogs/catalog/

Why don't we use SUBSCRIPTIONOID with MySubscription->oid?

Changed.

---
+        if (!HeapTupleIsValid(tup))
+                ereport(ERROR,
+                                errcode(ERRCODE_UNDEFINED_OBJECT),
+                                errmsg("subscription \"%s\" does not exist",
+                                           MySubscription->name));

I think we should use elog() here rather than ereport() since it's a
should-not-happen error.

Fixed

---
+        /* Notify the subscription will be no longer valid */

I'd suggest rephrasing it to like "Notify the subscription will be disabled". (the
subscription is still valid actually, but just disabled).

Fixed. I've also put this sentence in the past tense, because of the code
move described below.

---
+        /* Notify the subscription will be no longer valid */
+        ereport(LOG,
+                        errmsg("logical replication subscription \"%s\" will be disabled due to an error",
+                                   MySubscription->name));
+

I think we can report the log at the end of this function rather than during the
transaction.

Fixed. In this case, I needed to adjust the comment to indicate the processing
to disable the sub has *completed* as well.

---
+my $cmd = qq(
+CREATE TABLE tbl (i INT);
+ALTER TABLE tbl REPLICA IDENTITY FULL;
+CREATE INDEX tbl_idx ON tbl(i));

I think we don't need to set REPLICA IDENTITY FULL on this table since there is
no update/delete.

Do we need tbl_idx?

Removed both the replica identity and tbl_idx.

---
+$cmd = qq(
+SELECT COUNT(1) = 1 FROM pg_catalog.pg_subscription_rel sr
+WHERE sr.srsubstate IN ('s', 'r'));
+$node_subscriber->poll_query_until('postgres', $cmd);

It seems better to add a condition of srrelid just in case.

Makes sense. Fixed.

---
+$cmd = qq(
+SELECT count(1) = 1 FROM pg_catalog.pg_subscription s
+WHERE s.subname = 'sub' AND s.subenabled IS FALSE);
+$node_subscriber->poll_query_until('postgres', $cmd)
+  or die "Timed out while waiting for subscriber to be disabled";

I think that it's more natural to directly check the subscription's subenabled.
For example:

SELECT subenabled = false FROM pg_subscription WHERE subname = 'sub';

Fixed. I also modified the similar code for the tablesync error.

---
+$cmd = q(ALTER SUBSCRIPTION sub ENABLE);
+$node_subscriber->safe_psql('postgres', $cmd);
+$cmd = q(SELECT COUNT(1) = 3 FROM tbl WHERE i = 3);
+$node_subscriber->poll_query_until('postgres', $cmd)
+  or die "Timed out while waiting for applying";

I think it's better to wait for the subscriber to catch up and check the query
result instead of using poll_query_until() so that we can check the query result
in the case where the test fails.

I modified the code to wait for the subscriber to catch up and removed
poll_query_until. Also, considering the test-failure case you mentioned,
it is better to check that the returned count equals 3, so the test now
does that, following your advice.

---
+$cmd = qq(DROP INDEX tbl_unique);
+$node_subscriber->safe_psql('postgres', $cmd);

In the newly added tap tests, all queries are consistently assigned to $cmd and
executed even when the query is used only once. It seems a different style from
the one in other tap tests. Is there any reason why we do this style for all queries
in the tap tests?

Fixed. I removed the 'cmd' variable itself.

Attached an updated patch v26.

Thank you for updating the patch.

Here are some comments on v26 patch:

+/*
+ * Disable the current subscription.
+ */
+static void
+DisableSubscriptionOnError(void)

This function now just updates the pg_subscription catalog so can we
move it to pg_subscription.c while having this function accept the
subscription OID to disable? If you agree, the function comment will
also need to be updated.

---
+                /*
+                 * First, ensure that we log the error message so that it won't be
+                 * lost if some other internal error occurs in the following code.
+                 * Then, abort the current transaction and send the stats message of
+                 * the table synchronization failure in an idle state.
+                 */
+                HOLD_INTERRUPTS();
+                EmitErrorReport();
+                FlushErrorState();
+                AbortOutOfAnyTransaction();
+                RESUME_INTERRUPTS();
+                pgstat_report_subscription_error(MySubscription->oid, false);
+
+                if (MySubscription->disableonerr)
+                {
+                        DisableSubscriptionOnError();
+                        proc_exit(0);
+                }
+
+                PG_RE_THROW();

If the disableonerr is false, the same error is reported twice. Also,
the code in PG_CATCH() in both start_apply() and start_table_sync()
are almost the same. Can we create a common function to do post-error
processing?

The worker should exit with return code 1.

I've attached a fixup patch for changes to worker.c for your
reference. Feel free to adopt the changes.

---
+
+# Confirm that we have finished the table sync.
+is( $node_subscriber->safe_psql(
+                'postgres', qq(SELECT MAX(i), COUNT(*) FROM tbl)),
+        "1|3",
+        "subscription sub replicated data");
+

Can we store the result to a local variable to check? I think it's
more consistent with other tap tests.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

Attachments:

0001-fixup-Optionally-disable-subscriptions-on-error.patchapplication/octet-stream; name=0001-fixup-Optionally-disable-subscriptions-on-error.patch
#106Peter Smith
Peter Smith
smithpb2250@gmail.com
In reply to: Masahiko Sawada (#105)
Re: Optionally automatically disable logical replication subscriptions on error

On Fri, Mar 4, 2022 at 5:55 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Mar 2, 2022 at 6:38 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

On Wednesday, March 2, 2022 12:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

After more thoughts, should we do both AbortOutOfAnyTransaction() and error
message handling while holding interrupts? That is,

HOLD_INTERRUPTS();
EmitErrorReport();
FlushErrorState();
AbortOutOfAnyTransaction();
RESUME_INTERRUPTS();
pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_worker());

I think it's better that we do the cleanup first and then do other work such as
sending the message to the stats collector and updating the catalog.

I agree. Fixed. Along with this change, I corrected the header comment of
DisableSubscriptionOnError, too.

Here are some comments on v24 patch:

+        /* Look up our subscription in the catalogs */
+        tup = SearchSysCacheCopy2(SUBSCRIPTIONNAME, MyDatabaseId,
+                                  CStringGetDatum(MySubscription->name));

s/catalogs/catalog/

Why don't we use SUBSCRIPTIONOID with MySubscription->oid?

Changed.

---
+        if (!HeapTupleIsValid(tup))
+                ereport(ERROR,
+                                errcode(ERRCODE_UNDEFINED_OBJECT),
+                                errmsg("subscription \"%s\" does not exist",
+                                           MySubscription->name));

I think we should use elog() here rather than ereport() since it's a
should-not-happen error.

Fixed

---
+        /* Notify the subscription will be no longer valid */

I'd suggest rephrasing it to something like "Notify the subscription will be
disabled". (The subscription is actually still valid, just disabled.)

Fixed. I've also put this sentence in the past tense, because of the code
relocation below.

---
+        /* Notify the subscription will be no longer valid */
+        ereport(LOG,
+                        errmsg("logical replication subscription \"%s\" will be disabled due to an error",
+                                   MySubscription->name));
+

I think we can report the log at the end of this function rather than during the
transaction.

Fixed. In this case, I needed to adjust the comment to indicate the processing
to disable the sub has *completed* as well.

---
+my $cmd = qq(
+CREATE TABLE tbl (i INT);
+ALTER TABLE tbl REPLICA IDENTITY FULL;
+CREATE INDEX tbl_idx ON tbl(i));

I think we don't need to set REPLICA IDENTITY FULL to this table since there is
no update/delete.

Do we need tbl_idx?

Removed both the replica identity and tbl_idx;

---
+$cmd = qq(
+SELECT COUNT(1) = 1 FROM pg_catalog.pg_subscription_rel sr WHERE
+sr.srsubstate IN ('s', 'r'));
+$node_subscriber->poll_query_until('postgres', $cmd);

It seems better to add a condition of srrelid just in case.

Makes sense. Fixed.

---
+$cmd = qq(
+SELECT count(1) = 1 FROM pg_catalog.pg_subscription s WHERE s.subname =
+'sub' AND s.subenabled IS FALSE);
+$node_subscriber->poll_query_until('postgres', $cmd)
+  or die "Timed out while waiting for subscriber to be disabled";

I think that it's more natural to directly check the subscription's subenabled.
For example:

SELECT subenabled = false FROM pg_subscription WHERE subname = 'sub';

Fixed. I also modified the similar code for the tablesync error case.

---
+$cmd = q(ALTER SUBSCRIPTION sub ENABLE);
+$node_subscriber->safe_psql('postgres', $cmd);
+$cmd = q(SELECT COUNT(1) = 3 FROM tbl WHERE i = 3);
+$node_subscriber->poll_query_until('postgres', $cmd)
+  or die "Timed out while waiting for applying";

I think it's better to wait for the subscriber to catch up and check the query
result instead of using poll_query_until() so that we can check the query result
in case where the test fails.

I modified the code to wait for the subscriber to catch up and removed
poll_query_until. Also, considering the failure case you mentioned, it
seemed better for the test to check that the query returns the expected
count of 3, so I did that as well, following your advice.

---
+$cmd = qq(DROP INDEX tbl_unique);
+$node_subscriber->safe_psql('postgres', $cmd);

In the newly added tap tests, all queries are consistently assigned to $cmd and
executed even when the query is used only once. It seems a different style from
the one in other tap tests. Is there any reason why we do this style for all queries
in the tap tests?

Fixed. I removed the 'cmd' variable itself.

Attached an updated patch v26.

Thank you for updating the patch.

Here are some comments on v26 patch:

+/*
+ * Disable the current subscription.
+ */
+static void
+DisableSubscriptionOnError(void)

This function now just updates the pg_subscription catalog so can we
move it to pg_subscription.c while having this function accept the
subscription OID to disable? If you agree, the function comment will
also need to be updated.

---
+                /*
+                 * First, ensure that we log the error message so that it won't be
+                 * lost if some other internal error occurs in the following code.
+                 * Then, abort the current transaction and send the stats message of
+                 * the table synchronization failure in an idle state.
+                 */
+                HOLD_INTERRUPTS();
+                EmitErrorReport();
+                FlushErrorState();
+                AbortOutOfAnyTransaction();
+                RESUME_INTERRUPTS();
+                pgstat_report_subscription_error(MySubscription->oid, false);
+
+                if (MySubscription->disableonerr)
+                {
+                        DisableSubscriptionOnError();
+                        proc_exit(0);
+                }
+
+                PG_RE_THROW();

If the disableonerr is false, the same error is reported twice. Also,
the code in PG_CATCH() in both start_apply() and start_table_sync()
are almost the same. Can we create a common function to do post-error
processing?

The worker should exit with return code 1.

I've attached a fixup patch for changes to worker.c for your
reference. Feel free to adopt the changes.

The way that common function is implemented required removal of the
existing PG_RE_THROW logic, which in turn was only possible using
special knowledge that this just happens to be the last try/catch
block for the apply worker. Yes, I believe everything will work ok,
but it just seemed like a step too far for me to change the throw
logic. I feel that once you get to the point of having to write
special comments in the code to explain "why we can get away with
doing this..." then that is an indication that perhaps it's not really
the best way...

Is there some alternative way to share common code, but without having
to change the existing throw error logic to do so?

OTOH, maybe others think it is ok?

------
Kind Regards,
Peter Smith.
Fujitsu Australia

#107shiy.fnst@fujitsu.com
shiy.fnst@fujitsu.com
shiy.fnst@fujitsu.com
In reply to: osumi.takamichi@fujitsu.com (#104)
RE: Optionally automatically disable logical replication subscriptions on error

On Wed, Mar 2, 2022 5:39 PM osumi.takamichi@fujitsu.com <osumi.takamichi@fujitsu.com> wrote:

Attached an updated patch v26.

Thanks for your patch. A comment on the document.

@@ -7771,6 +7771,16 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l

      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subdisableonerr</structfield> <type>bool</type>
+      </para>
+      <para>
+       If true, the subscription will be disabled if one of its workers
+       detects an error
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subconninfo</structfield> <type>text</type>
       </para>
       <para>

The document for "subdisableonerr" option is placed after "The following
parameters control what happens during subscription creation: ". I think it
should be placed after "The following parameters control the subscription's
replication behavior after it has been created: ", right?

Regards,
Shi yu

#108osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Masahiko Sawada (#105)
1 attachment(s)
RE: Optionally automatically disable logical replication subscriptions on error

On Friday, March 4, 2022 3:55 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Thank you for updating the patch.

Here are some comments on v26 patch:

Thank you for your review !

+/*
+ * Disable the current subscription.
+ */
+static void
+DisableSubscriptionOnError(void)

This function now just updates the pg_subscription catalog so can we move it
to pg_subscription.c while having this function accept the subscription OID to
disable? If you agree, the function comment will also need to be updated.

Agreed. Fixed.

---
+                /*
+                 * First, ensure that we log the error message so that it won't be
+                 * lost if some other internal error occurs in the following code.
+                 * Then, abort the current transaction and send the stats message of
+                 * the table synchronization failure in an idle state.
+                 */
+                HOLD_INTERRUPTS();
+                EmitErrorReport();
+                FlushErrorState();
+                AbortOutOfAnyTransaction();
+                RESUME_INTERRUPTS();
+                pgstat_report_subscription_error(MySubscription->oid, false);
+
+                if (MySubscription->disableonerr)
+                {
+                        DisableSubscriptionOnError();
+                        proc_exit(0);
+                }
+
+                PG_RE_THROW();

If the disableonerr is false, the same error is reported twice. Also, the code in
PG_CATCH() in both start_apply() and start_table_sync() are almost the same.
Can we create a common function to do post-error processing?

Yes. Also, calling PG_RE_THROW wasn't appropriate,
because in the previous v26, for the second error you mentioned,
the patch didn't call errstart when disable_on_error = false.
This was introduced by recent patch refactoring around this code and the rebase
of this patch, but has been fixed by your suggestion.

The worker should exit with return code 1.
I've attached a fixup patch for changes to worker.c for your reference. Feel free
to adopt the changes.

Yes. I adopted almost all of your suggestions.
One thing I fixed was a comment that mentioned table sync
in worker_post_error_processing(), because start_apply()
also uses the function.

---
+
+# Confirm that we have finished the table sync.
+is( $node_subscriber->safe_psql(
+                'postgres', qq(SELECT MAX(i), COUNT(*) FROM tbl)),
+        "1|3",
+        "subscription sub replicated data");
+

Can we store the result to a local variable to check? I think it's more consistent
with other tap tests.

Agreed. Fixed.

Attached the v27. Kindly review the patch.

Best Regards,
Takamichi Osumi

Attachments:

v27-0001-Optionally-disable-subscriptions-on-error.patchapplication/octet-stream; name=v27-0001-Optionally-disable-subscriptions-on-error.patch
#109osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: shiy.fnst@fujitsu.com (#107)
1 attachment(s)
RE: Optionally automatically disable logical replication subscriptions on error

On Monday, March 7, 2022 12:01 PM Shi, Yu/侍 雨 <shiy.fnst@fujitsu.com> wrote:

On Wed, Mar 2, 2022 5:39 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

Attached an updated patch v26.

Thanks for your patch. A comment on the document.

Hi, thank you for checking my patch !

@@ -7771,6 +7771,16 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l

<row>
<entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subdisableonerr</structfield> <type>bool</type>
+      </para>
+      <para>
+       If true, the subscription will be disabled if one of its workers
+       detects an error
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
<structfield>subconninfo</structfield> <type>text</type>
</para>
<para>

The document for "subdisableonerr" option is placed after "The following
parameters control what happens during subscription creation: ". I think it
should be placed after "The following parameters control the subscription's
replication behavior after it has been created: ", right?

Addressed your comment for create_subscription.sgml
(not for catalogs.sgml).

Attached an updated patch v28.

Best Regards,
Takamichi Osumi

Attachments:

v28-0001-Optionally-disable-subscriptions-on-error.patchapplication/octet-stream; name=v28-0001-Optionally-disable-subscriptions-on-error.patch
#110Amit Kapila
Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#106)
Re: Optionally automatically disable logical replication subscriptions on error

On Mon, Mar 7, 2022 at 4:55 AM Peter Smith <smithpb2250@gmail.com> wrote:

On Fri, Mar 4, 2022 at 5:55 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

---
+                /*
+                 * First, ensure that we log the error message so that it won't be
+                 * lost if some other internal error occurs in the following code.
+                 * Then, abort the current transaction and send the stats message of
+                 * the table synchronization failure in an idle state.
+                 */
+                HOLD_INTERRUPTS();
+                EmitErrorReport();
+                FlushErrorState();
+                AbortOutOfAnyTransaction();
+                RESUME_INTERRUPTS();
+                pgstat_report_subscription_error(MySubscription->oid, false);
+
+                if (MySubscription->disableonerr)
+                {
+                        DisableSubscriptionOnError();
+                        proc_exit(0);
+                }
+
+                PG_RE_THROW();

If the disableonerr is false, the same error is reported twice. Also,
the code in PG_CATCH() in both start_apply() and start_table_sync()
are almost the same. Can we create a common function to do post-error
processing?

The worker should exit with return code 1.

I've attached a fixup patch for changes to worker.c for your
reference. Feel free to adopt the changes.

The way that common function is implemented required removal of the
existing PG_RE_THROW logic, which in turn was only possible using
special knowledge that this just happens to be the last try/catch
block for the apply worker.

I think we should re_throw the error in case we have not handled it by
disabling the subscription (in which case we can exit with success
code (0)).

--
With Regards,
Amit Kapila.

#111osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Amit Kapila (#110)
1 attachment(s)
RE: Optionally automatically disable logical replication subscriptions on error

On Monday, March 7, 2022 5:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Mar 7, 2022 at 4:55 AM Peter Smith <smithpb2250@gmail.com> wrote:

On Fri, Mar 4, 2022 at 5:55 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

---
+                /*
+                 * First, ensure that we log the error message so that it won't be
+                 * lost if some other internal error occurs in the following code.
+                 * Then, abort the current transaction and send the stats message of
+                 * the table synchronization failure in an idle state.
+                 */
+                HOLD_INTERRUPTS();
+                EmitErrorReport();
+                FlushErrorState();
+                AbortOutOfAnyTransaction();
+                RESUME_INTERRUPTS();
+                pgstat_report_subscription_error(MySubscription->oid, false);
+
+                if (MySubscription->disableonerr)
+                {
+                        DisableSubscriptionOnError();
+                        proc_exit(0);
+                }
+
+                PG_RE_THROW();

If the disableonerr is false, the same error is reported twice.
Also, the code in PG_CATCH() in both start_apply() and
start_table_sync() are almost the same. Can we create a common
function to do post-error processing?

The worker should exit with return code 1.

I've attached a fixup patch for changes to worker.c for your
reference. Feel free to adopt the changes.

The way that common function is implemented required removal of the
existing PG_RE_THROW logic, which in turn was only possible using
special knowledge that this just happens to be the last try/catch
block for the apply worker.

I think we should re_throw the error in case we have not handled it by disabling
the subscription (in which case we can exit with success code (0)).

Agreed. Fixed the patch so that it uses re_throw.

Another point I changed from v28 is the order of the calls
to AbortOutOfAnyTransaction and FlushErrorState,
which is now more aligned with other places.

Kindly check the attached v29.

Best Regards,
Takamichi Osumi

Attachments:

v29-0001-Optionally-disable-subscriptions-on-error.patchapplication/octet-stream; name=v29-0001-Optionally-disable-subscriptions-on-error.patch
#112Peter Smith
Peter Smith
smithpb2250@gmail.com
In reply to: osumi.takamichi@fujitsu.com (#111)
Re: Optionally automatically disable logical replication subscriptions on error

Please find below some review comments for v29.

======

1. src/backend/replication/logical/worker.c - worker_post_error_processing

+/*
+ * Abort and cleanup the current transaction, then do post-error processing.
+ * This function must be called in a PG_CATCH() block.
+ */
+static void
+worker_post_error_processing(void)

The function comment and function name are too vague/generic. I guess
this is a hang-over from Sawada-san's proposed patch, but since this is
now only called when disabling the subscription, both the comment and
the function name should say that's what it is doing...

e.g. rename to DisableSubscriptionOnError() or something similar.

~~~

2. src/backend/replication/logical/worker.c - worker_post_error_processing

+ /* Notify the subscription has been disabled */
+ ereport(LOG,
+ errmsg("logical replication subscription \"%s\" has been be disabled due to an error",
+    MySubscription->name));

proc_exit(0);
}

I know this is common code, but IMO it would be better to do the
proc_exit(0); from the caller in the PG_CATCH. Then I think the
throw/exit logic will be much easier to read, rather than now
where it is just calling some function that never returns...

Alternatively, if you want the code how it is, then the function name
should give some hint that it is never going to return (e.g.
DisableSubscriptionOnErrorAndExit).

~~~

3. src/backend/replication/logical/worker.c - start_table_sync

+ {
+ /*
+ * Abort the current transaction so that we send the stats message
+ * in an idle state.
+ */
+ AbortOutOfAnyTransaction();
+
+ /* Report the worker failed during table synchronization */
+ pgstat_report_subscription_error(MySubscription->oid, false);
+
+ PG_RE_THROW();
+ }

(This is a repeat of a previous comment from [1], comment #2)

I felt the separation of those 2 statements and comments makes the
code less clean than it could/should be. IMO they should be grouped
together.

SUGGESTED

/*
* Report the worker failed during table synchronization. Abort the
* current transaction so that the stats message is sent in an idle
* state.
*/
AbortOutOfAnyTransaction();
pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_worker());

~~~

4. src/backend/replication/logical/worker.c - start_apply

+ {
+ /*
+ * Abort the current transaction so that we send the stats message
+ * in an idle state.
+ */
+ AbortOutOfAnyTransaction();
+
+ /* Report the worker failed while applying changes */
+ pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_worker());
+
+ PG_RE_THROW();
+ }

(same as #3 but comment says "while applying changes")

SUGGESTED

/*
* Report the worker failed while applying changing. Abort the current
* transaction so that the stats message is sent in an idle state.
*/
AbortOutOfAnyTransaction();
pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_worker());

------
[1]: /messages/by-id/CAHut+PucrizJpqhSyD7dKj1yRkNMskqmiekD_RRqYpdDdusMRQ@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

#113Amit Kapila
Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#112)
Re: Optionally automatically disable logical replication subscriptions on error

On Tue, Mar 8, 2022 at 9:37 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please find below some review comments for v29.

======

1. src/backend/replication/logical/worker.c - worker_post_error_processing

+/*
+ * Abort and cleanup the current transaction, then do post-error processing.
+ * This function must be called in a PG_CATCH() block.
+ */
+static void
+worker_post_error_processing(void)

The function comment and function name are too vague/generic. I guess
this is a hang-over from Sawada-san's proposed patch, but since this is
now only called when disabling the subscription, both the comment and
the function name should say that's what it is doing...

e.g. rename to DisableSubscriptionOnError() or something similar.

~~~

2. src/backend/replication/logical/worker.c - worker_post_error_processing

+ /* Notify the subscription has been disabled */
+ ereport(LOG,
+ errmsg("logical replication subscription \"%s\" has been be disabled due to an error",
+    MySubscription->name));

proc_exit(0);
}

I know this is common code, but IMO it would be better to do the
proc_exit(0); from the caller in the PG_CATCH. Then I think the
throw/exit logic will be much easier to read, rather than now
where it is just calling some function that never returns...

Alternatively, if you want the code how it is, then the function name
should give some hint that it is never going to return (e.g.
DisableSubscriptionOnErrorAndExit).

I think we are already in error so maybe it is better to name it as
DisableSubscriptionAndExit.

Few other comments:
=================
1.
DisableSubscription()
{
..
+
+ LockSharedObject(SubscriptionRelationId, subid, 0, AccessExclusiveLock);

Why do we need AccessExclusiveLock here? The Alter/Drop Subscription
takes AccessExclusiveLock, so AccessShareLock should be sufficient
unless we have a reason to use AccessExclusiveLock. The other
similar usages in this file (pg_subscription.c) also take
AccessShareLock.

2. Shall we mention this feature in conflict handling docs [1]:
Now:
To skip the transaction, the subscription needs to be disabled
temporarily by ALTER SUBSCRIPTION ... DISABLE first.

After:
To skip the transaction, the subscription needs to be disabled
temporarily by ALTER SUBSCRIPTION ... DISABLE first or alternatively,
the subscription can be used with the disable_on_error option.

Feel free to use something along the above lines, if you agree.

--
With Regards,
Amit Kapila.
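
The lock-level question in point 1 can be illustrated with a small C sketch. It assumes only the standard PostgreSQL lock conflict rule that AccessShareLock conflicts with nothing except AccessExclusiveLock, while AccessExclusiveLock conflicts with every mode. LockModeSketch, modes_conflict() and worker_blocks_ddl() are hypothetical names for this illustration, not PostgreSQL APIs.

```c
#include <assert.h>
#include <stdbool.h>

/* Only the two lock modes discussed in the thread are modeled here. */
typedef enum
{
    ACCESS_SHARE,
    ACCESS_EXCLUSIVE
} LockModeSketch;

/* Of these two modes, a pair conflicts whenever either side is exclusive. */
static bool
modes_conflict(LockModeSketch held, LockModeSketch requested)
{
    assert(held == ACCESS_SHARE || held == ACCESS_EXCLUSIVE);
    assert(requested == ACCESS_SHARE || requested == ACCESS_EXCLUSIVE);

    return held == ACCESS_EXCLUSIVE || requested == ACCESS_EXCLUSIVE;
}

/*
 * Would a worker holding worker_mode on the subscription make a concurrent
 * ALTER/DROP SUBSCRIPTION (which requests the exclusive mode) wait?
 */
static bool
worker_blocks_ddl(LockModeSketch worker_mode)
{
    return modes_conflict(worker_mode, ACCESS_EXCLUSIVE);
}
```

Under this model the share lock already makes a concurrent ALTER/DROP SUBSCRIPTION wait, and two share-lock holders do not block each other, which appears to be the basis for preferring AccessShareLock here.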

#114osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Amit Kapila (#113)
1 attachment(s)
RE: Optionally automatically disable logical replication subscriptions on error

On Tuesday, March 8, 2022 2:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Mar 8, 2022 at 9:37 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please find below some review comments for v29.

======

1. src/backend/replication/logical/worker.c - worker_post_error_processing

+/*
+ * Abort and cleanup the current transaction, then do post-error processing.
+ * This function must be called in a PG_CATCH() block.
+ */
+static void
+worker_post_error_processing(void)

The function comment and function name are too vague/generic. I guess
this is a hang-over from Sawada-san's proposed patch, but since this is
now only called when disabling the subscription, both the comment and
the function name should say that's what it is doing...

e.g. rename to DisableSubscriptionOnError() or something similar.

~~~

2. src/backend/replication/logical/worker.c - worker_post_error_processing

+ /* Notify the subscription has been disabled */
+ ereport(LOG,
+ errmsg("logical replication subscription \"%s\" has been be disabled due to an error",
+    MySubscription->name));

proc_exit(0);
}

I know this is common code, but IMO it would be better to do the
proc_exit(0); from the caller in the PG_CATCH. Then I think the
throw/exit logic will be much easier to read, rather than now
where it is just calling some function that never returns...

Alternatively, if you want the code how it is, then the function name
should give some hint that it is never going to return (e.g.
DisableSubscriptionOnErrorAndExit).

I think we are already in error so maybe it is better to name it as
DisableSubscriptionAndExit.

OK. Renamed.

Few other comments:
=================
1.
DisableSubscription()
{
..
+
+ LockSharedObject(SubscriptionRelationId, subid, 0, AccessExclusiveLock);

Why do we need AccessExclusiveLock here? The Alter/Drop Subscription
takes AccessExclusiveLock, so AccessShareLock should be sufficient unless
we have a reason to use AccessExclusiveLock. The other similar usages in
this file (pg_subscription.c) also take AccessShareLock.

Fixed.

2. Shall we mention this feature in conflict handling docs [1]:
Now:
To skip the transaction, the subscription needs to be disabled temporarily by
ALTER SUBSCRIPTION ... DISABLE first.

After:
To skip the transaction, the subscription needs to be disabled temporarily by
ALTER SUBSCRIPTION ... DISABLE first or alternatively, the subscription can
be used with the disable_on_error option.

Feel free to use something on the above lines, if you agree.

Agreed. Fixed.

At the same time, the attached v30 has been rebased over
a recent commit (d3e8368), so that start_table_sync
allocates the origin names in a long-lived context.
Accordingly, I modified some comments in this function.

I also unified and shortened the comments about sending stats
in start_table_sync and start_apply, as pointed out by
Peter Smith in [1].

[1]: /messages/by-id/CAHut+Ps3b8HjsVyo-Aygtnxbw1PiVOC9nvrN6dKxYtS4C8s+gw@mail.gmail.com

Kindly have a look at v30.

Best Regards,
Takamichi Osumi

Attachments:

v30-0001-Optionally-disable-subscriptions-on-error.patchapplication/octet-stream; name=v30-0001-Optionally-disable-subscriptions-on-error.patch
#115osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Peter Smith (#112)
RE: Optionally automatically disable logical replication subscriptions on error

On Tuesday, March 8, 2022 1:07 PM Peter Smith <smithpb2250@gmail.com> wrote:

Please find below some review comments for v29.

Thank you for your comments !

======

1. src/backend/replication/logical/worker.c - worker_post_error_processing

+/*
+ * Abort and cleanup the current transaction, then do post-error processing.
+ * This function must be called in a PG_CATCH() block.
+ */
+static void
+worker_post_error_processing(void)

The function comment and function name are too vague/generic. I guess this is
a hang-over from Sawada-san's proposed patch, but since this is now only
called when disabling the subscription, both the comment and the function
name should say that's what it is doing...

e.g. rename to DisableSubscriptionOnError() or something similar.

Fixed the comments and the function name in v30 shared in [1].

~~~

2. src/backend/replication/logical/worker.c - worker_post_error_processing

+ /* Notify the subscription has been disabled */
+ ereport(LOG,
+ errmsg("logical replication subscription \"%s\" has been be disabled due to an error",
+    MySubscription->name));

proc_exit(0);
}

I know this is common code, but IMO it would be better to do the proc_exit(0);
from the caller in the PG_CATCH. Then I think the throw/exit logic will be much
easier to read, rather than now where it is just calling some function
that never returns...

Alternatively, if you want the code how it is, then the function name should give
some hint that it is never going to return (e.g.
DisableSubscriptionOnErrorAndExit).

I renamed it to DisableSubscriptionAndExit in the end
according to the discussion.

~~~

3. src/backend/replication/logical/worker.c - start_table_sync

+ {
+ /*
+ * Abort the current transaction so that we send the stats message
+ * in an idle state.
+ */
+ AbortOutOfAnyTransaction();
+
+ /* Report the worker failed during table synchronization */
+ pgstat_report_subscription_error(MySubscription->oid, false);
+
+ PG_RE_THROW();
+ }

(This is a repeat of a previous comment from [1] comment #2)

I felt the separation of those 2 statements and comments makes the code less
clean than it could/should be. IMO they should be grouped together.

SUGGESTED

/*
* Report the worker failed during table synchronization. Abort the
* current transaction so that the stats message is sent in an idle
* state.
*/
AbortOutOfAnyTransaction();
pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_worker());

Fixed.

~~~

4. src/backend/replication/logical/worker.c - start_apply

+ {
+ /*
+ * Abort the current transaction so that we send the stats message
+ * in an idle state.
+ */
+ AbortOutOfAnyTransaction();
+
+ /* Report the worker failed while applying changes */
+ pgstat_report_subscription_error(MySubscription->oid,
+ !am_tablesync_worker());
+
+ PG_RE_THROW();
+ }

(same as #3 but comment says "while applying changes")

SUGGESTED

/*
* Report the worker failed while applying changing. Abort the current
* transaction so that the stats message is sent in an idle state.
*/
AbortOutOfAnyTransaction();
pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_worker());

Fixed. I chose the wording "while applying changes", which you mentioned first
and which sounds more natural.

[1]: /messages/by-id/TYCPR01MB8373B74627C6BAF2F146D779ED099@TYCPR01MB8373.jpnprd01.prod.outlook.com

Best Regards,
Takamichi Osumi

#116Amit Kapila
Amit Kapila
amit.kapila16@gmail.com
In reply to: osumi.takamichi@fujitsu.com (#114)
Re: Optionally automatically disable logical replication subscriptions on error

On Tue, Mar 8, 2022 at 1:37 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

Kindly have a look at v30.

Review comments:
===============
1.
+ ereport(LOG,
+ errmsg("logical replication subscription \"%s\" has been be disabled
due to an error",

Typo.
/been be/been

2. Is there a reason the patch doesn't allow workers to restart via
maybe_reread_subscription() when this new option is changed, if so,
then let's add a comment for the same? We currently seem to be
restarting the worker on any change via Alter Subscription. If we
decide to change it for this option as well then I think we need to
accordingly update the current comment: "Exit if any parameter that
affects the remote connection was changed." to something like "Exit if
any parameter that affects the remote connection or a subscription
option was changed..."

3.
  if (fout->remoteVersion >= 150000)
- appendPQExpBufferStr(query, " s.subtwophasestate\n");
+ appendPQExpBufferStr(query, " s.subtwophasestate,\n");
  else
  appendPQExpBuffer(query,
-   " '%c' AS subtwophasestate\n",
+   " '%c' AS subtwophasestate,\n",
    LOGICALREP_TWOPHASE_STATE_DISABLED);
+ if (fout->remoteVersion >= 150000)
+ appendPQExpBuffer(query, " s.subdisableonerr\n");
+ else
+ appendPQExpBuffer(query,
+   " false AS subdisableonerr\n");

It is better to combine these parameters. I see there is a similar
coding pattern for 14 but I think that is not required.
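
A rough standalone sketch of what "combining" could look like, since both columns depend on the same version check. It uses plain snprintf instead of the PQExpBuffer API, and demo_append_subscription_columns and DEMO_TWOPHASE_STATE_DISABLED are hypothetical stand-ins for illustration only, not the patch's code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for LOGICALREP_TWOPHASE_STATE_DISABLED (assumed value). */
#define DEMO_TWOPHASE_STATE_DISABLED 'd'

/*
 * Append the last two columns of the SELECT list to buf.  One version
 * check covers both columns, so the two if/else pairs collapse into one.
 */
void
demo_append_subscription_columns(char *buf, size_t buflen, int remote_version)
{
	size_t		used;

	assert(buf != NULL);
	used = strlen(buf);
	if (used >= buflen)
		return;					/* no room left; nothing is appended */

	if (remote_version >= 150000)
		snprintf(buf + used, buflen - used,
				 " s.subtwophasestate,\n"
				 " s.subdisableonerr\n");
	else
		snprintf(buf + used, buflen - used,
				 " '%c' AS subtwophasestate,\n"
				 " false AS subdisableonerr\n",
				 DEMO_TWOPHASE_STATE_DISABLED);
}
```

The resulting query text is the same as with two separate checks; only the branching is simpler.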

4.
+$node_subscriber->safe_psql('postgres', qq(ALTER SUBSCRIPTION sub ENABLE));
+
+# Wait for the data to replicate.
+$node_subscriber->poll_query_until(
+ 'postgres', qq(
+SELECT COUNT(1) = 1 FROM pg_catalog.pg_subscription_rel sr
+WHERE sr.srsubstate IN ('s', 'r') AND sr.srrelid = 'tbl'::regclass));

See other scripts like t/015_stream.pl and wait for data replication
in the same way. I think there are two things to change: (a) In the
above query, we use NOT IN at other places (b) use
$node_publisher->wait_for_catchup before this query.

--
With Regards,
Amit Kapila.

#117Masahiko Sawada
Masahiko Sawada
sawada.mshk@gmail.com
In reply to: osumi.takamichi@fujitsu.com (#114)
Re: Optionally automatically disable logical replication subscriptions on error

On Tue, Mar 8, 2022 at 5:07 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

Kindly have a look at v30.

Thank you for updating the patch. Here are some comments:

+   /*
+    * Allocate the origin name in long-lived context for error context
+    * message.
+    */
+   ReplicationOriginNameForTablesync(MySubscription->oid,
+                                     MyLogicalRepWorker->relid,
+                                     originname,
+                                     sizeof(originname));
+   apply_error_callback_arg.origin_name = MemoryContextStrdup(ApplyContext,
+                                                              originname);

I think it's better to set apply_error_callback_arg.origin_name in the
caller rather than in start_table_sync(). Apply workers set
apply_error_callback_arg.origin_name there and it's not necessarily
necessary to do that in this function.

Even if we want to do that, I think it's not necessary to pass
originname to start_table_sync(). It's a local variable and used only
to temporarily store the tablesync worker's origin name.

---
It might have already been discussed but the worker disables the
subscription on an error but doesn't work for a fatal. Is that
expected or should we handle that too?

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#118Amit Kapila
Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#117)
Re: Optionally automatically disable logical replication subscriptions on error

On Wed, Mar 9, 2022 at 6:29 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

---
It might have already been discussed but the worker disables the
subscription on an error but doesn't work for a fatal. Is that
expected or should we handle that too?

I am not too sure about handling FATALs with this feature because this
is mainly to aid in resolving conflicts due to various constraints. It
might be okay to retry in case of FATAL which is possibly due to some
system resource error. OTOH, if we see that it will be good to disable
for a FATAL error as well then I think we can use
PG_ENSURE_ERROR_CLEANUP construct. What do you think?

--
With Regards,
Amit Kapila.

#119Amit Kapila
Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#116)
Re: Optionally automatically disable logical replication subscriptions on error

On Tue, Mar 8, 2022 at 6:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Mar 8, 2022 at 1:37 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

Kindly have a look at v30.

Review comments:
===============

Few comments on test script:
=======================
1.
+# This tests the uniqueness violation will cause the subscription
+# to fail during initial synchronization and make it disabled.

/This tests the/This tests that the

2.
+$node_publisher->safe_psql('postgres',
+ qq(CREATE PUBLICATION pub FOR TABLE tbl));
+$node_subscriber->safe_psql(
+ 'postgres', qq(
+CREATE SUBSCRIPTION sub
+CONNECTION '$publisher_connstr'
+PUBLICATION pub WITH (disable_on_error = true)));

Please check other test scripts like t/015_stream.pl or
t/028_row_filter.pl and keep the indentation of these commands
similar. It looks odd and inconsistent with other tests. Also, we can
use double-quotes instead of qq so as to be consistent with other
scripts. Please check other similar places and make them consistent
with other test script files.

3.
+# Initial synchronization failure causes the subscription
+# to be disabled.

Here and in other places in test scripts, the comment lines seem too
short to me. Normally, we can keep it at the 80 char limit but this
appears too short.

4.
+# Delete the data from the subscriber and recreate the unique index.
+$node_subscriber->safe_psql(
+ 'postgres', q(
+DELETE FROM tbl;
+CREATE UNIQUE INDEX tbl_unique ON tbl (i)));

In other tests, we are executing single statements via safe_psql. I
don't see a problem with this but also don't see a reason to deviate
from the normal pattern.

--
With Regards,
Amit Kapila.

#120Masahiko Sawada
Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#118)
Re: Optionally automatically disable logical replication subscriptions on error

On Wed, Mar 9, 2022 at 12:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Mar 9, 2022 at 6:29 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

---
It might have already been discussed but the worker disables the
subscription on an error but doesn't work for a fatal. Is that
expected or should we handle that too?

I am not too sure about handling FATALs with this feature because this
is mainly to aid in resolving conflicts due to various constraints. It
might be okay to retry in case of FATAL which is possibly due to some
system resource error. OTOH, if we see that it will be good to disable
for a FATAL error as well then I think we can use
PG_ENSURE_ERROR_CLEANUP construct. What do you think?

I think that since FATAL raised by logical replication workers (e.g.,
terminated by DDL or out of memory etc?) is normally not a repeatable
error, it's reasonable to retry in this case.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#121Amit Kapila
Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#120)
Re: Optionally automatically disable logical replication subscriptions on error

On Wed, Mar 9, 2022 at 11:22 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Mar 9, 2022 at 12:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Mar 9, 2022 at 6:29 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

---
It might have already been discussed but the worker disables the
subscription on an error but doesn't work for a fatal. Is that
expected or should we handle that too?

I am not too sure about handling FATALs with this feature because this
is mainly to aid in resolving conflicts due to various constraints. It
might be okay to retry in case of FATAL which is possibly due to some
system resource error. OTOH, if we see that it will be good to disable
for a FATAL error as well then I think we can use
PG_ENSURE_ERROR_CLEANUP construct. What do you think?

I think that since FATAL raised by logical replication workers (e.g.,
terminated by DDL or out of memory etc?) is normally not a repeatable
error, it's reasonable to retry in this case.

Yeah, I think we can add a comment in the code for this so that future
readers know that this has been done deliberately.
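
As a summary of the behaviour agreed on above, a minimal sketch follows. The enum and function names are hypothetical and exist only for this illustration; in the actual worker the distinction falls out of the error handling itself, since the PG_CATCH block is reached for ERROR but not for FATAL.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical error levels and outcomes, for illustration only. */
typedef enum DemoErrorLevel
{
	DEMO_ERROR,
	DEMO_FATAL
} DemoErrorLevel;

typedef enum DemoOutcome
{
	DEMO_DISABLE_SUBSCRIPTION,
	DEMO_RESTART_AND_RETRY
} DemoOutcome;

/*
 * Decide what happens after a worker failure.  Only an ERROR with
 * disable_on_error set disables the subscription; a FATAL is assumed to be
 * non-repeatable (e.g. a resource problem), so the worker simply retries.
 */
DemoOutcome
demo_outcome_after_failure(bool disable_on_error, DemoErrorLevel level)
{
	assert(level == DEMO_ERROR || level == DEMO_FATAL);

	if (disable_on_error && level == DEMO_ERROR)
		return DEMO_DISABLE_SUBSCRIPTION;
	return DEMO_RESTART_AND_RETRY;
}
```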

--
With Regards,
Amit Kapila.

#122osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Amit Kapila (#121)
1 attachment(s)
RE: Optionally automatically disable logical replication subscriptions on error

On Wednesday, March 9, 2022 3:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Mar 9, 2022 at 11:22 AM Masahiko Sawada
<sawada.mshk@gmail.com> wrote:

On Wed, Mar 9, 2022 at 12:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Mar 9, 2022 at 6:29 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

---
It might have already been discussed but the worker disables the
subscription on an error but doesn't work for a fatal. Is that
expected or should we handle that too?

I am not too sure about handling FATALs with this feature because
this is mainly to aid in resolving conflicts due to various
constraints. It might be okay to retry in case of FATAL which is
possibly due to some system resource error. OTOH, if we see that it
will be good to disable for a FATAL error as well then I think we
can use PG_ENSURE_ERROR_CLEANUP construct. What do you think?

I think that since FATAL raised by logical replication workers (e.g.,
terminated by DDL or out of memory etc?) is normally not a repeatable
error, it's reasonable to retry in this case.

Yeah, I think we can add a comment in the code for this so that future readers
know that this has been done deliberately.

OK. I've added some comments in the code.

v31 also addresses the other comments raised on -hackers so far:
(a) brush up the TAP test alignment
(b) fix the place of apply_error_callback_arg.origin_name for table sync worker
(c) modify maybe_reread_subscription to exit, when disable_on_error changes
(d) improve getSubscriptions to combine some branches for v15

Kindly check the attached v31.

Best Regards,
Takamichi Osumi

Attachments:

v31-0001-Optionally-disable-subscriptions-on-error.patchapplication/octet-stream; name=v31-0001-Optionally-disable-subscriptions-on-error.patch
#123osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Amit Kapila (#119)
RE: Optionally automatically disable logical replication subscriptions on error

On Wednesday, March 9, 2022 1:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Mar 8, 2022 at 6:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Mar 8, 2022 at 1:37 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

Kindly have a look at v30.

Review comments:
===============

Thank you for reviewing !

Few comments on test script:
=======================
1.
+# This tests the uniqueness violation will cause the subscription
+# to fail during initial synchronization and make it disabled.

/This tests the/This tests that the

Fixed.

2.
+$node_publisher->safe_psql('postgres',
+ qq(CREATE PUBLICATION pub FOR TABLE tbl));
+$node_subscriber->safe_psql(
+ 'postgres', qq(
+CREATE SUBSCRIPTION sub
+CONNECTION '$publisher_connstr'
+PUBLICATION pub WITH (disable_on_error = true)));

Please check other test scripts like t/015_stream.pl or t/028_row_filter.pl and
keep the indentation of these commands similar. It looks odd and inconsistent
with other tests. Also, we can use double-quotes instead of qq so as to be
consistent with other scripts. Please check other similar places and make
them consistent with other test script files.

Fixed the inconsistent indentation within each command.
Also replaced qq with double-quotes (except for the second argument
of is(), where qq is consistent with how the other tests are written).

3.
+# Initial synchronization failure causes the subscription
+# to be disabled.

Here and in other places in test scripts, the comment lines seem too short to
me. Normally, we can keep it at the 80 char limit but this appears too short.

Fixed.

4.
+# Delete the data from the subscriber and recreate the unique index.
+$node_subscriber->safe_psql(
+ 'postgres', q(
+DELETE FROM tbl;
+CREATE UNIQUE INDEX tbl_unique ON tbl (i)));

In other tests, we are executing single statements via safe_psql. I don't see a
problem with this but also don't see a reason to deviate from the normal
pattern.

Fixed.

At the same time, I fixed one comment that should have said
"subscriber" rather than "sub", since the rest of the test file
uses the former.

The new patch v31 is shared in [1].

[1]: /messages/by-id/TYCPR01MB8373824855A6C4D2178027A0ED0A9@TYCPR01MB8373.jpnprd01.prod.outlook.com

Best Regards,
Takamichi Osumi

#124osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Masahiko Sawada (#117)
RE: Optionally automatically disable logical replication subscriptions on error

On Wednesday, March 9, 2022 9:58 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Mar 8, 2022 at 5:07 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

Kindly have a look at v30.

Thank you for updating the patch. Here are some comments:

Hi, thank you for your review !

+   /*
+    * Allocate the origin name in long-lived context for error context
+    * message.
+    */
+   ReplicationOriginNameForTablesync(MySubscription->oid,
+                                     MyLogicalRepWorker->relid,
+                                     originname,
+                                     sizeof(originname));
+   apply_error_callback_arg.origin_name = MemoryContextStrdup(ApplyContext,
+                                                              originname);

I think it's better to set apply_error_callback_arg.origin_name in the caller
rather than in start_table_sync(). Apply workers set
apply_error_callback_arg.origin_name there and it's not necessarily necessary
to do that in this function.

OK. I moved this origin_name logic back up into ApplyWorkerMain.

The new patch v31 is shared in [1].

[1]: /messages/by-id/TYCPR01MB8373824855A6C4D2178027A0ED0A9@TYCPR01MB8373.jpnprd01.prod.outlook.com

Best Regards,
Takamichi Osumi

#125osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Amit Kapila (#116)
RE: Optionally automatically disable logical replication subscriptions on error

On Tuesday, March 8, 2022 10:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Mar 8, 2022 at 1:37 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

Kindly have a look at v30.

Review comments:

Thank you for checking !

===============
1.
+ ereport(LOG,
+ errmsg("logical replication subscription \"%s\" has been be disabled
due to an error",

Typo.
/been be/been

Fixed.

2. Is there a reason the patch doesn't allow workers to restart via
maybe_reread_subscription() when this new option is changed, if so, then let's
add a comment for the same? We currently seem to be restarting the worker on
any change via Alter Subscription. If we decide to change it for this option as
well then I think we need to accordingly update the current comment: "Exit if
any parameter that affects the remote connection was changed." to something
like "Exit if any parameter that affects the remote connection or a subscription
option was changed..."

I thought it's ok without the change at the beginning, but I was wrong.
To make this new option aligned with others, I should add one check
for this feature. Fixed.

3.
if (fout->remoteVersion >= 150000)
- appendPQExpBufferStr(query, " s.subtwophasestate\n");
+ appendPQExpBufferStr(query, " s.subtwophasestate,\n");
else
appendPQExpBuffer(query,
-   " '%c' AS subtwophasestate\n",
+   " '%c' AS subtwophasestate,\n",
LOGICALREP_TWOPHASE_STATE_DISABLED);
+ if (fout->remoteVersion >= 150000)
+ appendPQExpBuffer(query, " s.subdisableonerr\n");
+ else
+ appendPQExpBuffer(query,
+   " false AS subdisableonerr\n");

It is better to combine these parameters. I see there is a similar coding pattern
for 14 but I think that is not required.

Fixed and combined them together.

4.
+$node_subscriber->safe_psql('postgres', qq(ALTER SUBSCRIPTION sub ENABLE));
+
+# Wait for the data to replicate.
+$node_subscriber->poll_query_until(
+ 'postgres', qq(
+SELECT COUNT(1) = 1 FROM pg_catalog.pg_subscription_rel sr
+WHERE sr.srsubstate IN ('s', 'r') AND sr.srrelid = 'tbl'::regclass));

See other scripts like t/015_stream.pl and wait for data replication in the same
way. I think there are two things to change: (a) In the above query, we use NOT
IN at other places (b) use $node_publisher->wait_for_catchup before this
query.

Fixed.

The new patch is shared in [1].

[1]: /messages/by-id/TYCPR01MB8373824855A6C4D2178027A0ED0A9@TYCPR01MB8373.jpnprd01.prod.outlook.com

Best Regards,
Takamichi Osumi

#126Masahiko Sawada
Masahiko Sawada
sawada.mshk@gmail.com
In reply to: osumi.takamichi@fujitsu.com (#125)
Re: Optionally automatically disable logical replication subscriptions on error

On Wed, Mar 9, 2022 at 4:33 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

On Tuesday, March 8, 2022 10:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Mar 8, 2022 at 1:37 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

2. Is there a reason the patch doesn't allow workers to restart via
maybe_reread_subscription() when this new option is changed, if so, then let's
add a comment for the same? We currently seem to be restarting the worker on
any change via Alter Subscription. If we decide to change it for this option as
well then I think we need to accordingly update the current comment: "Exit if
any parameter that affects the remote connection was changed." to something
like "Exit if any parameter that affects the remote connection or a subscription
option was changed..."

I thought it's ok without the change at the beginning, but I was wrong.
To make this new option aligned with others, I should add one check
for this feature. Fixed.

Why do we need to restart the apply worker when disable_on_error is
changed? It doesn't affect the remote connection at all. I think it
can be changed without restarting like synchronous_commit option.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#127Amit Kapila
Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#126)
Re: Optionally automatically disable logical replication subscriptions on error

On Wed, Mar 9, 2022 at 2:21 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Mar 9, 2022 at 4:33 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

On Tuesday, March 8, 2022 10:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Mar 8, 2022 at 1:37 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

2. Is there a reason the patch doesn't allow workers to restart via
maybe_reread_subscription() when this new option is changed, if so, then let's
add a comment for the same? We currently seem to be restarting the worker on
any change via Alter Subscription. If we decide to change it for this option as
well then I think we need to accordingly update the current comment: "Exit if
any parameter that affects the remote connection was changed." to something
like "Exit if any parameter that affects the remote connection or a subscription
option was changed..."

I thought it's ok without the change at the beginning, but I was wrong.
To make this new option aligned with others, I should add one check
for this feature. Fixed.

Why do we need to restart the apply worker when disable_on_error is
changed? It doesn't affect the remote connection at all. I think it
can be changed without restarting like synchronous_commit option.

oh right, I thought that how will we update its value in
MySubscription after a change but as we re-read the pg_subscription
table for the current subscription and update MySubscription, I feel
we don't need to restart it. I haven't tested it but it should work
without a restart.
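
The idea can be sketched roughly as follows. This is a simplified standalone illustration under my own assumptions: DemoSubscription and the demo_* functions are hypothetical, and the real maybe_reread_subscription() compares more fields than the two shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical, cut-down view of the subscription settings. */
typedef struct DemoSubscription
{
	const char *conninfo;
	const char *slotname;
	bool		disable_on_error;
} DemoSubscription;

/* True only when a setting that affects the remote connection changed. */
bool
demo_change_requires_exit(const DemoSubscription *oldsub,
						  const DemoSubscription *newsub)
{
	assert(oldsub != NULL && newsub != NULL);
	return strcmp(oldsub->conninfo, newsub->conninfo) != 0 ||
		strcmp(oldsub->slotname, newsub->slotname) != 0;
}

/*
 * Returns true if the worker has to exit and be restarted.  Otherwise the
 * freshly read settings replace the current ones in place, which is how a
 * changed disable_on_error value becomes visible without a restart.
 */
bool
demo_maybe_reread_subscription(DemoSubscription *current,
							   const DemoSubscription *fresh)
{
	if (demo_change_requires_exit(current, fresh))
		return true;
	*current = *fresh;
	return false;
}
```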

--
With Regards,
Amit Kapila.

#128osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Amit Kapila (#127)
1 attachment(s)
RE: Optionally automatically disable logical replication subscriptions on error

On Wednesday, March 9, 2022 8:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Mar 9, 2022 at 2:21 PM Masahiko Sawada
<sawada.mshk@gmail.com> wrote:

On Wed, Mar 9, 2022 at 4:33 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

On Tuesday, March 8, 2022 10:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Mar 8, 2022 at 1:37 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

2. Is there a reason the patch doesn't allow workers to restart via
maybe_reread_subscription() when this new option is changed, if so, then let's
add a comment for the same? We currently seem to be restarting the worker on
any change via Alter Subscription. If we decide to change it for this option as
well then I think we need to accordingly update the current comment: "Exit if
any parameter that affects the remote connection was changed." to something
like "Exit if any parameter that affects the remote connection or a subscription
option was changed..."

I thought it's ok without the change at the beginning, but I was wrong.
To make this new option aligned with others, I should add one check
for this feature. Fixed.

Why do we need to restart the apply worker when disable_on_error is
changed? It doesn't affect the remote connection at all. I think it
can be changed without restarting like synchronous_commit option.

oh right, I thought that how will we update its value in MySubscription after a
change but as we re-read the pg_subscription table for the current
subscription and update MySubscription, I feel we don't need to restart it. I
haven't tested it but it should work without a restart.

Hi, attached v32 removed my additional code for maybe_reread_subscription.

Also, I decided that this patch doesn't need a new comment for this, because
the outcome of this discussion can be inferred from the existing comments and code:
(1) "Reread subscription info if needed. Most changes will be exit."
implies that there are some cases where we don't exit.
(2) From "Exit if any parameter that affects the remote connection was changed.",
readers can tell that a change to the disable_on_error option is one of the cases that does not exit.

Kindly review the v32.

Best Regards,
Takamichi Osumi

Attachments:

v32-0001-Optionally-disable-subscriptions-on-error.patchapplication/octet-stream; name=v32-0001-Optionally-disable-subscriptions-on-error.patch
#129Amit Kapila
Amit Kapila
amit.kapila16@gmail.com
In reply to: osumi.takamichi@fujitsu.com (#128)
1 attachment(s)
Re: Optionally automatically disable logical replication subscriptions on error

On Wed, Mar 9, 2022 at 7:57 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

Hi, attached v32 removed my additional code for maybe_reread_subscription.

Thanks, the patch looks good to me. I have made minor edits in the
attached. I am planning to commit this early next week (Monday) unless
there are any other major comments.

--
With Regards,
Amit Kapila.

Attachments:

v33-0001-Optionally-disable-subscriptions-on-error.patchapplication/octet-stream; name=v33-0001-Optionally-disable-subscriptions-on-error.patch
#130Amit Kapila
Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#129)
Re: Optionally automatically disable logical replication subscriptions on error

On Thu, Mar 10, 2022 at 12:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Mar 9, 2022 at 7:57 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

Hi, attached v32 removed my additional code for maybe_reread_subscription.

Thanks, the patch looks good to me. I have made minor edits in the
attached. I am planning to commit this early next week (Monday) unless
there are any other major comments.

Pushed.

--
With Regards,
Amit Kapila.

#131osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Amit Kapila (#130)
RE: Optionally automatically disable logical replication subscriptions on error

On Monday, March 14, 2022 7:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Mar 10, 2022 at 12:04 PM Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Wed, Mar 9, 2022 at 7:57 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

Hi, attached v32 removed my additional code for maybe_reread_subscription.

Thanks, the patch looks good to me. I have made minor edits in the
attached. I am planning to commit this early next week (Monday) unless
there are any other major comments.

Pushed.

Thank you so much !

Best Regards,
Takamichi Osumi

#132Nathan Bossart
Nathan Bossart
nathandbossart@gmail.com
In reply to: osumi.takamichi@fujitsu.com (#131)
1 attachment(s)
Re: Optionally automatically disable logical replication subscriptions on error

My compiler is worried that syncslotname may be used uninitialized in
start_table_sync(). The attached patch seems to silence this warning.
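
For context, this kind of warning typically shows up when a variable is assigned on one path and the other path leaves through something the compiler cannot prove never returns. The sketch below is a generic standalone illustration with hypothetical demo_* names, not the contents of the attached patch; initialising the variable is one common way to silence such a warning, and I am assuming rather than asserting that this is what the patch does.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Stand-in for a rethrow.  It is not marked noreturn, so a compiler that
 * cannot see its body has to assume it might return.
 */
void
demo_rethrow(void)
{
	exit(1);
}

const char *
demo_start_table_sync(int fail)
{
	/* Without "= NULL" some compilers warn about a possibly uninitialized use. */
	const char *syncslotname = NULL;

	if (!fail)
		syncslotname = "demo_sync_slot";
	else
		demo_rethrow();

	/* Reached only on success, but the compiler may not be able to prove it. */
	assert(syncslotname != NULL);
	return syncslotname;
}
```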

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachments:

v1-0001-silence-compiler-warning.patchtext/x-diff; charset=us-ascii
#133osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Nathan Bossart (#132)
RE: Optionally automatically disable logical replication subscriptions on error

On Tuesday, March 15, 2022 8:04 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

My compiler is worried that syncslotname may be used uninitialized in
start_table_sync(). The attached patch seems to silence this warning.

Thank you for the report!

Your fix looks good to me.

Best Regards,
Takamichi Osumi