Optionally automatically disable logical replication subscriptions on error

Started by Mark Dilger · almost 5 years ago · 133 messages · pgsql-hackers
#1 Mark Dilger
mark.dilger@enterprisedb.com

Hackers,

Logical replication apply workers for a subscription can easily get stuck in an infinite loop of attempting to apply a change, triggering an error (such as a constraint violation), exiting with an error written to the subscription worker log, and restarting.

As things currently stand, only superusers can create subscriptions. Ongoing work to delegate superuser tasks to non-superusers creates the potential for even more errors to be triggered, specifically, errors where the apply worker does not have permission to make changes to the target table.

The attached patch makes it possible to create a subscription using a new subscription_parameter, "disable_on_error", such that rather than going into an infinite loop, the apply worker will catch errors and automatically disable the subscription, breaking the loop. The new parameter defaults to false. When false, the PG_TRY/PG_CATCH overhead is avoided, so for subscriptions which do not use the feature, there shouldn't be any change. Users can manually clear the error after fixing the underlying issue with an ALTER SUBSCRIPTION .. ENABLE command.
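To make the intended workflow concrete, usage might look like the following sketch (based on the patch description; the subscription name, publication name, and connection string are placeholders, and the exact option spelling could change in later versions):

```sql
-- Create a subscription that disables itself on apply error instead of
-- retrying forever (disable_on_error is the new option proposed here).
CREATE SUBSCRIPTION mysub
    CONNECTION 'host=publisher dbname=src'
    PUBLICATION mypub
    WITH (disable_on_error = true);

-- After an error, the subscription is left disabled with the error message
-- recorded. Once the underlying problem (e.g. a conflicting row) is fixed,
-- re-enable it manually:
ALTER SUBSCRIPTION mysub ENABLE;
```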

In addition to helping on production systems, this makes writing TAP tests involving error conditions simpler. I originally ran into the motivation to write this patch when frustrated that TAP tests needed to parse the apply worker log file to determine whether permission failures were occurring and what they were. It was also obnoxiously easy to have a test get stuck waiting for a permanently stuck subscription to catch up. This helps with both issues.

I don't think this is quite ready for commit, but I'd like feedback if folks like this idea or want to suggest design changes.

Attachments:

v1-0001-Optionally-disabling-subscriptions-on-error.patch (application/octet-stream; +408 -5)
#2 Amit Kapila
amit.kapila16@gmail.com
In reply to: Mark Dilger (#1)
Re: Optionally automatically disable logical replication subscriptions on error

On Fri, Jun 18, 2021 at 1:48 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

Hackers,

Logical replication apply workers for a subscription can easily get stuck in an infinite loop of attempting to apply a change, triggering an error (such as a constraint violation), exiting with an error written to the subscription worker log, and restarting.

As things currently stand, only superusers can create subscriptions. Ongoing work to delegate superuser tasks to non-superusers creates the potential for even more errors to be triggered, specifically, errors where the apply worker does not have permission to make changes to the target table.

The attached patch makes it possible to create a subscription using a new subscription_parameter, "disable_on_error", such that rather than going into an infinite loop, the apply worker will catch errors and automatically disable the subscription, breaking the loop. The new parameter defaults to false. When false, the PG_TRY/PG_CATCH overhead is avoided, so for subscriptions which do not use the feature, there shouldn't be any change. Users can manually clear the error after fixing the underlying issue with an ALTER SUBSCRIPTION .. ENABLE command.

I see this idea has merits and it will help users to repair failing
subscriptions. A few points from a quick look at the patch: (a) The
patch seems to assume that the error can happen only in the apply
worker, but I think a constraint violation can happen via one of the
tablesync workers as well. (b) What happens if an error occurs while
you are updating the error information in the catalog table? I think
that instead of seeing the actual apply-time error, the user might see
some other error, for which it won't be clear what the appropriate
action is.

We are also discussing another action, skipping the apply of the
transaction on an error [1]. I think it is better to evaluate both
proposals together, as one seems to be an extension of the other.
Adding Sawada-San, as he is working on the other proposal.

[1]: /messages/by-id/CAD21AoDeScrsHhLyEPYqN3sydg6PxAPVBboK=30xJfUVihNZDA@mail.gmail.com

--
With Regards,
Amit Kapila.

#3 Peter Smith
smithpb2250@gmail.com
In reply to: Mark Dilger (#1)
Re: Optionally automatically disable logical replication subscriptions on error

On Fri, Jun 18, 2021 at 6:18 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

Hackers,

Logical replication apply workers for a subscription can easily get stuck in an infinite loop of attempting to apply a change, triggering an error (such as a constraint violation), exiting with an error written to the subscription worker log, and restarting.

As things currently stand, only superusers can create subscriptions. Ongoing work to delegate superuser tasks to non-superusers creates the potential for even more errors to be triggered, specifically, errors where the apply worker does not have permission to make changes to the target table.

The attached patch makes it possible to create a subscription using a new subscription_parameter, "disable_on_error", such that rather than going into an infinite loop, the apply worker will catch errors and automatically disable the subscription, breaking the loop. The new parameter defaults to false. When false, the PG_TRY/PG_CATCH overhead is avoided, so for subscriptions which do not use the feature, there shouldn't be any change. Users can manually clear the error after fixing the underlying issue with an ALTER SUBSCRIPTION .. ENABLE command.

In addition to helping on production systems, this makes writing TAP tests involving error conditions simpler. I originally ran into the motivation to write this patch when frustrated that TAP tests needed to parse the apply worker log file to determine whether permission failures were occurring and what they were. It was also obnoxiously easy to have a test get stuck waiting for a permanently stuck subscription to catch up. This helps with both issues.

I don't think this is quite ready for commit, but I'd like feedback if folks like this idea or want to suggest design changes.

I tried your patch.

It applied OK (albeit with whitespace warnings).

The code builds and the TAP tests all pass.

Below are a few comments and observations.

COMMENTS
========

(1) PG Docs catalogs.sgml

Documented new column "suberrmsg" but did not document the other new
columns ("disable_on_error", "disabled_by_error")?

------

(2) New column "disabled_by_error".

I wondered if there was actually any need for this column. Isn't the
same information conveyed by just having "subenabled" = false at the
same time as a non-empty "suberrmsg"? This would remove any confusion
from having two booleans which both indicate disabled.

------

(3) New columns "disable_on_error", "disabled_by_error".

All other columns of pg_subscription have a "sub" prefix.

------

(4) errhint member used?

@@ -91,12 +100,16 @@ typedef struct Subscription
  char    *name; /* Name of the subscription */
  Oid owner; /* Oid of the subscription owner */
  bool enabled; /* Indicates if the subscription is enabled */
+ bool disable_on_error; /* Whether errors automatically disable */
+ bool disabled_by_error; /* Whether an error has disabled */
  bool binary; /* Indicates if the subscription wants data in
  * binary format */
  bool stream; /* Allow streaming in-progress transactions. */
  char    *conninfo; /* Connection string to the publisher */
  char    *slotname; /* Name of the replication slot */
  char    *synccommit; /* Synchronous commit setting for worker */
+ char    *errmsg; /* Message from error which disabled */
+ char    *errhint; /* Hint from error which disabled */
  List    *publications; /* List of publication names to subscribe to */
 } Subscription;

I did not find any code using that newly added member "errhint".

------

(5) dump.c

i. No mention of the new columns "disable_on_error" and
"disabled_by_error". Is that right?

ii. Shouldn't the code for the "suberrmsg" be qualified with some PG
version number checks?

------

(6) The patch handles errors only from the apply worker.

Tablesync can give similar errors (e.g. a PK violation during the
DATASYNC phase) which will trigger a relaunch forever regardless of
the setting of "disable_on_error".
(Confirmed by the observations below.)

------

(7) TAP test code

+$node_subscriber->init(allows_streaming => 'logical');

AFAIK that "logical" configuration is not necessary for the subscriber side. So,

$node_subscriber->init();

////////////

Some Experiments/Observations
==============================

In general, I found this functionality is useful and it works as
advertised by your patch comment.

======

Test: Display pg_subscription with the new columns
Observation: As expected, but some of the new column names are not
prefixed like their peers.

test_sub=# \pset x
Expanded display is on.
test_sub=# select * from pg_subscription;
-[ RECORD 1 ]-----+--------------------------------------------------------
oid | 16394
subdbid | 16384
subname | tap_sub
subowner | 10
subenabled | t
disable_on_error | t
disabled_by_error | f
subbinary | f
substream | f
subconninfo | host=localhost dbname=test_pub application_name=tap_sub
subslotname | tap_sub
subsynccommit | off
suberrmsg |
subpublications | {tap_pub}

======

Test: Cause a PK violation during normal apply replication (when
"disable_on_error = true").
Observation: Apply worker stops. Subscription is disabled. Error
message is in the catalog.

2021-06-18 15:12:45.905 AEST [25904] LOG: edata is true for
subscription 'tap_sub': message = "duplicate key value violates unique
constraint "test_tab_pkey"", hint = "<NONE>"
2021-06-18 15:12:45.905 AEST [25904] LOG: logical replication apply
worker for subscription "tap_sub" will stop because the subscription
was disabled due to error
2021-06-18 15:12:45.905 AEST [25904] ERROR: duplicate key value
violates unique constraint "test_tab_pkey"
2021-06-18 15:12:45.905 AEST [25904] DETAIL: Key (a)=(1) already exists.
2021-06-18 15:12:45.908 AEST [19924] LOG: background worker "logical
replication worker" (PID 25904) exited with exit code 1

test_sub=# select * from pg_subscription;
-[ RECORD 1 ]-----+---------------------------------------------------------------
oid | 16394
subdbid | 16384
subname | tap_sub
subowner | 10
subenabled | f
disable_on_error | t
disabled_by_error | t
subbinary | f
substream | f
subconninfo | host=localhost dbname=test_pub application_name=tap_sub
subslotname | tap_sub
subsynccommit | off
suberrmsg | duplicate key value violates unique constraint
"test_tab_pkey"
subpublications | {tap_pub}

======

Test: Try to enable the subscription (without fixing the PK violation
problem).
Observation: OK. It just stops again.

test_sub=# alter subscription tap_sub enable;
ALTER SUBSCRIPTION
test_sub=# 2021-06-18 15:17:18.067 AEST [10228] LOG: logical
replication apply worker for subscription "tap_sub" has started
2021-06-18 15:17:18.078 AEST [10228] LOG: edata is true for
subscription 'tap_sub': message = "duplicate key value violates unique
constraint "test_tab_pkey"", hint = "<NONE>"
2021-06-18 15:17:18.078 AEST [10228] LOG: logical replication apply
worker for subscription "tap_sub" will stop because the subscription
was disabled due to error
2021-06-18 15:17:18.078 AEST [10228] ERROR: duplicate key value
violates unique constraint "test_tab_pkey"
2021-06-18 15:17:18.078 AEST [10228] DETAIL: Key (a)=(1) already exists.
2021-06-18 15:17:18.079 AEST [19924] LOG: background worker "logical
replication worker" (PID 10228) exited with exit code 1

======

Test: Manually disable the subscription (which had previously already
been disabled due to error)
Observation: OK. The suberrmsg gets reset to an empty string.

alter subscription tap_sub disable;

=====

Test: Turn off disable_on_error.
Observation: As expected, the apply worker now goes into a
relaunch-forever loop every time it hits the PK violation.

test_sub=# alter subscription tap_sub set (disable_on_error=false);
ALTER SUBSCRIPTION

...

======

Test: Cause a PK violation in the Tablesync copy (DATASYNC) phase.
(when disable_on_error = true)
Observation: This patch changes nothing for this case. The tablesync
worker relaunches in a forever loop, the same as the current
functionality.

test_sub=# CREATE SUBSCRIPTION tap_sub CONNECTION 'host=localhost
dbname=test_pub application_name=tap_sub' PUBLICATION tap_pub WITH
(disable_on_error=false);
NOTICE: created replication slot "tap_sub" on publisher
CREATE SUBSCRIPTION
test_sub=# 2021-06-18 15:38:19.547 AEST [18205] LOG: logical
replication apply worker for subscription "tap_sub" has started
2021-06-18 15:38:19.557 AEST [18207] LOG: logical replication table
synchronization worker for subscription "tap_sub", table "test_tab"
has started
2021-06-18 15:38:19.610 AEST [18207] ERROR: duplicate key value
violates unique constraint "test_tab_pkey"
2021-06-18 15:38:19.610 AEST [18207] DETAIL: Key (a)=(1) already exists.
2021-06-18 15:38:19.610 AEST [18207] CONTEXT: COPY test_tab, line 1
2021-06-18 15:38:19.611 AEST [19924] LOG: background worker "logical
replication worker" (PID 18207) exited with exit code 1
2021-06-18 15:38:24.634 AEST [18369] LOG: logical replication table
synchronization worker for subscription "tap_sub", table "test_tab"
has started
2021-06-18 15:38:24.689 AEST [18369] ERROR: duplicate key value
violates unique constraint "test_tab_pkey"
2021-06-18 15:38:24.689 AEST [18369] DETAIL: Key (a)=(1) already exists.
2021-06-18 15:38:24.689 AEST [18369] CONTEXT: COPY test_tab, line 1
2021-06-18 15:38:24.690 AEST [19924] LOG: background worker "logical
replication worker" (PID 18369) exited with exit code 1
2021-06-18 15:38:29.701 AEST [18521] LOG: logical replication table
synchronization worker for subscription "tap_sub", table "test_tab"
has started
2021-06-18 15:38:29.765 AEST [18521] ERROR: duplicate key value
violates unique constraint "test_tab_pkey"
2021-06-18 15:38:29.765 AEST [18521] DETAIL: Key (a)=(1) already exists.
2021-06-18 15:38:29.765 AEST [18521] CONTEXT: COPY test_tab, line 1
2021-06-18 15:38:29.766 AEST [19924] LOG: background worker "logical
replication worker" (PID 18521) exited with exit code 1
etc...

-[ RECORD 1 ]-----+--------------------------------------------------------
oid | 16399
subdbid | 16384
subname | tap_sub
subowner | 10
subenabled | t
disable_on_error | f
disabled_by_error | f
subbinary | f
substream | f
subconninfo | host=localhost dbname=test_pub application_name=tap_sub
subslotname | tap_sub
subsynccommit | off
suberrmsg |
subpublications | {tap_pub}

------
Kind Regards,
Peter Smith.
Fujitsu Australia

#4 Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Amit Kapila (#2)
Re: Optionally automatically disable logical replication subscriptions on error

On Jun 17, 2021, at 9:47 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

(a) The patch
seem to be assuming that the error can happen only by the apply worker
but I think the constraint violation can happen via one of the table
sync workers as well

You are right. Peter mentioned the same thing, and it is clearly so. I am working to repair this fault in v2 of the patch.

(b) What happens if the error happens when you
are updating the error information in the catalog table.

I think that is an entirely different kind of error. The patch attempts to catch errors caused by the user, not by core functionality of the system failing. If there is a fault that prevents the catalogs from being updated, it is unclear what the patch can do about that.

I think that instead of seeing the actual apply-time error, the user
might see some other error, for which it won't be clear what the
appropriate action is.

Good point.

Before trying to do much of anything with the caught error, the v2 patch logs the error. If the subsequent efforts to disable the subscription fail, at least the logs should contain the initial failure message. The v1 patch emitted a log message much further down, and really just intended for debugging the patch itself, with many opportunities for something else to throw before the log is written.

We are also discussing another action like skipping the apply of the
transaction on an error [1]. I think it is better to evaluate both the
proposals as one seems to be an extension of another.

Thanks for the link.

I think they are two separate options. For some users and data patterns, subscriber-side skipping of specific problematic commits will be fine. For other usage patterns, skipping earlier commits will result in more and more data integrity problems (foreign key references, etc.), such that the failures will snowball, with skipping becoming the norm rather than the exception. Users with those usage patterns would likely prefer the subscription to be automatically disabled until manual intervention can clean up the problem.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#5 Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Peter Smith (#3)
Re: Optionally automatically disable logical replication subscriptions on error

On Jun 17, 2021, at 11:34 PM, Peter Smith <smithpb2250@gmail.com> wrote:

I tried your patch.

Thanks for the quick and thorough review!

(2) New column "disabled_by_error".

I wondered if there was actually any need for this column. Isn't the
same information conveyed by just having "subenabled" = false at the
same time as a non-empty "suberrmsg"? This would remove any confusion
from having two booleans which both indicate disabled.

Yeah, I wondered about that before posting v1. I removed the disabled_by_error field for v2.

(3) New columns "disable_on_error", "disabled_by_error".

All other columns of pg_subscription have a "sub" prefix.

I don't feel strongly about this. How about "subdisableonerr"? I used that in v2.

I did not find any code using that newly added member "errhint".

Thanks for catching that. I had tried to remove all references to "errhint" before posting v1. The original idea was that both the message and hint of the error would be kept, but in testing I found the hint field was typically empty, so I removed it. Sorry that I left one mention of it lying around.

(5) dump.c

I didn't bother getting pg_dump working before posting v1, and I still have not done so, as I mainly want to solicit feedback on whether the basic direction I am going will work for the community.

(6) The patch handles errors only from the apply worker.

Tablesync can give similar errors (e.g. a PK violation during the
DATASYNC phase) which will trigger a relaunch forever regardless of
the setting of "disable_on_error".
(Confirmed by the observations below.)

Yes, this is a good point, and also mentioned by Amit. I have fixed it in v2 and adjusted the regression test to trigger an automatic disabling for initial table sync as well as for change replication.

2021-06-18 15:12:45.905 AEST [25904] LOG: edata is true for
subscription 'tap_sub': message = "duplicate key value violates unique
constraint "test_tab_pkey"", hint = "<NONE>"

You didn't call this out, but FYI, I don't intend to leave this particular log message in the patch. It was for development only. I have removed it for v2 and have added a different log message much sooner after catching the error, to avoid squashing the error in case some other action fails.

The regression test shows this, if you open tmp_check/log/022_disable_on_error_subscriber.log:

2021-06-18 16:25:20.138 PDT [56926] LOG: logical replication subscription "s1" will be disabled due to error: duplicate key value violates unique constraint "s1_tbl_unique"
2021-06-18 16:25:20.139 PDT [56926] ERROR: duplicate key value violates unique constraint "s1_tbl_unique"
2021-06-18 16:25:20.139 PDT [56926] DETAIL: Key (i)=(1) already exists.
2021-06-18 16:25:20.139 PDT [56926] CONTEXT: COPY tbl, line 2

The first line logs the error prior to attempting to disable the subscription, and the next three lines are due to rethrowing the error after committing the successful disabling of the subscription. If the attempt to disable the subscription itself throws, these additional three lines won't show up, but the first one should. Amit mentioned this upthread. Do you think this will be ok, or would you like to also have a suberrdetail field so that the detail doesn't get lost? I haven't added such an extra field, and am inclined to think it would be excessive, but maybe others feel differently?

======

Test: Cause a PK violation in the Tablesync copy (DATASYNC) phase.
(when disable_on_error = true)
Observation: This patch changes nothing for this case. The tablesync
worker relaunches in a forever loop, the same as the current
functionality.

In v2, tablesync copy errors should also be caught. The test has been extended to cover this also.

Attachments:

v2-0001-Optionally-disabling-subscriptions-on-error.patch (application/octet-stream; +540 -6)
#6 Amit Kapila
amit.kapila16@gmail.com
In reply to: Mark Dilger (#4)
Re: Optionally automatically disable logical replication subscriptions on error

On Sat, Jun 19, 2021 at 1:06 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

On Jun 17, 2021, at 9:47 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

We are also discussing another action like skipping the apply of the
transaction on an error [1]. I think it is better to evaluate both the
proposals as one seems to be an extension of another.

Thanks for the link.

I think they are two separate options.

Right, but there are things that could be common from a design
perspective. For example, why is it mandatory to update this conflict
(error) information in the system catalog instead of displaying it
via some stats view? Also, why not log the xid of the failed
transaction as well?

--
With Regards,
Amit Kapila.

#7 Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Amit Kapila (#6)
Re: Optionally automatically disable logical replication subscriptions on error

On Jun 19, 2021, at 3:17 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Right, but there are things that could be common from the design
perspective.

I went to reconcile my patch with that from [1] only to discover there is no patch on that thread. Is there one in progress that I can see?

I don't mind trying to reconcile this patch with what you're discussing in [1], but I am a bit skeptical about [1] becoming a reality and I don't want to entirely hitch this patch to that effort. This can be committed with or without any solution to the idea in [1]. The original motivation for this patch was that the TAP tests don't have a great way to deal with a subscription getting into a fail-retry infinite loop, which makes it harder for me to make progress on [2]. That doesn't absolve me of the responsibility of making this patch a good one, but it does motivate me to keep it simple.

For example, why is it mandatory to update this conflict
( error) information in the system catalog instead of displaying it
via some stats view?

The catalog must be updated to disable the subscription, so placing the error information in the same row doesn't require any independent touching of the catalogs. Likewise, the catalog must be updated to re-enable the subscription, so clearing the error from that same row doesn't require any independent touching of the catalogs.

The error information does not *need* to be included in the catalog, but placing the information in any location that won't survive server restart leaves the user no information about why the subscription got disabled after a restart (or crash + restart) happens.

Furthermore, since v2 removed the "disabled_by_error" field in favor of just using subenabled + suberrmsg to determine if the subscription was automatically disabled, not having the information in the catalog would make it ambiguous whether the subscription was manually or automatically disabled.
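Concretely, with the v2 catalog shape the two cases could be distinguished with a query along these lines (a sketch only; "suberrmsg" is a column added by this patch and does not exist in stock PostgreSQL):

```sql
-- Disabled automatically: subenabled is false and an error was recorded.
-- Disabled manually: subenabled is false and suberrmsg is empty/cleared.
SELECT subname,
       subenabled,
       suberrmsg,
       (NOT subenabled AND suberrmsg <> '') AS disabled_by_error
FROM pg_subscription;
```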

Also, why not log the xid of the failed transaction as well?

We could also do that. Reading [1], it seems you are overly focused on user-facing xids. The errdetail in the examples I've been using for testing, and the one mentioned in [1], contain information about the conflicting data. I think users are more likely to understand that a particular primary key value cannot be replicated because it is not unique than to understand that a particular xid cannot be replicated. (Likewise for permissions errors.) For example:

2021-06-18 16:25:20.139 PDT [56926] ERROR: duplicate key value violates unique constraint "s1_tbl_unique"
2021-06-18 16:25:20.139 PDT [56926] DETAIL: Key (i)=(1) already exists.
2021-06-18 16:25:20.139 PDT [56926] CONTEXT: COPY tbl, line 2

This tells the user what they need to clean up before they can continue. Telling them which xid tried to apply the change, but not the change itself or the conflict itself, seems rather unhelpful. So at best, the xid is an additional piece of information. I'd rather have both the ERROR and DETAIL fields above and not the xid than have the xid and lack one of those two fields. Even so, I have not yet included the DETAIL field because I didn't want to bloat the catalog.

For the problem in [1], having the xid is more important than it is in my patch, because the user is expected in [1] to use the xid as a handle. But that seems like an odd interface to me. Imagine that a transaction on the publisher side inserted a batch of data, and only a subset of that data conflicts on the subscriber side. What advantage is there in skipping the entire transaction? Wouldn't the user rather skip just the problematic rows? I understand that on the subscriber side it is difficult to do so, but if you are going to implement this sort of thing, it makes more sense to allow the user to filter out data that is problematic rather than filtering out xids that are problematic, and the filter shouldn't just be an in-or-out filter, but rather a mapping function that can redirect the data someplace else, rewrite it before inserting, change the pre-existing conflicting data prior to applying the problematic data, or whatever. That's a huge effort, of course, but if the idea in [1] goes in that direction, I don't want my patch to have already added an xid field which ultimately nobody wants.

[1]: /messages/by-id/CAD21AoDeScrsHhLyEPYqN3sydg6PxAPVBboK=30xJfUVihNZDA@mail.gmail.com

[2]: /messages/by-id/915B995D-1D79-4E0A-BD8D-3B267925FCE9@enterprisedb.com

Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#8 Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Mark Dilger (#7)
Re: Optionally automatically disable logical replication subscriptions on error

On Jun 19, 2021, at 7:44 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:

Wouldn't the user rather skip just the problematic rows? I understand that on the subscriber side it is difficult to do so, but if you are going to implement this sort of thing, it makes more sense to allow the user to filter out data that is problematic rather than filtering out xids that are problematic, and the filter shouldn't just be an in-or-out filter, but rather a mapping function that can redirect the data someplace else or rewrite it before inserting or change the pre-existing conflicting data prior to applying the problematic data or whatever.

Thinking about this some more, it seems my patch already sets the stage for this sort of thing.

We could extend the concept of triggers to something like ErrorTriggers that could be associated with subscriptions. I already have the code catching errors for subscriptions where disable_on_error is true. We could use that same code path for subscriptions that have one or more BEFORE or AFTER ErrorTriggers defined. We could pass the trigger all the error context information along with the row and subscription information, and allow the trigger to either modify the data being replicated or make modifications to the table being changed. I think having support for both BEFORE and AFTER would be important, as a common design pattern might be to move aside the conflicting rows in the BEFORE trigger, then reconcile and merge them back into the table in the AFTER trigger. If the xid still cannot be replicated after one attempt using the triggers, a second attempt could disable the subscription instead.
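Purely as a strawman, syntax for such a thing might look like the sketch below. This is entirely hypothetical: no such ErrorTrigger syntax exists in the patch or in PostgreSQL, and the trigger names and functions are invented for illustration.

```sql
-- Hypothetical sketch only: move conflicting rows aside before a retry,
-- then reconcile and merge them back afterwards.
CREATE TRIGGER mysub_before_err
    BEFORE APPLY ERROR ON SUBSCRIPTION mysub
    EXECUTE FUNCTION move_conflict_aside();

CREATE TRIGGER mysub_after_err
    AFTER APPLY ERROR ON SUBSCRIPTION mysub
    EXECUTE FUNCTION reconcile_conflicts();
```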

There are a lot of details to consider, but to my mind this idea is much more user friendly than the idea that users should muck about with xids for arbitrarily many conflicting transactions.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#9 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Mark Dilger (#7)
Re: Optionally automatically disable logical replication subscriptions on error

On Sat, Jun 19, 2021 at 11:44 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

On Jun 19, 2021, at 3:17 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Right, but there are things that could be common from the design
perspective.

I went to reconcile my patch with that from [1] only to discover there is no patch on that thread. Is there one in progress that I can see?

I will submit the patch.

I don't mind trying to reconcile this patch with what you're discussing in [1], but I am a bit skeptical about [1] becoming a reality and I don't want to entirely hitch this patch to that effort. This can be committed with or without any solution to the idea in [1]. The original motivation for this patch was that the TAP tests don't have a great way to deal with a subscription getting into a fail-retry infinite loop, which makes it harder for me to make progress on [2]. That doesn't absolve me of the responsibility of making this patch a good one, but it does motivate me to keep it simple.

There was a discussion that the skipping-transaction patch would also
need a feature that tells users the details of the last failed
transaction, such as its XID, timestamp, action, etc. In that sense,
those two patches might need common infrastructure by which the apply
workers leave the error details somewhere the users can see them.

For example, why is it mandatory to update this conflict
( error) information in the system catalog instead of displaying it
via some stats view?

The catalog must be updated to disable the subscription, so placing the error information in the same row doesn't require any independent touching of the catalogs. Likewise, the catalog must be updated to re-enable the subscription, so clearing the error from that same row doesn't require any independent touching of the catalogs.

The error information does not *need* to be included in the catalog, but placing the information in any location that won't survive server restart leaves the user no information about why the subscription got disabled after a restart (or crash + restart) happens.

Furthermore, since v2 removed the "disabled_by_error" field in favor of just using subenabled + suberrmsg to determine if the subscription was automatically disabled, not having the information in the catalog would make it ambiguous whether the subscription was manually or automatically disabled.
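To make that concrete, here is a hedged sketch of how a user might distinguish the two cases under the v2 design described above (the suberrmsg column name and its semantics are taken from this description; the final shape of the catalog may differ):

```sql
-- Under the v2 design, subenabled = false together with a non-null
-- suberrmsg would mean the subscription was disabled automatically,
-- while subenabled = false with no suberrmsg would mean a manual disable.
SELECT subname, subenabled, suberrmsg
FROM pg_subscription
WHERE NOT subenabled
  AND suberrmsg IS NOT NULL;
```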

Is it really useful to write only the error message to the system catalog? Even if we can see an error message like 'duplicate key value violates unique constraint "test_tab_pkey"' in the system catalog, we will still need to check the server log for details in order to properly resolve the conflict. And if the user only wants to know whether the subscription was disabled manually or automatically, the error message in the catalog may not be necessary at all.

For the problem in [1], having the xid is more important than it is in my patch, because the user is expected in [1] to use the xid as a handle. But that seems like an odd interface to me. Imagine that a transaction on the publisher side inserted a batch of data, and only a subset of that data conflicts on the subscriber side. What advantage is there in skipping the entire transaction? Wouldn't the user rather skip just the problematic rows? I understand that on the subscriber side it is difficult to do so, but if you are going to implement this sort of thing, it makes more sense to allow the user to filter out data that is problematic rather than filtering out xids that are problematic, and the filter shouldn't just be an in-or-out filter, but rather a mapping function that can redirect the data someplace else or rewrite it before inserting or change the pre-existing conflicting data prior to applying the problematic data or whatever. That's a huge effort, of course, but if the idea in [1] goes in that direction, I don't want my patch to have already added an xid field which ultimately nobody wants.

The feature discussed in that thread is meant to be a repair tool for
the subscription in emergency cases when something that should not
have happened happened. I guess that resolving row (or column) level
conflict should be done in another way, for example, by defining
policies for each type of conflict.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#10Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Masahiko Sawada (#9)
Re: Optionally automatically disable logical replication subscriptions on error

On Jun 20, 2021, at 7:17 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I will submit the patch.

Great, thanks!

There was a discussion that the skipping transaction patch would also
need to have a feature that tells users the details of the last
failure transaction such as its XID, timestamp, action etc. In that
sense, those two patches might need the common infrastructure that the
apply workers leave the error details somewhere so that the users can
see it.

Right. Subscription on error triggers would need that, too, if we wrote them.

Is it really useful to write only the error message to the system catalog? Even if we can see an error message like 'duplicate key value violates unique constraint "test_tab_pkey"' in the system catalog, we will still need to check the server log for details in order to properly resolve the conflict. And if the user only wants to know whether the subscription was disabled manually or automatically, the error message in the catalog may not be necessary at all.

We can put more information in there. I don't feel strongly about it. I'll wait for your patch to see what infrastructure you need.

The feature discussed in that thread is meant to be a repair tool for
the subscription in emergency cases when something that should not
have happened happened. I guess that resolving row (or column) level
conflict should be done in another way, for example, by defining
policies for each type of conflict.

I understand that is the idea, but I'm having trouble believing it will work that way in practice. If somebody has a subscription that has gone awry, what reason do we have to believe there will only be one transaction that will need to be manually purged? It seems just as likely that there would be a million transactions that need to be purged, and creating an interface for users to manually review them and keep or discard on a case by case basis seems unworkable. Sure, you might have specific cases where the number of transactions to purge is small, but I don't like designing the feature around that assumption.

All the same, I'm looking forward to seeing your patch!


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11Amit Kapila
amit.kapila16@gmail.com
In reply to: Mark Dilger (#7)
Re: Optionally automatically disable logical replication subscriptions on error

On Sat, Jun 19, 2021 at 8:14 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

On Jun 19, 2021, at 3:17 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Also, why not also log the xid of the failed
transaction?

We could also do that. Reading [1], it seems you are overly focused on user-facing xids. The errdetail in the examples I've been using for testing, and the one mentioned in [1], contain information about the conflicting data. I think users are more likely to understand that a particular primary key value cannot be replicated because it is not unique than to understand that a particular xid cannot be replicated. (Likewise for permissions errors.) For example:

2021-06-18 16:25:20.139 PDT [56926] ERROR: duplicate key value violates unique constraint "s1_tbl_unique"
2021-06-18 16:25:20.139 PDT [56926] DETAIL: Key (i)=(1) already exists.
2021-06-18 16:25:20.139 PDT [56926] CONTEXT: COPY tbl, line 2

This tells the user what they need to clean up before they can continue. Telling them which xid tried to apply the change, but not the change itself or the conflict itself, seems rather unhelpful. So at best, the xid is an additional piece of information. I'd rather have both the ERROR and DETAIL fields above and not the xid than have the xid and lack one of those two fields. Even so, I have not yet included the DETAIL field because I didn't want to bloat the catalog.
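As a sketch of the workflow the patch envisions, the cleanup for the duplicate-key case above might look like this (the subscription name mysub is illustrative, and the actual resolution naturally depends on the conflict):

```sql
-- On the subscriber, after disable_on_error has disabled the subscription:
DELETE FROM tbl WHERE i = 1;        -- remove the conflicting row
ALTER SUBSCRIPTION mysub ENABLE;    -- clear the error and resume apply
```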

I never said that we don't need the error information. I think we need the xid along with other things.

For the problem in [1], having the xid is more important than it is in my patch, because the user is expected in [1] to use the xid as a handle. But that seems like an odd interface to me. Imagine that a transaction on the publisher side inserted a batch of data, and only a subset of that data conflicts on the subscriber side. What advantage is there in skipping the entire transaction? Wouldn't the user rather skip just the problematic rows?

I think skipping some changes but not others can leave the final
transaction data inconsistent. Say a transaction inserts a row and then
updates or deletes that same row; if the insert is skipped, we would
silently skip the update/delete as well unless the same row already
happens to be present on the subscriber. I think skipping the entire
transaction on user instruction would be safer than skipping just the
changes that lead to an error.
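The scenario described here can be sketched as a publisher-side transaction (table and column names are illustrative):

```sql
BEGIN;
-- Suppose this INSERT conflicts on the subscriber and is skipped:
INSERT INTO tbl (i, v) VALUES (1, 'a');
-- This UPDATE targets the row created above; with the INSERT skipped,
-- it either fails to find the row or, worse, modifies some pre-existing
-- row, leaving the subscriber silently inconsistent:
UPDATE tbl SET v = 'b' WHERE i = 1;
COMMIT;
```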

--
With Regards,
Amit Kapila.

#12Amit Kapila
amit.kapila16@gmail.com
In reply to: Mark Dilger (#10)
Re: Optionally automatically disable logical replication subscriptions on error

On Mon, Jun 21, 2021 at 7:56 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

On Jun 20, 2021, at 7:17 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I will submit the patch.

Great, thanks!

There was a discussion that the skipping transaction patch would also
need to have a feature that tells users the details of the last
failure transaction such as its XID, timestamp, action etc. In that
sense, those two patches might need the common infrastructure that the
apply workers leave the error details somewhere so that the users can
see it.

Right. Subscription on error triggers would need that, too, if we wrote them.

Is it really useful to write only the error message to the system catalog? Even if we can see an error message like 'duplicate key value violates unique constraint "test_tab_pkey"' in the system catalog, we will still need to check the server log for details in order to properly resolve the conflict. And if the user only wants to know whether the subscription was disabled manually or automatically, the error message in the catalog may not be necessary at all.

I think the two key points are (a) defining exactly what information
needs to be logged on error, and (b) deciding where to store that
information based on the requirements. For (b), I see that Mark is
inclined to use the existing catalog table. That is worth considering,
but I'm not sure it is the best way to deal with it. For example, if we
store the information in the catalog, we might need to store it in both
pg_subscription and pg_subscription_rel; otherwise we might overwrite
errors, which I think is what happens in the currently proposed patch.
Other possibilities would be to define a new catalog table to capture
the error information, or to log the required information via the stats
collector so that the user can see it through some stats view.

We can put more information in there. I don't feel strongly about it. I'll wait for your patch to see what infrastructure you need.

The feature discussed in that thread is meant to be a repair tool for
the subscription in emergency cases when something that should not
have happened happened. I guess that resolving row (or column) level
conflict should be done in another way, for example, by defining
policies for each type of conflict.

I understand that is the idea, but I'm having trouble believing it will work that way in practice. If somebody has a subscription that has gone awry, what reason do we have to believe there will only be one transaction that will need to be manually purged?

Because currently, we don't proceed after an error unless it is
resolved. Why do you think there could be multiple such transactions?

--
With Regards,
Amit Kapila.

#13Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#12)
Re: Optionally automatically disable logical replication subscriptions on error

On Mon, Jun 21, 2021 at 12:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jun 21, 2021 at 7:56 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

On Jun 20, 2021, at 7:17 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I will submit the patch.

Great, thanks!

There was a discussion that the skipping transaction patch would also
need to have a feature that tells users the details of the last
failure transaction such as its XID, timestamp, action etc. In that
sense, those two patches might need the common infrastructure that the
apply workers leave the error details somewhere so that the users can
see it.

Right. Subscription on error triggers would need that, too, if we wrote them.

Is it really useful to write only the error message to the system catalog? Even if we can see an error message like 'duplicate key value violates unique constraint "test_tab_pkey"' in the system catalog, we will still need to check the server log for details in order to properly resolve the conflict. And if the user only wants to know whether the subscription was disabled manually or automatically, the error message in the catalog may not be necessary at all.

I think the two key points are (a) defining exactly what information
needs to be logged on error,

When it comes to the patch for skipping transactions, it would
somewhat depend on how users specify transactions to skip. On the
other hand, for this patch, the minimal information would be whether
the subscription is disabled automatically by the server.

and (b) deciding where to store that
information based on the requirements. For (b), I see that Mark is
inclined to use the existing catalog table. That is worth considering,
but I'm not sure it is the best way to deal with it. For example, if we
store the information in the catalog, we might need to store it in both
pg_subscription and pg_subscription_rel; otherwise we might overwrite
errors, which I think is what happens in the currently proposed patch.
Other possibilities would be to define a new catalog table to capture
the error information, or to log the required information via the stats
collector so that the user can see it through some stats view.

This point is also related to whether that information needs to survive
a server crash (and restart). For the patch for skipping transactions,
there was a discussion that we don't necessarily need that, since the
tool will be used only in rare cases. But for this proposed patch, I
guess it would be useful if it does survive. It might be worth taking a
different approach in each patch. For example, we could send the details
of the last failed transaction to the stats collector while updating
subenabled to something like "automatically-disabled" instead of just
"false" (or use another column to show that the subscription was
disabled automatically by the server).

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#14Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Amit Kapila (#12)
Re: Optionally automatically disable logical replication subscriptions on error

On Jun 20, 2021, at 8:09 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Because currently, we don't proceed after an error unless it is
resolved. Why do you think there could be multiple such transactions?

Just as one example, if the subscriber has a unique index that the publisher lacks, any number of transactions could add non-unique data that then fails to apply on the subscriber. My patch took the view that the user should figure out how to get the subscriber side consistent with the publisher side, but if you instead take the approach that problematic commits should be skipped, it would seem that arbitrarily many such transactions could be committed on the publisher side.
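A minimal sketch of the setup Mark describes (all names are illustrative):

```sql
-- On the publisher: no uniqueness constraint.
CREATE TABLE tbl (i int);
CREATE PUBLICATION pub FOR TABLE tbl;

-- On the subscriber: the same table, plus a unique index the publisher lacks.
CREATE TABLE tbl (i int);
CREATE UNIQUE INDEX tbl_i_unique ON tbl (i);
-- Any number of publisher transactions can now insert duplicate values
-- of i, and each one in turn will fail to apply on the subscriber as
-- the preceding conflict is cleared.
```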


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#15Amit Kapila
amit.kapila16@gmail.com
In reply to: Mark Dilger (#14)
Re: Optionally automatically disable logical replication subscriptions on error

On Mon, Jun 21, 2021 at 10:24 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

On Jun 20, 2021, at 8:09 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Because currently, we don't proceed after an error unless it is
resolved. Why do you think there could be multiple such transactions?

Just as one example, if the subscriber has a unique index that the publisher lacks, any number of transactions could add non-unique data that then fails to apply on the subscriber.

Even then it will fail on the first such conflict, so even without
your patch the apply worker for the subscription won't be able to
proceed past the first error; it won't lead to multiple failing xids.
However, I do see a different case where there could be multiple
failing xids: during initial table sync, multiple workers can fail due
to some error. I am not sure your patch would be able to capture all
such failed transactions, because you are recording this information
in pg_subscription and not in pg_subscription_rel.

--
With Regards,
Amit Kapila.

#16Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Amit Kapila (#12)
Re: Optionally automatically disable logical replication subscriptions on error

On Jun 20, 2021, at 8:09 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

(a) to define exactly what all
information is required to be logged on error, (b) where do we want to
store the information based on requirements.

I'm not sure it has to be stored anywhere durable. I have a patch in the works to do something like:

create function foreign_key_insert_violation_before() returns conflict_trigger as $$
BEGIN
RAISE NOTICE 'elevel: %', TG_ELEVEL;
RAISE NOTICE 'sqlerrcode: %', TG_SQLERRCODE;
RAISE NOTICE 'message: %', TG_MESSAGE;
RAISE NOTICE 'detail: %', TG_DETAIL;
RAISE NOTICE 'detail_log: %', TG_DETAIL_LOG;
RAISE NOTICE 'hint: %', TG_HINT;
RAISE NOTICE 'schema: %', TG_SCHEMA_NAME;
RAISE NOTICE 'table: %', TG_TABLE_NAME;
RAISE NOTICE 'column: %', TG_COLUMN_NAME;
RAISE NOTICE 'datatype: %', TG_DATATYPE_NAME;
RAISE NOTICE 'constraint: %', TG_CONSTRAINT_NAME;

-- do something useful to prepare for retry of transaction
-- which raised a foreign key violation
END;
$$ language plpgsql;

create function foreign_key_insert_violation_after() returns conflict_trigger as $$
BEGIN
-- do something useful to clean up after retry of transaction
-- which raised a foreign key violation
END;
$$ language plpgsql;

create conflict trigger regress_conflict_trigger_insert on regress_conflictsub
before foreign_key_violation
when tag in ('INSERT')
execute procedure foreign_key_insert_violation_before();

create conflict trigger regress_conflict_trigger_insert on regress_conflictsub
after foreign_key_violation
when tag in ('INSERT')
execute procedure foreign_key_insert_violation_after();

The idea is that, for subscriptions that have conflict triggers defined, the apply will be wrapped in a PG_TRY()/PG_CATCH() block. If it fails, the ErrorData will be copied into the ConflictTriggerContext, and then the transaction will be attempted again, this time with any BEFORE and AFTER triggers applied. The triggers could then return a special result indicating whether the transaction should be permanently skipped, applied, or whatever. None of the data needs to be stored anywhere non-transient, since it just gets handed to the triggers.

I think the other patch is a subset of this functionality, as using this system to create triggers which query a table containing transactions to be skipped would be enough to get the functionality you've been discussing. But this system could also do other things, like modify data. Admittedly, this is akin to a statement level trigger and not a row level trigger, so a number of things you might want to do would be hard to do from this. But perhaps the equivalent of row level triggers could also be written?


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#17Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Amit Kapila (#15)
Re: Optionally automatically disable logical replication subscriptions on error

On Jun 20, 2021, at 10:11 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Then also it will fail on the first such conflict, so even without
your patch, the apply worker corresponding to the subscription won't
be able to proceed after the first error, it won't lead to multiple
failing xids.

I'm not sure we're talking about the same thing. I'm saying that if the user is expected to clear each error manually, there could be many such errors for them to clear. It may be true that the second error doesn't occur on the subscriber side until after the first is cleared, but that still leaves the user having to clear one after the next until arbitrarily many of them coming from the publisher side are cleared.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#18Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Amit Kapila (#15)
Re: Optionally automatically disable logical replication subscriptions on error

On Jun 20, 2021, at 10:11 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

However, I do see a different case where there could be multiple
failing xids: during initial table sync, multiple workers can fail due
to some error. I am not sure your patch would be able to capture all
such failed transactions, because you are recording this information
in pg_subscription and not in pg_subscription_rel.

Right, I wasn't trying to capture everything, just enough to give the user a reasonable indication of what went wrong. My patch was designed around the idea that the user would need to figure out how to fix the subscriber side prior to re-enabling the subscription. As such, I wasn't bothered with trying to store everything, just enough to give the user a clue where to look. I don't mind if you want to store more information, and maybe that needs to be stored somewhere else. Do you believe pg_subscription_rel is a suitable location?


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#19Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#13)
Re: Optionally automatically disable logical replication subscriptions on error

On Mon, Jun 21, 2021 at 9:21 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Jun 21, 2021 at 12:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jun 21, 2021 at 7:56 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

On Jun 20, 2021, at 7:17 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I will submit the patch.

Great, thanks!

There was a discussion that the skipping transaction patch would also
need to have a feature that tells users the details of the last
failure transaction such as its XID, timestamp, action etc. In that
sense, those two patches might need the common infrastructure that the
apply workers leave the error details somewhere so that the users can
see it.

Right. Subscription on error triggers would need that, too, if we wrote them.

Is it really useful to write only the error message to the system catalog? Even if we can see an error message like 'duplicate key value violates unique constraint "test_tab_pkey"' in the system catalog, we will still need to check the server log for details in order to properly resolve the conflict. And if the user only wants to know whether the subscription was disabled manually or automatically, the error message in the catalog may not be necessary at all.

I think the two key points are (a) defining exactly what information
needs to be logged on error,

When it comes to the patch for skipping transactions, it would
somewhat depend on how users specify transactions to skip. On the
other hand, for this patch, the minimal information would be whether
the subscription is disabled automatically by the server.

True, but there will still be some error-related information that we
want the user to see, unless we ask them to refer to the logs for it.

and (b) deciding where to store that
information based on the requirements. For (b), I see that Mark is
inclined to use the existing catalog table. That is worth considering,
but I'm not sure it is the best way to deal with it. For example, if we
store the information in the catalog, we might need to store it in both
pg_subscription and pg_subscription_rel; otherwise we might overwrite
errors, which I think is what happens in the currently proposed patch.
Other possibilities would be to define a new catalog table to capture
the error information, or to log the required information via the stats
collector so that the user can see it through some stats view.

This point is also related to whether that information needs to survive
a server crash (and restart). For the patch for skipping transactions,
there was a discussion that we don't necessarily need that, since the
tool will be used only in rare cases. But for this proposed patch, I
guess it would be useful if it does survive. It might be worth taking a
different approach in each patch. For example, we could send the details
of the last failed transaction to the stats collector while updating
subenabled to something like "automatically-disabled" instead of just
"false" (or use another column to show that the subscription was
disabled automatically by the server).

I agree it is worth considering making subenabled a tri-state (enabled,
disabled, automatically-disabled) value instead of just a boolean. But
in that case, if the stats collector misses the update, the user may
have to manually re-enable the subscription and let the error happen
again in order to see it.

--
With Regards,
Amit Kapila.

#20Amit Kapila
amit.kapila16@gmail.com
In reply to: Mark Dilger (#18)
Re: Optionally automatically disable logical replication subscriptions on error

On Mon, Jun 21, 2021 at 10:55 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

On Jun 20, 2021, at 10:11 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

However, I do see a different case where there could be multiple
failing xids: during initial table sync, multiple workers can fail due
to some error. I am not sure your patch would be able to capture all
such failed transactions, because you are recording this information
in pg_subscription and not in pg_subscription_rel.

Right, I wasn't trying to capture everything, just enough to give the user a reasonable indication of what went wrong. My patch was designed around the idea that the user would need to figure out how to fix the subscriber side prior to re-enabling the subscription. As such, I wasn't bothered with trying to store everything, just enough to give the user a clue where to look.

Okay, but the clue will be pretty random because you might end up just
logging one out of several errors.

I don't mind if you want to store more information, and maybe that needs to be stored somewhere else. Do you believe pg_subscription_rel is a suitable location?

It won't be sufficient to store the information in either
pg_subscription_rel or pg_subscription. If we want to store the
required information in a catalog, I think we need to define a new
catalog (pg_subscription_conflicts or something like that) with a row
for each relation in the subscription (srsubid oid (reference to the
subscription), srrelid oid (reference to the relation),
<columns for error_info>). OTOH, we could send the error information to
the stats collector, making it available via a stats view, and update
the system catalog to disable the subscription; but then there is a
risk that we send the failed transaction's info to the stats collector
and yet fail to update the system catalog to disable the subscription.
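Spelled out as DDL, the new catalog sketched here might look roughly like the following (a sketch only: real catalogs are defined in the catalog headers rather than with CREATE TABLE, and the error_info column is a placeholder for whatever fields get chosen):

```sql
CREATE TABLE pg_subscription_conflicts (
    srsubid  oid,   -- reference to the subscription
    srrelid  oid,   -- reference to the relation
    -- <columns for error_info>, for example:
    srerrmsg text   -- last apply error message for this relation
);
```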

--
With Regards,
Amit Kapila.

#113Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#112)
#114osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Amit Kapila (#113)
#115osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Peter Smith (#112)
#116Amit Kapila
amit.kapila16@gmail.com
In reply to: osumi.takamichi@fujitsu.com (#114)
#117Masahiko Sawada
sawada.mshk@gmail.com
In reply to: osumi.takamichi@fujitsu.com (#114)
#118Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#117)
#119Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#116)
#120Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#118)
#121Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#120)
#122osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Amit Kapila (#121)
#123osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Amit Kapila (#119)
#124osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Masahiko Sawada (#117)
#125osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Amit Kapila (#116)
#126Masahiko Sawada
sawada.mshk@gmail.com
In reply to: osumi.takamichi@fujitsu.com (#125)
#127Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#126)
#128osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Amit Kapila (#127)
#129Amit Kapila
amit.kapila16@gmail.com
In reply to: osumi.takamichi@fujitsu.com (#128)
#130Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#129)
#131osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Amit Kapila (#130)
#132Nathan Bossart
nathandbossart@gmail.com
In reply to: osumi.takamichi@fujitsu.com (#131)
#133osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Nathan Bossart (#132)