Skipping logical replication transactions on subscriber side

Started by Masahiko Sawadaabout 5 years ago590 messageshackers

sawada.mshk@gmail.com

about 5 years ago

Hi all,

If a logical replication worker cannot apply the change on the
subscriber for some reason (e.g., missing table or violating a
constraint, etc.), logical replication stops until the problem is
resolved. Ideally, we resolve the problem on the subscriber (e.g., by
creating the missing table or removing the conflicting data, etc.) but
occasionally a problem cannot be fixed and it may be necessary to skip
the entire transaction in question. Currently, we have two ways to
skip transactions: advancing the LSN of the replication origin on the
subscriber and advancing the LSN of the replication slot on the
publisher. But both ways might not be able to skip exactly one
transaction in question and end up skipping other transactions too.

I’d like to propose a way to skip the particular transaction on the
subscriber side. As the first step, a transaction can be specified to
be skipped by specifying remote XID on the subscriber. This feature
would need two sub-features: (1) a sub-feature for users to identify
the problem subscription and the problem transaction’s XID, and (2) a
sub-feature to skip the particular transaction to apply.

For (1), I think the simplest way would be to put the details of the
change being applied in errcontext. For example, the following
errcontext shows the remote XID as well as the action name, the
relation name, and commit timestamp:

ERROR: duplicate key value violates unique constraint "test_pkey"
DETAIL: Key (c)=(1) already exists.
CONTEXT: during apply of "INSERT" for relation "public.test" in
transaction with xid 590 commit timestamp 2021-05-21
14:32:02.134273+09

The user can identify which remote XID has a problem during applying
the change (XID=590 in this case). As another idea, we can have a
statistics view for logical replication workers, showing information
of the last failure transaction.

For (2), what I'm thinking is to add a new action to ALTER
SUBSCRIPTION command like ALTER SUBSCRIPTION test_sub SET SKIP
TRANSACTION 590. Also, we can have actions to reset it; ALTER
SUBSCRIPTION test_sub RESET SKIP TRANSACTION. Those commands add the
XID to a new column of pg_subscription or a new catalog, having the
worker reread its subscription information. Once the worker skipped
the specified transaction, it resets the transaction to skip on the
catalog. The syntax allows users to specify one remote XID to skip. In
the future, it might be good if users can also specify multiple XIDs
(a range of XIDs or a list of XIDs, etc).

Feedback and comment are very welcome.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

Amit Kapila

amit.kapila16@gmail.com

about 5 years ago

In reply to: Masahiko Sawada (#1)

Re: Skipping logical replication transactions on subscriber side

On Mon, May 24, 2021 at 1:32 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

If a logical replication worker cannot apply the change on the
subscriber for some reason (e.g., missing table or violating a
constraint, etc.), logical replication stops until the problem is
resolved. Ideally, we resolve the problem on the subscriber (e.g., by
creating the missing table or removing the conflicting data, etc.) but
occasionally a problem cannot be fixed and it may be necessary to skip
the entire transaction in question. Currently, we have two ways to
skip transactions: advancing the LSN of the replication origin on the
subscriber and advancing the LSN of the replication slot on the
publisher. But both ways might not be able to skip exactly one
transaction in question and end up skipping other transactions too.

I’d like to propose a way to skip the particular transaction on the
subscriber side. As the first step, a transaction can be specified to
be skipped by specifying remote XID on the subscriber. This feature
would need two sub-features: (1) a sub-feature for users to identify
the problem subscription and the problem transaction’s XID, and (2) a
sub-feature to skip the particular transaction to apply.

For (1), I think the simplest way would be to put the details of the
change being applied in errcontext. For example, the following
errcontext shows the remote XID as well as the action name, the
relation name, and commit timestamp:

ERROR: duplicate key value violates unique constraint "test_pkey"
DETAIL: Key (c)=(1) already exists.
CONTEXT: during apply of "INSERT" for relation "public.test" in
transaction with xid 590 commit timestamp 2021-05-21
14:32:02.134273+09

In the above, the subscription name/id is not mentioned. I think you
need it for sub-feature-2.

The user can identify which remote XID has a problem during applying
the change (XID=590 in this case). As another idea, we can have a
statistics view for logical replication workers, showing information
of the last failure transaction.

It might be good to display at both places. Having subscriber-side
information in the view might be helpful in other ways as well like we
can use it to display the number of transactions processed by a
particular subscriber.

I think you need to consider few more things here:
(a) Say the error occurs after applying some part of changes, then
just skipping the remaining part won't be sufficient, we probably need
to someway rollback the applied changes (by rolling back the
transaction or in some other way).
(b) How do you handle streamed transactions? It is possible that some
of the streams are successful and the error occurs after that, say
when writing to the stream file. Now, would you skip writing to stream
file or will you write it, and then during apply, you will skip the
entire transaction and remove the corresponding stream file.
(c) There is also a possibility that the error occurs while applying
the changes of some subtransaction (this is only possible for
streaming xacts), so, in such cases, do we allow users to rollback the
subtransaction or user has to rollback the entire transaction. I am
not sure but maybe for very large transactions users might just want
to rollback the subtransaction.
(d) How about prepared transactions? Do we need to rollback the
prepared transaction if user decides to skip such a transaction? We
already allow prepared transactions to be streamed to plugins and the
work for subscriber-side apply is in progress [1]/messages/by-id/CAHut+PsDysQA=JWXb6oGFr1npvqi1e7RzzXV-juCCxnbiwHvfA@mail.gmail.com, so I think we need
to consider this case as well.
(e) Do we want to provide such a feature via output plugins as well,
if not, why?

For (2), what I'm thinking is to add a new action to ALTER
SUBSCRIPTION command like ALTER SUBSCRIPTION test_sub SET SKIP
TRANSACTION 590. Also, we can have actions to reset it; ALTER
SUBSCRIPTION test_sub RESET SKIP TRANSACTION. Those commands add the
XID to a new column of pg_subscription or a new catalog, having the
worker reread its subscription information. Once the worker skipped
the specified transaction, it resets the transaction to skip on the
catalog.

What if we fail while updating the reset information in the catalog?
Will it be the responsibility of the user to reset such a transaction
or we will retry it after restart of worker? Now, say, we give such a
responsibility to the user and the user forgets to reset it then there
is a possibility that after wraparound we will again skip the
transaction which is not intended. And, if we want to retry it after
restart of worker, how will the worker remember the previous failure?

I think this will be a useful feature but we need to consider few more things.

[1]: /messages/by-id/CAHut+PsDysQA=JWXb6oGFr1npvqi1e7RzzXV-juCCxnbiwHvfA@mail.gmail.com

--
With Regards,
Amit Kapila.

Bharath Rupireddy

bharath.rupireddyforpostgres@gmail.com

about 5 years ago

In reply to: Masahiko Sawada (#1)

Re: Skipping logical replication transactions on subscriber side

On Mon, May 24, 2021 at 1:32 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Hi all,

If a logical replication worker cannot apply the change on the
subscriber for some reason (e.g., missing table or violating a
constraint, etc.), logical replication stops until the problem is
resolved. Ideally, we resolve the problem on the subscriber (e.g., by
creating the missing table or removing the conflicting data, etc.) but
occasionally a problem cannot be fixed and it may be necessary to skip
the entire transaction in question. Currently, we have two ways to
skip transactions: advancing the LSN of the replication origin on the
subscriber and advancing the LSN of the replication slot on the
publisher. But both ways might not be able to skip exactly one
transaction in question and end up skipping other transactions too.

Does it mean pg_replication_origin_advance() can't skip exactly one
txn? I'm not familiar with the function or never used it though, I was
just searching for "how to skip a single txn in postgres" and ended up
in [1]https://www.postgresql.org/docs/devel/logical-replication-conflicts.html. Could you please give some more details on scenarios when we
can't skip exactly one txn? Is there any other way to advance the LSN,
something like directly updating the pg_replication_slots catalog?

[1]: https://www.postgresql.org/docs/devel/logical-replication-conflicts.html

I’d like to propose a way to skip the particular transaction on the
subscriber side. As the first step, a transaction can be specified to
be skipped by specifying remote XID on the subscriber. This feature
would need two sub-features: (1) a sub-feature for users to identify
the problem subscription and the problem transaction’s XID, and (2) a
sub-feature to skip the particular transaction to apply.

For (1), I think the simplest way would be to put the details of the
change being applied in errcontext. For example, the following
errcontext shows the remote XID as well as the action name, the
relation name, and commit timestamp:

ERROR: duplicate key value violates unique constraint "test_pkey"
DETAIL: Key (c)=(1) already exists.
CONTEXT: during apply of "INSERT" for relation "public.test" in
transaction with xid 590 commit timestamp 2021-05-21
14:32:02.134273+09

The user can identify which remote XID has a problem during applying
the change (XID=590 in this case). As another idea, we can have a
statistics view for logical replication workers, showing information
of the last failure transaction.

Agree with Amit on this. At times, it is difficult to look around in
the server logs, so it will be better to have it in both places.

For (2), what I'm thinking is to add a new action to ALTER
SUBSCRIPTION command like ALTER SUBSCRIPTION test_sub SET SKIP
TRANSACTION 590. Also, we can have actions to reset it; ALTER
SUBSCRIPTION test_sub RESET SKIP TRANSACTION. Those commands add the
XID to a new column of pg_subscription or a new catalog, having the
worker reread its subscription information. Once the worker skipped
the specified transaction, it resets the transaction to skip on the
catalog. The syntax allows users to specify one remote XID to skip. In
the future, it might be good if users can also specify multiple XIDs
(a range of XIDs or a list of XIDs, etc).

What's it like skipping a txn with txn id? Is it that the particular
txn is forced to commit or abort or just skipping some of the code in
the apply worker? IIUC, the behavior of RESET SKIP TRANSACTION is just
to forget the txn id specified in SET SKIP TRANSACTION right?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com