Disabled logical replication origin session causes primary key errors

Started by Shawn McCoy · 12 months ago · 17 messages · bugs
#1 Shawn McCoy
shawn.the.mccoy@gmail.com

Hello,

We have discovered a recent regression in the origin handling of logical
replication apply workers. We found that the worker resets its local
origin session information while processing an error that is then
silently handled, allowing the worker to continue. We suspect this was
introduced by the recent change made in the following thread:
/messages/by-id/TYAPR01MB5692FAC23BE40C69DA8ED4AFF5B92@TYAPR01MB5692.jpnprd01.prod.outlook.com

The logical replication apply worker initially sets up the origin
correctly. However, the first insert calls into a trigger that raises an
exception. Raising the exception executes the error callback, which
resets the origin session state. The exception is then silently handled,
returning execution to the apply worker. In the second reproduction, a
function-based index is used with the same result.

At this point, the apply worker can continue to commit these changes,
but it has cleared all local origin session state. As a result, the
remote-to-local LSN mapping of the origin is never updated, allowing
duplicate data to be applied.

This was tested and observed in at least these versions:

PostgreSQL 16.8
PostgreSQL 17.4

We provide a simple reproduction of the issue below in two separate use cases.

Regards,
Shawn McCoy, Drew Callahan, Scott Mead

Attachments:

- repro_local_origin_stuck.txt (text/plain)
- repro_local_origin_stuck_causing_duplicate_key.txt (text/plain)
#2 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Shawn McCoy (#1)
Re: Disabled logical replication origin session causes primary key errors

Hi,

Thank you for the report and sharing the reproducers.

On Fri, Apr 18, 2025 at 2:22 PM Shawn McCoy <shawn.the.mccoy@gmail.com> wrote:

Hello,

We have discovered a recent regression in the Origin handling of logical replication apply workers. We have found the cause of the issue was due to the worker resetting its local origin session information during the processing of an error that is
silently handled allowing the worker to continue. We suspect this is caused by the recent change made in the following thread, /messages/by-id/TYAPR01MB5692FAC23BE40C69DA8ED4AFF5B92@TYAPR01MB5692.jpnprd01.prod.outlook.com.

The logical replication apply worker will originally setup the origin correctly. However, on the first insert will call into the trigger which will raise an exception. This exception will execute the error callback that resets the origin session state. The exception will then be silently handled, returning execution back to the apply worker. In the second reproduction, a function based index is used with the same result.

At this point, the apply worker can continue to commit these changes, but has cleared all local origin session state. As a result, we will not update our remote to local LSN mapping of the origin. Allowing for duplicate data to be applied.

I agree with your analysis. When the subscriber restarts the logical
replication, changes that happened since the last acknowledged LSN
would be replicated again.

With commit 3f28b2fcac33f, we reset the replication origin in
apply_error_callback() but I guess moving it to the PG_CATCH() block
in start_apply() might work.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#3 Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#2)
Re: Disabled logical replication origin session causes primary key errors

On Sun, Apr 20, 2025 at 11:24 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Thank you for the report and sharing the reproducers.

On Fri, Apr 18, 2025 at 2:22 PM Shawn McCoy <shawn.the.mccoy@gmail.com> wrote:

Hello,

We have discovered a recent regression in the Origin handling of logical replication apply workers. We have found the cause of the issue was due to the worker resetting its local origin session information during the processing of an error that is
silently handled allowing the worker to continue. We suspect this is caused by the recent change made in the following thread, /messages/by-id/TYAPR01MB5692FAC23BE40C69DA8ED4AFF5B92@TYAPR01MB5692.jpnprd01.prod.outlook.com.

The logical replication apply worker will originally setup the origin correctly. However, on the first insert will call into the trigger which will raise an exception. This exception will execute the error callback that resets the origin session state. The exception will then be silently handled, returning execution back to the apply worker. In the second reproduction, a function based index is used with the same result.

At this point, the apply worker can continue to commit these changes, but has cleared all local origin session state. As a result, we will not update our remote to local LSN mapping of the origin. Allowing for duplicate data to be applied.

I agree with your analysis. When the subscriber restarts the logical
replication, changes that happened since the last acknowledged LSN
would be replicated again.

With commit 3f28b2fcac33f, we reset the replication origin in
apply_error_callback() but I guess moving it to the PG_CATCH() block
in start_apply() might work.

Right. We wrongly assumed in that commit that the apply worker would
exit after an ERROR, but as this case shows, the ERROR can be silently
handled. So, +1 for moving the replication origin reset to PG_CATCH in
start_apply.

--
With Regards,
Amit Kapila.

#4 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Amit Kapila (#3)
RE: Disabled logical replication origin session causes primary key errors

Dear members,

Thanks for reporting the issue.

Right. We have wrongly assumed in that commit that the apply worker
will exit after an ERROR, but as shown by this case, the ERROR could
be silently handled. So, +1, for moving replication origin reset to
PG_CATCH in start_apply.

I was the author of the original commit, so let me take the initiative.
While working on 3f28b2fcac, I could not find a path where an ERROR is
reported but the worker survives, which is why I added replorigin_reset()
in apply_error_callback(). In the reported case, however, the exception
is raised but the insert itself is committed, so the worker can continue
working.

The attached patches contain the proposed changes. I ran 1) the meson
tests, 2) the workloads provided in [1], and 3) the manual tests done in
the original thread [2], and all of them passed. The version is v2
because of self-review.

One note: geterrlevel() is removed in the HEAD patch but retained for
PG16/PG17. The function is exported, and APIs cannot be changed in back
branches.

What do you think?

[1]: /messages/by-id/CALsgZNCGARa2mcYNVTSj9uoPcJo-tPuWUGECReKpNgTpo31_Pw@mail.gmail.com
[2]: /messages/by-id/CAJpy0uAa3M9sC3GWTUA+YPTnYxEcorFt_7g6rXz05jNcAGdjgQ@mail.gmail.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

- v2-HEAD-0001-Fix-oversight-3f28b2f.patch (+9/-30)
- v2-PG16-PG17-0001-Fix-oversight-3f28b2f.patch (+9/-12)
#5 vignesh C
vignesh21@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#4)
Re: Disabled logical replication origin session causes primary key errors

On Mon, 21 Apr 2025 at 15:38, Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Dear members,

Thanks for reporting the issue.

Right. We have wrongly assumed in that commit that the apply worker
will exit after an ERROR, but as shown by this case, the ERROR could
be silently handled. So, +1, for moving replication origin reset to
PG_CATCH in start_apply.

I was an author of original commit, so let me take initiative. When I was working
for 3f28b2fcac, I could not find path which ERROR is reported but worker can
survive so that I added replorigin_reset() in apply_error_callback(). The reported
case, however, the exception could be raised but the insert itself is committed.
In this case worker can continue working.

Attached patches have proposed changes. I did 1) meson test, 2) workloads provided
in [1], and 3) manual tests done in original thread [2], and all of them could be
passed. The version is 2 because of the self-reviewing.

One note is that geterrlevel() is removed for HEAD patch but retained for PG16/PG17.
The function is exported, and APIs cannot be changed in back branches.

How do you feel?

I was able to reproduce the issue with the steps suggested by Shawn, and
your patch fixes it. Your suggested changes look good. One thought: do
you feel we should include a test for this?

Regards,
Vignesh

#6 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#4)
Re: Disabled logical replication origin session causes primary key errors

On Mon, Apr 21, 2025 at 3:08 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Dear members,

Thanks for reporting the issue.

Right. We have wrongly assumed in that commit that the apply worker
will exit after an ERROR, but as shown by this case, the ERROR could
be silently handled. So, +1, for moving replication origin reset to
PG_CATCH in start_apply.

I was an author of original commit, so let me take initiative. When I was working
for 3f28b2fcac, I could not find path which ERROR is reported but worker can
survive so that I added replorigin_reset() in apply_error_callback(). The reported
case, however, the exception could be raised but the insert itself is committed.
In this case worker can continue working.

Attached patches have proposed changes. I did 1) meson test, 2) workloads provided
in [1], and 3) manual tests done in original thread [2], and all of them could be
passed. The version is 2 because of the self-reviewing.

One note is that geterrlevel() is removed for HEAD patch but retained for PG16/PG17.
The function is exported, and APIs cannot be changed in back branches.

Thank you for the patch. The changes look reasonable to me. Can we add
some regression tests for that?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#7 Amit Kapila
amit.kapila16@gmail.com
In reply to: vignesh C (#5)
Re: Disabled logical replication origin session causes primary key errors

On Mon, Apr 21, 2025 at 9:47 PM vignesh C <vignesh21@gmail.com> wrote:

On Mon, 21 Apr 2025 at 15:38, Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Dear members,

Thanks for reporting the issue.

Right. We have wrongly assumed in that commit that the apply worker
will exit after an ERROR, but as shown by this case, the ERROR could
be silently handled. So, +1, for moving replication origin reset to
PG_CATCH in start_apply.

I was an author of original commit, so let me take initiative. When I was working
for 3f28b2fcac, I could not find path which ERROR is reported but worker can
survive so that I added replorigin_reset() in apply_error_callback(). The reported
case, however, the exception could be raised but the insert itself is committed.
In this case worker can continue working.

Attached patches have proposed changes. I did 1) meson test, 2) workloads provided
in [1], and 3) manual tests done in original thread [2], and all of them could be
passed. The version is 2 because of the self-reviewing.

One note is that geterrlevel() is removed for HEAD patch but retained for PG16/PG17.
The function is exported, and APIs cannot be changed in back branches.

How do you feel?

I was able to reproduce the issue with the steps suggested by Shawn
and your patch fixes the issue.

Thanks, Vignesh, for testing the fix. Shawn, it would be good if you
could confirm the fix at your end as well.

--
With Regards,
Amit Kapila.

#8 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Masahiko Sawada (#6)
RE: Disabled logical replication origin session causes primary key errors

Dear Sawada-san,

Thank you for the patch. The changes look reasonable to me. Can we add
some regression tests for that?

Yes, it should be added. I think it is enough to cover the first
scenario; done in 100_bugs.pl.
I also ran pgindent/pgperltidy. Because of self-review, the version
became v4.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

- v4-HEAD-0001-Fix-oversight-3f28b2f.patch (+87/-30)
- v4-PG16-17-0001-Fix-oversight-3f28b2f.patch (+87/-12)
#9 vignesh C
vignesh21@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#8)
Re: Disabled logical replication origin session causes primary key errors

On Tue, 22 Apr 2025 at 14:36, Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Dear Sawada-san,

Thank you for the patch. The changes look reasonable to me. Can we add
some regression tests for that?

Yeah it should be added. I think it is enough to add the first scenario. Done in
100_bugs.pl.
I also ran pgindent/pgperltidy. Because of self-reviewing the version became v4.

A couple of minor suggestions on the test:
1) I felt this is not required for this test as it has been verified many times:
+
+# Insert tuples and confirms replication works well
+$node_publisher->safe_psql('postgres', "INSERT INTO t1 VALUES (1);");
+
+$node_publisher->wait_for_catchup('regress_sub');
2) How about we change:
+# Confirms the origin can be advanced
+ok( $node_subscriber->poll_query_until(
+               'postgres',
+               "SELECT remote_lsn > '$remote_lsn' FROM
pg_catalog.pg_replication_origin_status os, pg_catalog.pg_subscription
s WHERE os.external_id = 'pg_' || s.oid AND s.subname =
'regress_sub'",
+               't')
+         or die "Timed out while waiting for replication origin to be
updated");

to:
$node_publisher->wait_for_catchup('regress_sub');

# Confirms the origin can be advanced
$result = $node_subscriber->safe_psql('postgres',
"SELECT remote_lsn > '$remote_lsn' FROM
pg_catalog.pg_replication_origin_status os, pg_catalog.pg_subscription
s WHERE os.external_id = 'pg_' || s.oid AND s.subname = 'regress_sub'"
);
is($result, 't',
'remote_lsn has advanced for apply worker raising an exception');

In that way, we need not wait on poll_query_until.

Regards,
Vignesh

#10 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: vignesh C (#9)
RE: Disabled logical replication origin session causes primary key errors

Dear Vignesh,

Couple of minor suggestions in the test:
1) I felt this is not required for this test as it has been verified many times:
+
+# Insert tuples and confirms replication works well
+$node_publisher->safe_psql('postgres', "INSERT INTO t1 VALUES (1);");
+
+$node_publisher->wait_for_catchup('regress_sub');

I added it to ensure remote_lsn has a valid value, but yes, it is not mandatory.

2) How about we change:
+# Confirms the origin can be advanced
+ok( $node_subscriber->poll_query_until(
+               'postgres',
+               "SELECT remote_lsn > '$remote_lsn' FROM
pg_catalog.pg_replication_origin_status os, pg_catalog.pg_subscription
s WHERE os.external_id = 'pg_' || s.oid AND s.subname =
'regress_sub'",
+               't')
+         or die "Timed out while waiting for replication origin to be
updated");

to:
$node_publisher->wait_for_catchup('regress_sub');

# Confirms the origin can be advanced
$result = $node_subscriber->safe_psql('postgres',
"SELECT remote_lsn > '$remote_lsn' FROM
pg_catalog.pg_replication_origin_status os, pg_catalog.pg_subscription
s WHERE os.external_id = 'pg_' || s.oid AND s.subname = 'regress_sub'"
);
is($result, 't',
'remote_lsn has advanced for apply worker raising an exception');

In that way, we need not wait on poll_query_until.

I intentionally used poll_query_until(), but I have no strong opinions.
Modified.

PSA new patches.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

- v5-PG16-PG17-0001-Fix-oversight-3f28b2f.patch (+83/-12)
- v5-HEAD-0001-Fix-oversight-3f28b2f.patch (+83/-30)
#11 vignesh C
vignesh21@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#10)
Re: Disabled logical replication origin session causes primary key errors

On Tue, 22 Apr 2025 at 16:31, Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Dear Vignesh,

Couple of minor suggestions in the test:
1) I felt this is not required for this test as it has been verified many times:
+
+# Insert tuples and confirms replication works well
+$node_publisher->safe_psql('postgres', "INSERT INTO t1 VALUES (1);");
+
+$node_publisher->wait_for_catchup('regress_sub');

I added it to ensure remote_lsn to the valid value, but yes it is not mandatory.

2) How about we change:
+# Confirms the origin can be advanced
+ok( $node_subscriber->poll_query_until(
+               'postgres',
+               "SELECT remote_lsn > '$remote_lsn' FROM
pg_catalog.pg_replication_origin_status os, pg_catalog.pg_subscription
s WHERE os.external_id = 'pg_' || s.oid AND s.subname =
'regress_sub'",
+               't')
+         or die "Timed out while waiting for replication origin to be
updated");

to:
$node_publisher->wait_for_catchup('regress_sub');

# Confirms the origin can be advanced
$result = $node_subscriber->safe_psql('postgres',
"SELECT remote_lsn > '$remote_lsn' FROM
pg_catalog.pg_replication_origin_status os, pg_catalog.pg_subscription
s WHERE os.external_id = 'pg_' || s.oid AND s.subname = 'regress_sub'"
);
is($result, 't',
'remote_lsn has advanced for apply worker raising an exception');

In that way, we need not wait on poll_query_until.

I intentionally used poll_query_until(), but I have no strong opinions.
Modified.

PSA new patches.

Thanks for the updated patch. I’ve reviewed and verified the changes
across the HEAD, PG17, and PG16 branches, and everything is working as
expected. I have no further comments.

Regards,
Vignesh

#12 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#10)
Re: Disabled logical replication origin session causes primary key errors

On Tue, Apr 22, 2025 at 4:01 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Dear Vignesh,

Couple of minor suggestions in the test:
1) I felt this is not required for this test as it has been verified many times:
+
+# Insert tuples and confirms replication works well
+$node_publisher->safe_psql('postgres', "INSERT INTO t1 VALUES (1);");
+
+$node_publisher->wait_for_catchup('regress_sub');

I added it to ensure remote_lsn to the valid value, but yes it is not mandatory.

2) How about we change:
+# Confirms the origin can be advanced
+ok( $node_subscriber->poll_query_until(
+               'postgres',
+               "SELECT remote_lsn > '$remote_lsn' FROM
pg_catalog.pg_replication_origin_status os, pg_catalog.pg_subscription
s WHERE os.external_id = 'pg_' || s.oid AND s.subname =
'regress_sub'",
+               't')
+         or die "Timed out while waiting for replication origin to be
updated");

to:
$node_publisher->wait_for_catchup('regress_sub');

# Confirms the origin can be advanced
$result = $node_subscriber->safe_psql('postgres',
"SELECT remote_lsn > '$remote_lsn' FROM
pg_catalog.pg_replication_origin_status os, pg_catalog.pg_subscription
s WHERE os.external_id = 'pg_' || s.oid AND s.subname = 'regress_sub'"
);
is($result, 't',
'remote_lsn has advanced for apply worker raising an exception');

In that way, we need not wait on poll_query_until.

I intentionally used poll_query_until(), but I have no strong opinions.
Modified.

PSA new patches.

Thank you for updating the patch! Here are some comments:

+# The bug was that the replication origin wasn’t updated whe
+# apply_error_callback() was called with elevel >= ERROR, and the apply worker
+# continued running afterward.

I think it would be better to mention that the problem happened when an
error was caught, for instance by a plpgsql function. How about
rewriting it as follows?

# The bug was that when an ERROR was caught, for instance by a PL/pgSQL
# function, the apply worker reset the replication origin but continued
# processing subsequent changes. This behavior resulted in a failure to
# update the replication origin during further apply operations.

---
+# Define an after-trigger function for the table insert. It can be fired even
+# by the apply worker and always raises an exception. This situation allows
+# worker continue after apply_error_callback() is called with elevel = ERROR.
+$node_subscriber->safe_psql(
+        'postgres', q{
+CREATE FUNCTION handle_exception_trigger()
+RETURNS TRIGGER AS $$
+BEGIN
+        BEGIN
+                -- Raise an exception
+                RAISE EXCEPTION 'This is a test exception';
+        EXCEPTION
+                WHEN OTHERS THEN
+                        RETURN NEW;
+        END;
+
+        RETURN NEW;
+END;
+$$ LANGUAGE plpgsql;
+});
+
+$node_subscriber->safe_psql(
+        'postgres', q{
+CREATE TRIGGER silent_exception_trigger
+AFTER INSERT OR UPDATE ON t1
+FOR EACH ROW
+EXECUTE FUNCTION handle_exception_trigger();
+
+ALTER TABLE t1 ENABLE ALWAYS TRIGGER silent_exception_trigger;
+});

How about rewriting the comment as follows?

# Create an AFTER INSERT trigger on the table that raises and subsequently
# handles an exception. Subsequent insertions will trigger this exception,
# causing the apply worker to invoke its error callback with an ERROR. However,
# since the error is caught within the trigger, the apply worker will continue
# processing changes.

And can we execute these queries in one safe_psql() call?

---
+# Obtain current remote_lsn value to check its advancement later
+my $remote_lsn = $node_subscriber->safe_psql('postgres',
+        "SELECT remote_lsn FROM
pg_catalog.pg_replication_origin_status os, pg_catalog.pg_subscription
s WHERE os.external_id = 'pg_' || s.oid AND s.subname = 'regress_sub'"
+);

It seems to make sense to me to get the remote_lsn value just before
executing INSERT after creating the trigger.

Is it conventional to always use schema-qualified catalog names in
regression tests? Looking at other tests in src/test/subscription, there
are only three cases:

% git grep pg_catalog src/test/subscription/t
src/test/subscription/t/001_rep_changes.pl: FROM pg_catalog.pg_stat_io
src/test/subscription/t/020_messages.pl: "SELECT COUNT(*) FROM
pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND
active='f'",
src/test/subscription/t/029_on_error.pl: "SELECT subenabled = false
FROM pg_catalog.pg_subscription WHERE subname = 'sub'"

ISTM that we don't necessarily need to make the catalog name schema-qualified.

---
We might want to stop both the publisher and the subscriber at the end
of the tests.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#13 Shawn McCoy
shawn.the.mccoy@gmail.com
In reply to: Amit Kapila (#7)
Re: Disabled logical replication origin session causes primary key errors

Thanks, Vignesh, for testing the fix. Shawn, it would be good if you
could confirm the fix at your end as well.

I tested the patch and confirmed on versions 16.8, 17.4 and HEAD that
the pgbench / functional index version repro is indeed fixed.

Thanks,
Shawn

#14 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Masahiko Sawada (#12)
RE: Disabled logical replication origin session causes primary key errors

Dear Sawada-san,

Thanks for the comments!

+# The bug was that the replication origin wasn’t updated whe
+# apply_error_callback() was called with elevel >= ERROR, and the apply worker
+# continued running afterward.

I think it would be better to mention the fact that the problem
happened when an error was caught for instance by a plpgsql function.
How about rewriting it as follows?

# The bug was that when an ERROR was caught, for instance by a PL/pgSQL
# function, the apply worker reset the replication origin but continued
# processing subsequent changes. This behavior resulted in a failure to
# update the replication origin during further apply operations.

I tried to describe the internal cause of the bug, but your version
describes the reported facts.
+1, replaced.

---
+# Define an after-trigger function for the table insert. It can be fired even
+# by the apply worker and always raises an exception. This situation allows
+# worker continue after apply_error_callback() is called with elevel = ERROR.
+$node_subscriber->safe_psql(
+        'postgres', q{
+CREATE FUNCTION handle_exception_trigger()
+RETURNS TRIGGER AS $$
+BEGIN
+        BEGIN
+                -- Raise an exception
+                RAISE EXCEPTION 'This is a test exception';
+        EXCEPTION
+                WHEN OTHERS THEN
+                        RETURN NEW;
+        END;
+
+        RETURN NEW;
+END;
+$$ LANGUAGE plpgsql;
+});
+
+$node_subscriber->safe_psql(
+        'postgres', q{
+CREATE TRIGGER silent_exception_trigger
+AFTER INSERT OR UPDATE ON t1
+FOR EACH ROW
+EXECUTE FUNCTION handle_exception_trigger();
+
+ALTER TABLE t1 ENABLE ALWAYS TRIGGER silent_exception_trigger;
+});

How about rewriting the comment as follows?

# Create an AFTER INSERT trigger on the table that raises and subsequently
# handles an exception. Subsequent insertions will trigger this exception,
# causing the apply worker to invoke its error callback with an ERROR. However,
# since the error is caught within the trigger, the apply worker will continue
# processing changes.

Fixed.

And can we execute these queries in one safe_psql() call?

Yes, that's possible. I intentionally separated them to make it clearer,
but I have no strong opinion. Fixed.

---
+# Obtain current remote_lsn value to check its advancement later
+my $remote_lsn = $node_subscriber->safe_psql('postgres',
+        "SELECT remote_lsn FROM
pg_catalog.pg_replication_origin_status os, pg_catalog.pg_subscription
s WHERE os.external_id = 'pg_' || s.oid AND s.subname = 'regress_sub'"
+);

It seems to make sense to me to get the remote_lsn value just before
executing INSERT after creating the trigger.

Moved.

Is it a conventional way to always use schema-qualified catalogs names
in regression tests? Looking at other tests in src/test/subscription,
there are only three cases:

% git grep pg_catalog src/test/subscription/t
src/test/subscription/t/001_rep_changes.pl: FROM
pg_catalog.pg_stat_io
src/test/subscription/t/020_messages.pl: "SELECT COUNT(*) FROM
pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND
active='f'",
src/test/subscription/t/029_on_error.pl: "SELECT subenabled = false
FROM pg_catalog.pg_subscription WHERE subname = 'sub'"

ISTM that we don't necessarily need to make the catalog name schema-qualified.

I referred to parts of Cluster.pm and 040_standby_failover_slots_sync.pl,
and they used the "pg_catalog" prefix. On further consideration, the
instances are created within the test and we make sure to connect to
them, so we can assume they are safe; OK to remove it.

---
We might want to stop both the publisher and the subscriber at the end
of the tests.

Oops, added.

The attached patch passes the tests in my environment, and pgperltidy reported OK.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

- v6-PG16-PG17-0001-Fix-oversight-3f28b2f.patch (+86/-12)
- v6-HEAD-0001-Fix-oversight-3f28b2f.patch (+86/-30)
#15 Amit Kapila
amit.kapila16@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#14)
Re: Disabled logical replication origin session causes primary key errors

On Wed, Apr 23, 2025 at 7:11 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

+# The bug was that the replication origin wasn’t updated whe
+# apply_error_callback() was called with elevel >= ERROR, and the apply worker
+# continued running afterward.

I think it would be better to mention the fact that the problem
happened when an error was caught for instance by a plpgsql function.
How about rewriting it as follows?

# The bug was that when an ERROR was caught, for instance by a PL/pgSQL
# function, the apply worker reset the replication origin but continued
# processing subsequent changes. This behavior resulted in a failure to
# update the replication origin during further apply operations.

I tried to describe the internal cause of the bug, but your version
describes the reported facts.
+1, replaced.

Pushed the patch after slightly changing the comments.

--
With Regards,
Amit Kapila.

#16 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#15)
Re: Disabled logical replication origin session causes primary key errors

On Wed, Apr 23, 2025 at 2:01 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Apr 23, 2025 at 7:11 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

+# The bug was that the replication origin wasn’t updated whe
+# apply_error_callback() was called with elevel >= ERROR, and the apply worker
+# continued running afterward.

I think it would be better to mention the fact that the problem
happened when an error was caught for instance by a plpgsql function.
How about rewriting it as follows?

# The bug was that when an ERROR was caught, for instance by a PL/pgSQL
# function, the apply worker reset the replication origin but continued
# processing subsequent changes. This behavior resulted in a failure to
# update the replication origin during further apply operations.

I tried to describe the internal cause of the bug, but your version
describes the reported facts.
+1, replaced.

Pushed the patch after slightly changing the comments.

Thank you!

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#17 Michael Paquier
michael@paquier.xyz
In reply to: Masahiko Sawada (#16)
Re: Disabled logical replication origin session causes primary key errors

On Wed, Apr 23, 2025 at 08:58:27AM -0700, Masahiko Sawada wrote:

On Wed, Apr 23, 2025 at 2:01 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Pushed the patch after slightly changing the comments.

Thank you!

+1.
--
Michael