Subscription sometimes loses txns after initial table sync

Started by Pritam Baral, about 1 year ago, 4 messages
#1 Pritam Baral
pritam@pritambaral.com
1 attachment(s)

This was discovered when testing the plan for a major version upgrade via
logical replication. Said plan requires that some tables be synced before
others. So I implemented it using ALTER PUBLICATION ... ADD TABLE ... followed
by ALTER SUBSCRIPTION ... REFRESH PUBLICATION. A test for correctness revealed
that sometimes, for some tables added this way, txns after the initial data copy
are lost by the subscription.
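
For illustration only, the staged sequence described above might look roughly like the sketch below. It only builds the SQL text for one batch of tables. The publication, subscription, and table names are hypothetical, and unqualified table names are assumed. This is not the attached script.

```python
def quote_ident(name: str) -> str:
    """Double-quote an SQL identifier, doubling any embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def staged_sync_statements(publication, subscription, tables):
    """Return (publisher_sql, subscriber_sql) for one batch of tables.

    The first statement is meant to run on the publisher, the second on the
    subscriber, in that order.
    """
    if not tables:
        raise ValueError("each batch needs at least one table")
    table_list = ", ".join(quote_ident(t) for t in tables)
    publisher_sql = (
        f"ALTER PUBLICATION {quote_ident(publication)} ADD TABLE {table_list};"
    )
    subscriber_sql = (
        f"ALTER SUBSCRIPTION {quote_ident(subscription)} REFRESH PUBLICATION;"
    )
    return publisher_sql, subscriber_sql


if __name__ == "__main__":
    # Hypothetical batches: tables that must be synced first come first.
    batches = [["accounts"], ["orders", "order_items"]]
    for batch in batches:
        pub_sql, sub_sql = staged_sync_statements("upgrade_pub", "upgrade_sub", batch)
        print("-- on the publisher")
        print(pub_sql)
        print("-- on the subscriber")
        print(sub_sql)
```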

A reproducer script is attached. It has been tested with PG 17.2, 14.15, and
even 12.22 (on either side of the replication setup). The script runs at a
default scale of 100 tables with 10k inserts each. This scale is enough to
demonstrate a failure rate of 1% to 9% of tables on my modest laptop.
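
One way such a correctness test could be expressed is sketched below. It assumes per-table row counts have already been collected on both sides (for example with SELECT count(*)) after writes have stopped and the subscriber has had time to catch up. The function names and sample data are hypothetical, and the attached script may check something different.

```python
def tables_with_loss(publisher_counts, subscriber_counts):
    """Return the sorted names of tables with fewer rows on the subscriber."""
    return sorted(
        table
        for table, expected in publisher_counts.items()
        if subscriber_counts.get(table, 0) < expected
    )


def failure_rate(publisher_counts, subscriber_counts):
    """Fraction of published tables that are missing rows on the subscriber."""
    if not publisher_counts:
        return 0.0
    lost = tables_with_loss(publisher_counts, subscriber_counts)
    return len(lost) / len(publisher_counts)


if __name__ == "__main__":
    # Made-up sample at the default scale: 100 tables, 10k inserts each.
    publisher = {f"t{i}": 10_000 for i in range(100)}
    subscriber = dict(publisher)
    subscriber["t7"] = 4_200   # hypothetical table that stopped receiving changes
    subscriber["t42"] = 9_100  # another hypothetical lossy table
    print(tables_with_loss(publisher, subscriber))
    print(f"{failure_rate(publisher, subscriber):.0%} of tables lost rows")
```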

In attempts to analyse why this happens, it has been observed that the sender
sometimes does not pick up a published table, even when the receiver that
started the sender process has seen the table as available (as returned by
pg_get_publication_tables()) and has thus begun COPYing its data. When the COPY
finishes (and the tablesync worker is finished), the apply loop on the receiver
expects to receive (and apply) subsequent changes for such tables, but simply
isn't sent any. This was observed by dumping every CopyData message sent over
the wire.
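
A rough way to cross-check what each side believes is sketched below. The two queries read the pg_publication_tables view on the publisher and the pg_subscription_rel catalog on the subscriber. The publication name is hypothetical, and the helper only interprets rows that were fetched by some client beforehand. This is not how the analysis above was done, which relied on dumping the CopyData messages.

```python
# Run on the publisher: tables the publication currently contains.
PUBLISHED_TABLES_SQL = """
SELECT schemaname || '.' || tablename
FROM pg_publication_tables
WHERE pubname = 'upgrade_pub';  -- hypothetical publication name
"""

# Run on the subscriber: per-table sync state known to the subscription.
SUBSCRIPTION_STATES_SQL = """
SELECT n.nspname || '.' || c.relname, r.srsubstate
FROM pg_subscription_rel r
JOIN pg_class c ON c.oid = r.srrelid
JOIN pg_namespace n ON n.oid = c.relnamespace;
"""

# Meaning of the srsubstate codes in pg_subscription_rel.
SYNC_STATES = {
    "i": "initialize",
    "d": "data copy",
    "f": "finished table copy",
    "s": "synchronized",
    "r": "ready",
}


def describe_sync_states(published_tables, substates):
    """Map each published table to a readable subscriber-side sync state."""
    report = {}
    for table in published_tables:
        code = substates.get(table)
        if code is None:
            report[table] = "not in subscription"
        else:
            report[table] = SYNC_STATES.get(code, f"unknown ({code})")
    return report
```

A table that shows up as "ready" here and still misses rows would match the symptom described above: the subscriber considers the table fully synced, yet no further changes arrive for it.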

The attached script (like the original migration plan) uses a single publication
and adds tables to it successively. Curiously, when the script was changed to
use a dedicated publication per table (and thus, ALTER SUBSCRIPTION ... ADD
PUBLICATION instead of ALTER SUBSCRIPTION ... REFRESH PUBLICATION), the no. of
tables with data loss jumped to 100%.
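
For comparison, the per-table variant could be sketched as follows. The "pub_" naming scheme and the subscription name are hypothetical, and unqualified table names are assumed. As far as I know, ALTER SUBSCRIPTION ... ADD PUBLICATION is only available on subscribers running PostgreSQL 14 or later.

```python
def quote_ident(name: str) -> str:
    """Double-quote an SQL identifier, doubling any embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def per_table_statements(subscription, table):
    """Return (publisher_sql, subscriber_sql) using one publication per table."""
    publication = quote_ident(f"pub_{table}")  # hypothetical naming scheme
    publisher_sql = (
        f"CREATE PUBLICATION {publication} FOR TABLE {quote_ident(table)};"
    )
    subscriber_sql = (
        f"ALTER SUBSCRIPTION {quote_ident(subscription)} "
        f"ADD PUBLICATION {publication};"
    )
    return publisher_sql, subscriber_sql


if __name__ == "__main__":
    for table in ["accounts", "orders"]:
        pub_sql, sub_sql = per_table_statements("upgrade_sub", table)
        print("-- on the publisher")
        print(pub_sql)
        print("-- on the subscriber")
        print(sub_sql)
```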

--
#!/usr/bin/env regards
Chhatoi Pritam Baral

Attachments:

sub-loss-repro.sh (application/x-shellscript)

#2 Pritam Baral
pritam@pritambaral.com
In reply to: Pritam Baral (#1)
1 attachment(s)
Re: Subscription sometimes loses txns after initial table sync

On 09/12/24 18:50, Pritam Baral wrote:

> A reproducer script is attached.

Apologies. The aforementioned script is broken. It was a poor port from an internal application.

A corrected reproducer script is attached.

--
#!/usr/bin/env regards
Chhatoi Pritam Baral

Attachments:

sub-loss-repro2.sh (application/x-shellscript)

#3 Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Pritam Baral (#1)
RE: Subscription sometimes loses txns after initial table sync

On Monday, December 9, 2024 9:21 PM Pritam Baral <pritam@pritambaral.com> wrote:

> To: pgsql-hackers <pgsql-hackers@postgresql.org>
> Subject: Subscription sometimes loses txns after initial table sync
>
> This was discovered when testing the plan for a major version upgrade via
> logical replication. Said plan requires that some tables be synced before
> others. So I implemented it using ALTER PUBLICATION ... ADD TABLE ... followed
> by ALTER SUBSCRIPTION ... REFRESH PUBLICATION. A test for correctness revealed
> that sometimes, for some tables added this way, txns after the initial data copy
> are lost by the subscription.
>
> A reproducer script is attached. It has been tested with PG 17.2, 14.15, and
> even 12.22 (on either side of the replication setup). The script runs at a
> default scale of 100 tables with 10k inserts each. This scale is enough to
> demonstrate a failure rate of 1% to 9% of tables on my modest laptop.
>
> In attempts to analyse why this happens, it has been observed that the sender
> sometimes does not pick up a published table, even when the receiver that
> started the sender process has seen the table as available (as returned by
> pg_get_publication_tables()) and has thus begun COPYing its data. When the COPY
> finishes (and the tablesync worker is finished), the apply loop on the receiver
> expects to receive (and apply) subsequent changes for such tables, but simply
> isn't sent any. This was observed by dumping every CopyData message sent over
> the wire.
>
> The attached script (like the original migration plan) uses a single publication
> and adds tables to it successively. Curiously, when the script was changed to
> use a dedicated publication per table (and thus, ALTER SUBSCRIPTION ... ADD
> PUBLICATION instead of ALTER SUBSCRIPTION ... REFRESH PUBLICATION), the no. of
> tables with data loss jumped to 100%.

Thanks for reporting the issue.

The described behavior looks similar to another bug discussed in [1]. If
possible, could you please check whether the latest patch in that thread fixes
the bug you reported?

If it does, it would be helpful to share the feedback in that thread.

[1]: /messages/by-id/de52b282-1166-1180-45a2-8d8917ca74c6@enterprisedb.com

Best Regards,
Hou zj

#4 Shlok Kyal
shlok.kyal.oss@gmail.com
In reply to: Zhijie Hou (Fujitsu) (#3)
Re: Subscription sometimes loses txns after initial table sync

On Tue, 10 Dec 2024 at 07:24, Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

> Thanks for reporting the issue.
>
> The described behavior looks similar to another bug discussed in [1]. If
> possible, could you please check if the latest patch in that thread can fix the
> bug you reported?
>
> If it does, it would be helpful to share the feedback in that thread.
>
> [1] /messages/by-id/de52b282-1166-1180-45a2-8d8917ca74c6@enterprisedb.com

Hi,

I tried to reproduce the issue on the HEAD and REL_17_STABLE branches and
found it to be intermittent. I ran the script provided in [1] 50 times on
each branch and was able to reproduce the issue 4 times and 5 times,
respectively.
I then tested both branches after applying the patches in [2] and ran the
script 50 times. I was not able to reproduce the issue with the patches
applied.

I think, as Hou-san suggested, the patches in [2] can fix this issue.
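
As a rough sanity check on those numbers, the sketch below estimates how likely 50 clean runs in a row would be if the patches had changed nothing. It assumes the runs are independent and that the true per-run failure rate equals the observed one (4/50 on HEAD, 5/50 on REL_17_STABLE). Both are simplifying assumptions, so the result is only an order-of-magnitude indication.

```python
def prob_all_clean(failures, runs, clean_runs):
    """Chance of `clean_runs` consecutive clean runs at the observed failure rate."""
    if runs <= 0 or not 0 <= failures <= runs:
        raise ValueError("need 0 <= failures <= runs and runs > 0")
    per_run_failure = failures / runs
    return (1.0 - per_run_failure) ** clean_runs


if __name__ == "__main__":
    for branch, failures in [("HEAD", 4), ("REL_17_STABLE", 5)]:
        p = prob_all_clean(failures, runs=50, clean_runs=50)
        print(f"{branch}: {p:.2%} chance of 50 clean runs without a fix")
```

Under these assumptions the chance comes out at roughly 1.5% and 0.5% respectively, so 50 clean runs after patching are unlikely to be a coincidence, although they do not prove that the issue is gone.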

[1]: /messages/by-id/8b595156-d8b6-4b53-a788-7d945726cd2f@pritambaral.com
[2]: /messages/by-id/de52b282-1166-1180-45a2-8d8917ca74c6@enterprisedb.com

Thanks and Regards,
Shlok Kyal