Build-farm - intermittent error in 031_column_list.pl

Started by Peter Smith over 3 years ago - 18 messages
#1 Peter Smith <smithpb2250@gmail.com>

Hi hackers.

FYI, I saw that there was a recent Build-farm error on the "grison" machine [1].
[1]: https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=grison&br=HEAD

The error happened during the "subscriptionCheck" phase in the TAP test
t/031_column_list.pl.
This test file was added by this [2] commit.
[2]: https://github.com/postgres/postgres/commit/923def9a533a7d986acfb524139d8b9e5466d0a5

~~

I checked the history of failures for that TAP test t/031_column_list.pl
and found that this same error seems to have been happening
intermittently for at least the last 50 days.

Details of similar previous errors from the BF are listed below.

~~~

1. Details for system "grison" failure at stage subscriptionCheck,
snapshot taken 2022-05-18 18:11:45
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grison&dt=2022-05-18%2018%3A11%3A45

[22:02:08] t/029_on_error.pl .................. ok 25475 ms ( 0.01
usr 0.00 sys + 15.39 cusr 5.59 csys = 20.99 CPU)
# poll_query_until timed out executing this query:
# SELECT '0/1530588' <= replay_lsn AND state = 'streaming'
# FROM pg_catalog.pg_stat_replication
# WHERE application_name IN ('sub1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
#
# with stderr:
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 22.
[22:09:25] t/031_column_list.pl ...............
...
[22:02:47.887](1.829s) ok 22 - partitions with different replica identities not replicated correctly
Waiting for replication conn sub1's replay_lsn to pass 0/1530588 on publisher
[22:09:25.395](397.508s) # poll_query_until timed out executing this query:
# SELECT '0/1530588' <= replay_lsn AND state = 'streaming'
# FROM pg_catalog.pg_stat_replication
# WHERE application_name IN ('sub1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
#
# with stderr:
timed out waiting for catchup at t/031_column_list.pl line 728.
### Stopping node "publisher" using mode immediate

~~~

2. Details for system "xenodermus" failure at stage subscriptionCheck,
snapshot taken 2022-04-16 21:00:04
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=xenodermus&dt=2022-04-16%2021%3A00%3A04

[00:15:32] t/029_on_error.pl .................. ok 8278 ms ( 0.00
usr 0.00 sys + 1.33 cusr 0.55 csys = 1.88 CPU)
# poll_query_until timed out executing this query:
# SELECT '0/1543648' <= replay_lsn AND state = 'streaming'
# FROM pg_catalog.pg_stat_replication
# WHERE application_name IN ('sub1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
#
# with stderr:
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 25.
[00:22:30] t/031_column_list.pl ...............
...
[00:16:04.100](0.901s) ok 25 - partitions with different replica identities not replicated correctly
Waiting for replication conn sub1's replay_lsn to pass 0/1543648 on publisher
[00:22:29.923](385.823s) # poll_query_until timed out executing this query:
# SELECT '0/1543648' <= replay_lsn AND state = 'streaming'
# FROM pg_catalog.pg_stat_replication
# WHERE application_name IN ('sub1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
#
# with stderr:
timed out waiting for catchup at t/031_column_list.pl line 818.

~~~

3. Details for system "phycodurus" failure at stage subscriptionCheck,
snapshot taken 2022-04-05 17:30:04
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2022-04-05%2017%3A30%3A04

# poll_query_until timed out executing this query:
# SELECT '0/1528640' <= replay_lsn AND state = 'streaming'
# FROM pg_catalog.pg_stat_replication
# WHERE application_name IN ('sub1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
#
# with stderr:
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 22.
[20:50:25] t/031_column_list.pl ...............
...
ok 22 - partitions with different replica identities not replicated correctly
Waiting for replication conn sub1's replay_lsn to pass 0/1528640 on publisher
# poll_query_until timed out executing this query:
# SELECT '0/1528640' <= replay_lsn AND state = 'streaming'
# FROM pg_catalog.pg_stat_replication
# WHERE application_name IN ('sub1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
#
# with stderr:
timed out waiting for catchup at t/031_column_list.pl line 667.

~~~

4. Details for system "phycodurus" failure at stage subscriptionCheck,
snapshot taken 2022-04-05 17:30:04
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2022-04-05%2017%3A30%3A04

[20:43:04] t/030_sequences.pl ................. ok 11108 ms ( 0.00
usr 0.00 sys + 1.49 cusr 0.40 csys = 1.89 CPU)
# poll_query_until timed out executing this query:
# SELECT '0/1528640' <= replay_lsn AND state = 'streaming'
# FROM pg_catalog.pg_stat_replication
# WHERE application_name IN ('sub1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
#
# with stderr:
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 22.
[20:50:25] t/031_column_list.pl ...............
...
ok 22 - partitions with different replica identities not replicated correctly
Waiting for replication conn sub1's replay_lsn to pass 0/1528640 on publisher
# poll_query_until timed out executing this query:
# SELECT '0/1528640' <= replay_lsn AND state = 'streaming'
# FROM pg_catalog.pg_stat_replication
# WHERE application_name IN ('sub1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
#
# with stderr:
timed out waiting for catchup at t/031_column_list.pl line 667.

~~~

5. Details for system "grison" failure at stage subscriptionCheck,
snapshot taken 2022-04-03 18:11:39
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grison&dt=2022-04-03%2018%3A11%3A39

[22:28:00] t/030_sequences.pl ................. ok 22970 ms ( 0.01
usr 0.00 sys + 14.93 cusr 5.14 csys = 20.08 CPU)
# poll_query_until timed out executing this query:
# SELECT '0/1528CF0' <= replay_lsn AND state = 'streaming'
# FROM pg_catalog.pg_stat_replication
# WHERE application_name IN ('sub1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
#
# with stderr:
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 22.
[22:35:11] t/031_column_list.pl ...............
....
ok 22 - partitions with different replica identities not replicated correctly
Waiting for replication conn sub1's replay_lsn to pass 0/1528CF0 on publisher
# poll_query_until timed out executing this query:
# SELECT '0/1528CF0' <= replay_lsn AND state = 'streaming'
# FROM pg_catalog.pg_stat_replication
# WHERE application_name IN ('sub1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
#
# with stderr:
timed out waiting for catchup at t/031_column_list.pl line 667.

----
Kind Regards,
Peter Smith.
Fujitsu Australia

#2 Kyotaro Horiguchi <horikyota.ntt@gmail.com>
In reply to: Peter Smith (#1)
Re: Build-farm - intermittent error in 031_column_list.pl

At Thu, 19 May 2022 14:26:56 +1000, Peter Smith <smithpb2250@gmail.com> wrote in

Hi hackers.

FYI, I saw that there was a recent Build-farm error on the "grison" machine [1]
[1] https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=grison&br=HEAD

The error happened during the "subscriptionCheck" phase in the TAP test
t/031_column_list.pl.
This test file was added by this [2] commit.
[2] https://github.com/postgres/postgres/commit/923def9a533a7d986acfb524139d8b9e5466d0a5

What is happening in all of them looks like this: the name of a
publication created by CREATE PUBLICATION, with no failure reported, is
missing for a walsender that came later. It seems like CREATE PUBLICATION
can silently fail to create a publication, or the walsender somehow failed
to find the existing one.

~~

I checked the history of failures for that TAP test t/031_column_list.pl
and found that this same error seems to have been happening
intermittently for at least the last 50 days.

Details of similar previous errors from the BF are listed below.

~~~

1. Details for system "grison" failure at stage subscriptionCheck,
snapshot taken 2022-05-18 18:11:45
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grison&dt=2022-05-18%2018%3A11%3A45

[22:02:08] t/029_on_error.pl .................. ok 25475 ms ( 0.01
usr 0.00 sys + 15.39 cusr 5.59 csys = 20.99 CPU)
# poll_query_until timed out executing this query:
# SELECT '0/1530588' <= replay_lsn AND state = 'streaming'
# FROM pg_catalog.pg_stat_replication
# WHERE application_name IN ('sub1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
#
# with stderr:
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 22.
[22:09:25] t/031_column_list.pl ...............
...
[22:02:47.887](1.829s) ok 22 - partitions with different replica identities not replicated correctly
Waiting for replication conn sub1's replay_lsn to pass 0/1530588 on publisher
[22:09:25.395](397.508s) # poll_query_until timed out executing this query:
# SELECT '0/1530588' <= replay_lsn AND state = 'streaming'
# FROM pg_catalog.pg_stat_replication
# WHERE application_name IN ('sub1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
#
# with stderr:
timed out waiting for catchup at t/031_column_list.pl line 728.
### Stopping node "publisher" using mode immediate

2022-04-17 00:16:04.278 CEST [293659][client backend][4/270:0][031_column_list.pl] LOG: statement: CREATE PUBLICATION pub9 FOR TABLE test_part_d (a) WITH (publish_via_partition_root = true);
2022-04-17 00:16:04.279 CEST [293659][client backend][:0][031_column_list.pl] LOG: disconnection: session time: 0:00:00.002 user=bf database=postgres host=[local]

"CREATE PUBLICATION pub9" is executed at 00:16:04.278 on 293659 then
the session has been disconnected. But the following request for the
same publication fails due to the absense of the publication.

2022-04-17 00:16:08.147 CEST [293856][walsender][3/0:0][sub1] STATEMENT: START_REPLICATION SLOT "sub1" LOGICAL 0/153DB88 (proto_version '3', publication_names '"pub9"')
2022-04-17 00:16:08.148 CEST [293856][walsender][3/0:0][sub1] ERROR: publication "pub9" does not exist

~~~

2. Details for system "xenodermus" failure at stage subscriptionCheck,
snapshot taken 2022-04-16 21:00:04
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=xenodermus&dt=2022-04-16%2021%3A00%3A04

The same. pub9 is missing after creation.

~~~

3. Details for system "phycodurus" failure at stage subscriptionCheck,
snapshot taken 2022-04-05 17:30:04
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2022-04-05%2017%3A30%3A04

The same happens for pub7.

4. Details for system "phycodurus" failure at stage subscriptionCheck,
snapshot taken 2022-04-05 17:30:04
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2022-04-05%2017%3A30%3A04

Same. pub7 is missing.

5. Details for system "grison" failure at stage subscriptionCheck,
snapshot taken 2022-04-03 18:11:39
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grison&dt=2022-04-03%2018%3A11%3A39

Same. pub7 is missing.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#3 Amit Kapila <amit.kapila16@gmail.com>
In reply to: Kyotaro Horiguchi (#2)
Re: Build-farm - intermittent error in 031_column_list.pl

On Thu, May 19, 2022 at 12:28 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Thu, 19 May 2022 14:26:56 +1000, Peter Smith <smithpb2250@gmail.com> wrote in

Hi hackers.

FYI, I saw that there was a recent Build-farm error on the "grison" machine [1]
[1] https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=grison&br=HEAD

The error happened during the "subscriptionCheck" phase in the TAP test
t/031_column_list.pl.
This test file was added by this [2] commit.
[2] https://github.com/postgres/postgres/commit/923def9a533a7d986acfb524139d8b9e5466d0a5

What is happening in all of them looks like this: the name of a
publication created by CREATE PUBLICATION, with no failure reported, is
missing for a walsender that came later. It seems like CREATE PUBLICATION
can silently fail to create a publication, or the walsender somehow failed
to find the existing one.

Do you see anything in the LOGS which indicates CREATE PUBLICATION has failed?

~~

2022-04-17 00:16:04.278 CEST [293659][client backend][4/270:0][031_column_list.pl] LOG: statement: CREATE PUBLICATION pub9 FOR TABLE test_part_d (a) WITH (publish_via_partition_root = true);
2022-04-17 00:16:04.279 CEST [293659][client backend][:0][031_column_list.pl] LOG: disconnection: session time: 0:00:00.002 user=bf database=postgres host=[local]

"CREATE PUBLICATION pub9" is executed at 00:16:04.278 on 293659 then
the session has been disconnected. But the following request for the
same publication fails due to the absense of the publication.

2022-04-17 00:16:08.147 CEST [293856][walsender][3/0:0][sub1] STATEMENT: START_REPLICATION SLOT "sub1" LOGICAL 0/153DB88 (proto_version '3', publication_names '"pub9"')
2022-04-17 00:16:08.148 CEST [293856][walsender][3/0:0][sub1] ERROR: publication "pub9" does not exist

This happens after "ALTER SUBSCRIPTION sub1 SET PUBLICATION pub9". The
probable theory is that ALTER SUBSCRIPTION will lead to a restart of
the apply worker (which we can see in the LOGS as well) and, after the
restart, the apply worker will use the existing slot and replication
origin corresponding to the subscription. Now, it is possible that
before the restart the origin has not been updated and the WAL start
location points to a location prior to where PUBLICATION pub9 exists,
which can lead to such an error. Once this error occurs, the apply
worker will never be able to proceed and will always return the same
error. Does this make sense?

Unless you or others see a different theory, this seems to be an
existing problem in logical replication which is manifested by this
test. If we just want to fix these test failures, we can create a new
subscription instead of altering the existing subscription to point to
the new publication.

Note: Added Tomas to know his views as he has committed this test.

--
With Regards,
Amit Kapila.

#4 Amit Kapila <amit.kapila16@gmail.com>
In reply to: Amit Kapila (#3)
Re: Build-farm - intermittent error in 031_column_list.pl

On Thu, May 19, 2022 at 3:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, May 19, 2022 at 12:28 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Thu, 19 May 2022 14:26:56 +1000, Peter Smith <smithpb2250@gmail.com> wrote in

Hi hackers.

FYI, I saw that there was a recent Build-farm error on the "grison" machine [1]
[1] https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=grison&br=HEAD

The error happened during the "subscriptionCheck" phase in the TAP test
t/031_column_list.pl.
This test file was added by this [2] commit.
[2] https://github.com/postgres/postgres/commit/923def9a533a7d986acfb524139d8b9e5466d0a5

What is happening in all of them looks like this: the name of a
publication created by CREATE PUBLICATION, with no failure reported, is
missing for a walsender that came later. It seems like CREATE PUBLICATION
can silently fail to create a publication, or the walsender somehow failed
to find the existing one.

Do you see anything in the LOGS which indicates CREATE PUBLICATION has failed?

~~

2022-04-17 00:16:04.278 CEST [293659][client backend][4/270:0][031_column_list.pl] LOG: statement: CREATE PUBLICATION pub9 FOR TABLE test_part_d (a) WITH (publish_via_partition_root = true);
2022-04-17 00:16:04.279 CEST [293659][client backend][:0][031_column_list.pl] LOG: disconnection: session time: 0:00:00.002 user=bf database=postgres host=[local]

"CREATE PUBLICATION pub9" is executed at 00:16:04.278 on 293659 then
the session has been disconnected. But the following request for the
same publication fails due to the absense of the publication.

2022-04-17 00:16:08.147 CEST [293856][walsender][3/0:0][sub1] STATEMENT: START_REPLICATION SLOT "sub1" LOGICAL 0/153DB88 (proto_version '3', publication_names '"pub9"')
2022-04-17 00:16:08.148 CEST [293856][walsender][3/0:0][sub1] ERROR: publication "pub9" does not exist

This happens after "ALTER SUBSCRIPTION sub1 SET PUBLICATION pub9". The
probable theory is that ALTER SUBSCRIPTION will lead to a restart of
the apply worker (which we can see in the LOGS as well) and, after the
restart, the apply worker will use the existing slot and replication
origin corresponding to the subscription. Now, it is possible that
before the restart the origin has not been updated and the WAL start
location points to a location prior to where PUBLICATION pub9 exists,
which can lead to such an error. Once this error occurs, the apply
worker will never be able to proceed and will always return the same
error. Does this make sense?

Unless you or others see a different theory, this seems to be an
existing problem in logical replication which is manifested by this
test. If we just want to fix these test failures, we can create a new
subscription instead of altering the existing subscription to point to
the new publication.

If the above theory is correct then I think allowing the publisher to
catch up with "$node_publisher->wait_for_catchup('sub1');" before
ALTER SUBSCRIPTION should fix this problem. Because if before the
ALTER both publisher and subscriber are in sync, then the new
publication should be visible to the WALSender.
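For concreteness, here is a minimal TAP-style sketch of that fix,
modeled on the pub9 statement from the logs above (the exact statements
surrounding the wait in 031_column_list.pl are assumptions):

# Hypothetical excerpt; the CREATE PUBLICATION is taken from the quoted logs.
$node_publisher->safe_psql('postgres',
    "CREATE PUBLICATION pub9 FOR TABLE test_part_d (a) WITH (publish_via_partition_root = true)");

# Make sure the subscriber has replayed past the CREATE PUBLICATION
# before the subscription is pointed at it, so the restarted apply
# worker cannot start decoding from an LSN where pub9 does not exist yet.
$node_publisher->wait_for_catchup('sub1');

$node_subscriber->safe_psql('postgres',
    "ALTER SUBSCRIPTION sub1 SET PUBLICATION pub9");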

--
With Regards,
Amit Kapila.

#5 Kyotaro Horiguchi <horikyota.ntt@gmail.com>
In reply to: Amit Kapila (#4)
Re: Build-farm - intermittent error in 031_column_list.pl

At Thu, 19 May 2022 16:42:31 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in

On Thu, May 19, 2022 at 3:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

This happens after "ALTER SUBSCRIPTION sub1 SET PUBLICATION pub9". The
probable theory is that ALTER SUBSCRIPTION will lead to a restart of
the apply worker (which we can see in the LOGS as well) and, after the restart,

Yes.

the apply worker will use the existing slot and replication origin
corresponding to the subscription. Now, it is possible that before
the restart the origin has not been updated and the WAL start location
points to a location prior to where PUBLICATION pub9 exists, which can
lead to such an error. Once this error occurs, the apply worker will
never be able to proceed and will always return the same error. Does
this make sense?

Wow. I hadn't thought along that line. That theory explains the silence
and makes sense even though I don't see LSN transitions that clearly
support it. I dimly remember a similar kind of problem...

Unless you or others see a different theory, this seems to be an
existing problem in logical replication which is manifested by this
test. If we just want to fix these test failures, we can create a new
subscription instead of altering the existing subscription to point to
the new publication.

If the above theory is correct then I think allowing the publisher to
catch up with "$node_publisher->wait_for_catchup('sub1');" before
ALTER SUBSCRIPTION should fix this problem. Because if before the
ALTER both publisher and subscriber are in sync, then the new
publication should be visible to the WALSender.

It looks right to me. That time travel seems unintuitive, but it's the
(current) way it works.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#6 Amit Kapila <amit.kapila16@gmail.com>
In reply to: Kyotaro Horiguchi (#5)
Re: Build-farm - intermittent error in 031_column_list.pl

On Fri, May 20, 2022 at 6:58 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

the apply worker will use the existing slot and replication origin
corresponding to the subscription. Now, it is possible that before
the restart the origin has not been updated and the WAL start location
points to a location prior to where PUBLICATION pub9 exists, which can
lead to such an error. Once this error occurs, the apply worker will
never be able to proceed and will always return the same error. Does
this make sense?

Wow. I hadn't thought along that line. That theory explains the silence
and makes sense even though I don't see LSN transitions that clearly
support it. I dimly remember a similar kind of problem...

Unless you or others see a different theory, this seems to be an
existing problem in logical replication which is manifested by this
test. If we just want to fix these test failures, we can create a new
subscription instead of altering the existing subscription to point to
the new publication.

If the above theory is correct then I think allowing the publisher to
catch up with "$node_publisher->wait_for_catchup('sub1');" before
ALTER SUBSCRIPTION should fix this problem. Because if before the
ALTER both publisher and subscriber are in sync, then the new
publication should be visible to the WALSender.

It looks right to me.

Let's wait for Tomas or others working in this area to share their thoughts.

That time travel seems unintuitive, but it's the
(current) way it works.

I have thought about it but couldn't come up with a good way to change
the way it currently works. Moreover, I think it is easy to hit this
in other ways as well. Say, you first create a subscription with a
non-existent publication and then perform an operation on any unrelated
table on the publisher before creating the required publication; we
will hit exactly this problem of "publication does not exist". So I
think we may need to live with this behavior and write tests carefully.
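To make that scenario concrete, a hedged TAP-style sketch (the node,
table, publication names, and $publisher_connstr variable here are
hypothetical, not taken from any existing test):

# Subscription created while its publication does not exist yet.
$node_subscriber->safe_psql('postgres',
    "CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub");

# Any change written to WAL here precedes mypub's creation...
$node_publisher->safe_psql('postgres',
    "INSERT INTO unrelated_tab VALUES (1)");

# ...so when the walsender decodes it using the historic snapshot taken
# at that LSN, looking up mypub fails with "publication does not exist".
$node_publisher->safe_psql('postgres',
    "CREATE PUBLICATION mypub FOR TABLE replicated_tab");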

--
With Regards,
Amit Kapila.

#7 Tomas Vondra <tomas.vondra@enterprisedb.com>
In reply to: Amit Kapila (#6)
Re: Build-farm - intermittent error in 031_column_list.pl

On 5/20/22 05:58, Amit Kapila wrote:

On Fri, May 20, 2022 at 6:58 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

the apply worker will use the existing slot and replication origin
corresponding to the subscription. Now, it is possible that before
the restart the origin has not been updated and the WAL start location
points to a location prior to where PUBLICATION pub9 exists, which can
lead to such an error. Once this error occurs, the apply worker will
never be able to proceed and will always return the same error. Does
this make sense?

Wow. I hadn't thought along that line. That theory explains the silence
and makes sense even though I don't see LSN transitions that clearly
support it. I dimly remember a similar kind of problem...

Unless you or others see a different theory, this seems to be an
existing problem in logical replication which is manifested by this
test. If we just want to fix these test failures, we can create a new
subscription instead of altering the existing subscription to point to
the new publication.

If the above theory is correct then I think allowing the publisher to
catch up with "$node_publisher->wait_for_catchup('sub1');" before
ALTER SUBSCRIPTION should fix this problem. Because if before the
ALTER both publisher and subscriber are in sync, then the new
publication should be visible to the WALSender.

It looks right to me.

Let's wait for Tomas or others working in this area to share their thoughts.

Are we really querying the publications (in get_rel_sync_entry) using
the historical snapshot? I hadn't really realized this, but yeah, that
might explain the issue.

The new TAP test does ALTER SUBSCRIPTION ... SET PUBLICATION much more
often than any other test (there are ~15 calls, 12 of which are in this
new test). That might be why we haven't seen failures before. Or maybe
the existing tests simply are not vulnerable to this, because they
either do wait_for_catchup late enough or don't do any DML right before
executing SET PUBLICATION.

That time travel seems unintuitive, but it's the
(current) way it works.

I have thought about it but couldn't come up with a good way to change
the way it currently works. Moreover, I think it is easy to hit this
in other ways as well. Say, you first create a subscription with a
non-existent publication and then perform an operation on any unrelated
table on the publisher before creating the required publication; we
will hit exactly this problem of "publication does not exist". So I
think we may need to live with this behavior and write tests carefully.

Yeah, I think it pretty much requires ensuring the subscriber is fully
caught up with the publisher, otherwise ALTER SUBSCRIPTION may break the
replication in an unrecoverable way (actually, you can alter the
subscription and remove the publication again, right?).

But this is not just about tests, of course - the same issue applies to
regular replication. That's a bit unfortunate, so maybe we should think
about making this less fragile.

We might make sure the subscriber is not lagging (essentially the
wait_for_catchup) - which the users will have to do anyway (although
maybe they know the publisher is beyond the LSN where it was created).

The other option would be to detect such case, somehow - if you don't
see the publication yet, see if it exists in current snapshot, and then
maybe ignore this error. But that has other issues (the publication
might have been created and dropped, in which case you won't see it).
Also, we'd probably have to ignore RelationSyncEntry for a while, which
seems quite expensive.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#8 osumi.takamichi@fujitsu.com
In reply to: Amit Kapila (#4)
1 attachment(s)
RE: Build-farm - intermittent error in 031_column_list.pl

On Thursday, May 19, 2022 8:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, May 19, 2022 at 3:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, May 19, 2022 at 12:28 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

At Thu, 19 May 2022 14:26:56 +1000, Peter Smith <smithpb2250@gmail.com> wrote in

Hi hackers.

FYI, I saw that there was a recent Build-farm error on the "grison" machine [1]
[1] https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=grison&br=HEAD

The error happened during the "subscriptionCheck" phase in the TAP test
t/031_column_list.pl.
This test file was added by this [2] commit.
[2] https://github.com/postgres/postgres/commit/923def9a533a7d986acfb524139d8b9e5466d0a5

2022-04-17 00:16:04.278 CEST [293659][client backend][4/270:0][031_column_list.pl] LOG: statement: CREATE PUBLICATION pub9 FOR TABLE test_part_d (a) WITH (publish_via_partition_root = true);
2022-04-17 00:16:04.279 CEST [293659][client backend][:0][031_column_list.pl] LOG: disconnection: session time: 0:00:00.002 user=bf database=postgres host=[local]

"CREATE PUBLICATION pub9" is executed at 00:16:04.278 in process 293659,
then the session is disconnected. But the following request for the
same publication fails due to the absence of the publication.

2022-04-17 00:16:08.147 CEST [293856][walsender][3/0:0][sub1] STATEMENT: START_REPLICATION SLOT "sub1" LOGICAL 0/153DB88 (proto_version '3', publication_names '"pub9"')
2022-04-17 00:16:08.148 CEST [293856][walsender][3/0:0][sub1] ERROR: publication "pub9" does not exist

This happens after "ALTER SUBSCRIPTION sub1 SET PUBLICATION pub9". The
probable theory is that ALTER SUBSCRIPTION will lead to a restart of
the apply worker (which we can see in the LOGS as well) and, after the
restart, the apply worker will use the existing slot and replication
origin corresponding to the subscription. Now, it is possible that
before the restart the origin has not been updated and the WAL start
location points to a location prior to where PUBLICATION pub9 exists,
which can lead to such an error. Once this error occurs, the apply
worker will never be able to proceed and will always return the same
error. Does this make sense?

Unless you or others see a different theory, this seems to be an
existing problem in logical replication which is manifested by this
test. If we just want to fix these test failures, we can create a new
subscription instead of altering the existing subscription to point to
the new publication.

If the above theory is correct then I think allowing the publisher to
catch up with "$node_publisher->wait_for_catchup('sub1');" before
ALTER SUBSCRIPTION should fix this problem. Because if before the
ALTER both publisher and subscriber are in sync, then the new
publication should be visible to the WALSender.

Hi,

I've attached a patch for the fix proposed here.
First of all, thank you so much for helping me offlist, Amit-san.

I reproduced the failure (see the subscriber's log [1]) by commenting out
WalSndWaitForWal's call of WalSndKeepalive and running the test.
Commenting this out is intended to suppress the advance of the
confirmed_flush location after creating a publication.

In short, my understanding of how the bug happens is:
1. we execute 'create publication pubX' and create one publication.
2. 'alter subscription subY set publication pubX' makes the apply worker exit.
3. the relaunched apply worker searches for pubX, but the slot position
(confirmed_flush) doesn't get updated and points to some location before
the CREATE PUBLICATION on the publisher node.

Applying the attached patch has made the test pass.

[1]: the subscriber's log

2022-05-20 08:56:50.773 UTC [5153] 031_column_list.pl LOG: statement: ALTER SUBSCRIPTION sub1 SET PUBLICATION pub6
2022-05-20 08:56:50.801 UTC [5156] 031_column_list.pl LOG: statement: SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');
2022-05-20 08:56:50.846 UTC [5112] LOG: logical replication apply worker for subscription "sub1" will restart because of a parameter change
2022-05-20 08:56:50.915 UTC [5158] 031_column_list.pl LOG: statement: SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');
...
2022-05-20 08:56:51.257 UTC [5164] 031_column_list.pl LOG: statement: SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');
2022-05-20 08:56:51.353 UTC [5166] LOG: logical replication apply worker for subscription "sub1" has started
2022-05-20 08:56:51.366 UTC [5168] LOG: logical replication table synchronization worker for subscription "sub1", table "test_part_a" has started
2022-05-20 08:56:51.370 UTC [5171] 031_column_list.pl LOG: statement: SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');
2022-05-20 08:56:51.373 UTC [5166] ERROR: could not receive data from WAL stream: ERROR: publication "pub6" does not exist
CONTEXT: slot "sub1", output plugin "pgoutput", in the change callback, associated LSN 0/15C61B8
2022-05-20 08:56:51.374 UTC [4338] LOG: background worker "logical replication worker" (PID 5166) exited with exit code 1

Best Regards,
Takamichi Osumi

Attachments:

v1-0001-Fix-the-intermittent-Build-farm-failures-in-031_c.patch (application/octet-stream)
#9 Amit Kapila <amit.kapila16@gmail.com>
In reply to: Tomas Vondra (#7)
Re: Build-farm - intermittent error in 031_column_list.pl

On Fri, May 20, 2022 at 4:01 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 5/20/22 05:58, Amit Kapila wrote:

Are we really querying the publications (in get_rel_sync_entry) using
the historical snapshot?

Yes.

I hadn't really realized this, but yeah, that
might explain the issue.

The new TAP test does ALTER SUBSCRIPTION ... SET PUBLICATION much more
often than any other test (there are ~15 calls, 12 of which are in this
new test). That might be why we haven't seen failures before. Or maybe
the existing tests simply are not vulnerable to this,

Right, I have checked that the other cases are not vulnerable to this;
otherwise, I think we would have seen intermittent failures by now.
They don't seem to be doing DML before the creation of a publication,
or they create a subscription pointing to the same publication beforehand.

because they
either do wait_for_catchup late enough or don't do any DML right before
executing SET PUBLICATION.

That time travel seems unintuitive, but it's the
(current) way it works.

I have thought about it but couldn't come up with a good way to change
the way it currently works. Moreover, I think it is easy to hit this
in other ways as well. Say, you first create a subscription with a
non-existent publication and then perform an operation on any unrelated
table on the publisher before creating the required publication; we
will hit exactly this problem of "publication does not exist". So I
think we may need to live with this behavior and write tests carefully.

Yeah, I think it pretty much requires ensuring the subscriber is fully
caught up with the publisher, otherwise ALTER SUBSCRIPTION may break the
replication in an unrecoverable way (actually, you can alter the
subscription and remove the publication again, right?).

Right.

But this is not just about tests, of course - the same issue applies to
regular replication. That's a bit unfortunate, so maybe we should think
about making this less fragile.

Agreed, provided we find some reasonable solution.

We might make sure the subscriber is not lagging (essentially the
wait_for_catchup) - which the users will have to do anyway (although
maybe they know the publisher is beyond the LSN where it was created).

This won't work for the case mentioned above where we create a
subscription with non-existent publications, then perform DML and then
'CREATE PUBLICATION'.

The other option would be to detect such case, somehow - if you don't
see the publication yet, see if it exists in current snapshot, and then
maybe ignore this error. But that has other issues (the publication
might have been created and dropped, in which case you won't see it).

True, the dropped case would again be tricky to deal with and I think
we will end up publishing some operations which are performed before
the publication is even created.

Also, we'd probably have to ignore RelationSyncEntry for a while, which
seems quite expensive.

Yet another option could be that we continue using a historic snapshot
but ignore publications that are not found for the purpose of
computing RelSyncEntry attributes. We won't mark such an entry as
valid till all the publications are loaded without anything missing. I
think such cases in practice won't be enough to matter. This means we
won't publish operations on tables corresponding to that publication
till we find such a publication, and that seems okay.

--
With Regards,
Amit Kapila.

#10 Amit Kapila <amit.kapila16@gmail.com>
In reply to: Amit Kapila (#9)
1 attachment(s)
Re: Build-farm - intermittent error in 031_column_list.pl

On Sat, May 21, 2022 at 9:03 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, May 20, 2022 at 4:01 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

Also, we'd probably have to ignore RelationSyncEntry for a while, which
seems quite expensive.

Yet another option could be that we continue using a historic snapshot
but ignore publications that are not found for the purpose of
computing RelSyncEntry attributes. We won't mark such an entry as
valid till all the publications are loaded without anything missing. I
think such cases in practice won't be enough to matter. This means we
won't publish operations on tables corresponding to that publication
till we find such a publication, and that seems okay.

Attached, find the patch to show what I have in mind for this. Today,
we received a bug report with a similar symptom [1], and that should
also be fixed with this.

Thoughts?

[1]: /messages/by-id/CANWRaJyyD=9c1E2HdF-Tqfe7+vuCQnAkXd6+EFwxC0wM=313AA@mail.gmail.com

--
With Regards,
Amit Kapila.

Attachments:

skip_missing_pubs.1.patch (application/octet-stream)
#11 osumi.takamichi@fujitsu.com
In reply to: Amit Kapila (#10)
RE: Build-farm - intermittent error in 031_column_list.pl

On Tuesday, May 24, 2022 9:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sat, May 21, 2022 at 9:03 AM Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Fri, May 20, 2022 at 4:01 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

Also, we'd probably have to ignore RelationSyncEntry for a while,
which seems quite expensive.

Yet another option could be that we continue using a historic snapshot
but ignore publications that are not found for the purpose of
computing RelSyncEntry attributes. We won't mark such an entry as
valid till all the publications are loaded without anything missing. I
think such cases in practice won't be enough to matter. This means we
won't publish operations on tables corresponding to that publication
till we find such a publication, and that seems okay.

Attached, find the patch to show what I have in mind for this. Today,
we received a bug report with a similar symptom [1], and that should
also be fixed with this.

Thoughts?

Hi,

I agree with this direction.
I think this approach solves the issue fundamentally and is better than
the first approach of adding several wait_for_catchup calls to the test,
since with the first approach we would need to take care to avoid the
same issue whenever we write a new (similar) test, even after the
modification.

I've used the patch to check the things below.
1. The patch can be applied, and make check-world passes without failure.
2. HEAD with the patch applied passes all tests in src/test/subscription
(including 031_column_list.pl), after commenting out WalSndWaitForWal's
call of WalSndKeepalive.
3. The bug reported in the 'How is this possible "publication does not
exist"' thread is also fixed. FYI, after I executed the script's function,
I also performed an additional insert on the publisher, and it was
correctly replicated on the subscriber.

Best Regards,
Takamichi Osumi

#12 Kyotaro Horiguchi <horikyota.ntt@gmail.com>
In reply to: Amit Kapila (#10)
Re: Build-farm - intermittent error in 031_column_list.pl

At Tue, 24 May 2022 18:19:45 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in

On Sat, May 21, 2022 at 9:03 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, May 20, 2022 at 4:01 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

Also, we'd probably have to ignore RelationSyncEntry for a while, which
seems quite expensive.

Yet another option could be that we continue using a historic snapshot
but ignore publications that are not found for the purpose of
computing RelSyncEntry attributes. We won't mark such an entry as
valid till all the publications are loaded without anything missing. I
think such cases in practice won't be enough to matter. This means we
won't publish operations on tables corresponding to that publication
till we find such a publication, and that seems okay.

Attached, find the patch to show what I have in mind for this. Today,
we received a bug report with a similar symptom [1], and that should
also be fixed with this.

Thoughts?

[1] - /messages/by-id/CANWRaJyyD=9c1E2HdF-Tqfe7+vuCQnAkXd6+EFwxC0wM=313AA@mail.gmail.com

It does "fix" the case of [1]. But AFAIS
RelationSyncEntry.replicate_valid is only used to inhibit repeated
loading in get_rel_sync_entry and the function doesn't seem to be
assumed to return a invalid entry. (Since the flag is not checked
nowhere else.)

For example pgoutput_change does not check for the flag of the entry
returned from the function before uses it, which is not seemingly
safe. (I didn't check further, though)

Don't we need to explicitly avoid using invalid entries outside the
function?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#13 Amit Kapila <amit.kapila16@gmail.com>
In reply to: Kyotaro Horiguchi (#12)
1 attachment(s)
Re: Build-farm - intermittent error in 031_column_list.pl

On Wed, May 25, 2022 at 8:16 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

It does "fix" the case of [1]. But AFAIS
RelationSyncEntry.replicate_valid is only used to inhibit repeated
loading in get_rel_sync_entry and the function doesn't seem to be
assumed to return a invalid entry. (Since the flag is not checked
nowhere else.)

For example pgoutput_change does not check for the flag of the entry
returned from the function before uses it, which is not seemingly
safe. (I didn't check further, though)

Don't we need to explicitly avoid using invalid entries outside the
function?

We decide that based on pubactions in the callers, so even if the entry
is valid, it won't do anything. Actually, we don't need to avoid setting
the replicate_valid flag, as some of the publications for the table may
already be present. We can check if the publications_valid flag is set
while trying to validate the entry. Now, even if we don't find any
publications, the replicate_valid flag will be set but none of the
actions will be set, so it won't do anything in the caller. Is this
better than the previous approach?

--
With Regards,
Amit Kapila.

Attachments:

skip_missing_pubs.2.patch (application/octet-stream)
#14 Tomas Vondra <tomas.vondra@enterprisedb.com>
In reply to: Amit Kapila (#13)
Re: Build-farm - intermittent error in 031_column_list.pl

On 5/25/22 13:26, Amit Kapila wrote:

On Wed, May 25, 2022 at 8:16 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

It does "fix" the case of [1]. But AFAIS
RelationSyncEntry.replicate_valid is only used to inhibit repeated
loading in get_rel_sync_entry and the function doesn't seem to be
assumed to return a invalid entry. (Since the flag is not checked
nowhere else.)

For example pgoutput_change does not check for the flag of the entry
returned from the function before uses it, which is not seemingly
safe. (I didn't check further, though)

Don't we need to explicitly avoid using invalid entries outside the
function?

We decide that based on pubactions in the callers, so even if the entry
is valid, it won't do anything. Actually, we don't need to avoid setting
the replicate_valid flag, as some of the publications for the table may
already be present. We can check if the publications_valid flag is set
while trying to validate the entry. Now, even if we don't find any
publications, the replicate_valid flag will be set but none of the
actions will be set, so it won't do anything in the caller. Is this
better than the previous approach?

For the record, I'm not convinced this is the right way to fix the
issue, as it may easily mask the real problem.

We do silently ignore missing objects in various places, but only when
either requested or when it's obvious it's expected and safe to ignore.
But I'm not sure that applies here, in a clean way.

Imagine you have a subscriber using two publications p1 and p2, and
someone comes around and drops p1 by mistake. With the proposed patch,
the subscription will notice this, but it'll continue sending data
ignoring the missing publication. Yes, it will continue working, but
it's quite possible this breaks the subscriber, and it'd be better to
fail and stop replicating.

The other aspect I dislike is that we just stop caching publication
info, forcing us to reload it for every replicated change/row. So even
if dropping the publication happens not to "break" the subscriber (i.e.
the data makes sense), this may easily cause performance issues, lag in
the replication, and so on. And the users will have no idea why and/or
how to fix it, because we just do this silently.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#15 Tom Lane <tgl@sss.pgh.pa.us>
In reply to: Tomas Vondra (#14)
Re: Build-farm - intermittent error in 031_column_list.pl

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

The other aspect I dislike is that we just stop caching publication
info, forcing us to reload it for every replicated change/row.

Ouch --- that seems likely to be completely horrid. At the very least
I'd want to see some benchmark numbers before concluding we could
live with that.

regards, tom lane

#16 Amit Kapila <amit.kapila16@gmail.com>
In reply to: Tomas Vondra (#14)
Re: Build-farm - intermittent error in 031_column_list.pl

On Wed, May 25, 2022 at 6:54 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 5/25/22 13:26, Amit Kapila wrote:

On Wed, May 25, 2022 at 8:16 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

It does "fix" the case of [1]. But AFAIS
RelationSyncEntry.replicate_valid is only used to inhibit repeated
loading in get_rel_sync_entry and the function doesn't seem to be
assumed to return a invalid entry. (Since the flag is not checked
nowhere else.)

For example pgoutput_change does not check for the flag of the entry
returned from the function before uses it, which is not seemingly
safe. (I didn't check further, though)

Don't we need to explicitly avoid using invalid entries outside the
function?

We decide that based on pubactions in the callers, so even if the entry
is valid, it won't do anything. Actually, we don't need to avoid setting
the replicate_valid flag, as some of the publications for the table may
already be present. We can check if the publications_valid flag is set
while trying to validate the entry. Now, even if we don't find any
publications, the replicate_valid flag will be set but none of the
actions will be set, so it won't do anything in the caller. Is this
better than the previous approach?

For the record, I'm not convinced this is the right way to fix the
issue, as it may easily mask the real problem.

We do silently ignore missing objects in various places, but only when
either requested or when it's obvious it's expected and safe to ignore.
But I'm not sure that applies here, in a clean way.

Imagine you have a subscriber using two publications p1 and p2, and
someone comes around and drops p1 by mistake. With the proposed patch,
the subscription will notice this, but it'll continue sending data
ignoring the missing publication. Yes, it will continue working, but
it's quite possible this breaks the subscriber, and it'd be better to
fail and stop replicating.

Ideally, shouldn't we disallow dropping a publication in cases where
it is part of some subscription? I know it will be tricky because some
subscriptions could be disabled.

The other aspect I dislike is that we just stop caching publication
info, forcing us to reload it for every replicated change/row. So even
if dropping the publication happens not to "break" the subscriber (i.e.
the data makes sense), this may easily cause performance issues, lag in
the replication, and so on. And the users will have no idea why and/or
how to fix it, because we just do this silently.

Yeah, it is true that if there are missing publications, it needs to
reload all the publication info again, unless we build a mechanism to
change the existing cached entry by loading only the required info. The
other thing we could do here is to LOG the info for missing
publications to make users aware of the fact. I think we can also
introduce a new option while defining/altering a subscription to
indicate whether to continue on a missing publication or not; that way,
by default we will stop replication as we do now, but users will have
a way to keep replication moving.
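If such an option were added, usage might look like the sketch below;
the option name is purely hypothetical and does not exist today:

# Hypothetical subscription option -- not an existing parameter. The
# default (false) would keep today's behavior of stopping replication.
$node_subscriber->safe_psql('postgres',
    "ALTER SUBSCRIPTION sub1 SET (ignore_missing_publications = true)");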

BTW, what other options do we have to fix the cases where
replication is broken (or the user has no clue how to proceed), as in
the case we are discussing here or the other case the OP reported on
pgsql-bugs [1]?

[1]: /messages/by-id/CANWRaJyyD=9c1E2HdF-Tqfe7+vuCQnAkXd6+EFwxC0wM=313AA@mail.gmail.com

--
With Regards,
Amit Kapila.

#17 osumi.takamichi@fujitsu.com
In reply to: Amit Kapila (#16)
RE: Build-farm - intermittent error in 031_column_list.pl

On Thursday, May 26, 2022 11:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, May 25, 2022 at 6:54 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 5/25/22 13:26, Amit Kapila wrote:

On Wed, May 25, 2022 at 8:16 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

It does "fix" the case of [1]. But AFAIS
RelationSyncEntry.replicate_valid is only used to inhibit repeated
loading in get_rel_sync_entry and the function doesn't seem to be
assumed to return a invalid entry. (Since the flag is not checked
nowhere else.)

For example pgoutput_change does not check for the flag of the
entry returned from the function before uses it, which is not
seemingly safe. (I didn't check further, though)

Don't we need to explicitly avoid using invalid entries outside the
function?

We decide that based on pubactions in the callers, so even if the entry
is valid, it won't do anything. Actually, we don't need to avoid setting
the replicate_valid flag, as some of the publications for the table may
already be present. We can check if the publications_valid flag is set
while trying to validate the entry. Now, even if we don't find any
publications, the replicate_valid flag will be set but none of the
actions will be set, so it won't do anything in the caller. Is this
better than the previous approach?

For the record, I'm not convinced this is the right way to fix the
issue, as it may easily mask the real problem.

We do silently ignore missing objects in various places, but only when
either requested or when it's obvious it's expected and safe to ignore.
But I'm not sure that applies here, in a clean way.

Imagine you have a subscriber using two publications p1 and p2, and
someone comes around and drops p1 by mistake. With the proposed patch,
the subscription will notice this, but it'll continue sending data
ignoring the missing publication. Yes, it will continue working, but
it's quite possible this breaks the subscriber, and it'd be better to
fail and stop replicating.

Ideally, shouldn't we disallow dropping a publication in cases where
it is part of some subscription? I know it will be tricky because some
subscriptions could be disabled.

The other aspect I dislike is that we just stop caching publication
info, forcing us to reload it for every replicated change/row. So even
if dropping the publication happens not to "break" the subscriber (i.e.
the data makes sense), this may easily cause performance issues, lag
in the replication, and so on. And the users will have no idea why
and/or how to fix it, because we just do this silently.

Yeah, it is true that if there are missing publications, it needs to
reload all the publication info again, unless we build a mechanism to
change the existing cached entry by loading only the required info. The
other thing we could do here is to LOG the info for missing
publications to make users aware of the fact. I think we can also
introduce a new option while defining/altering a subscription to
indicate whether to continue on a missing publication or not; that way,
by default we will stop replication as we do now, but users will have
a way to keep replication moving.

BTW, what other options do we have to fix the cases where replication
is broken (or the user has no clue how to proceed), as in the case we
are discussing here or the other case the OP reported on pgsql-bugs [1]?

Hi,

FYI, I've noticed that after the last report by Peter-san
we've gotten the same errors on the Build Farm again.
We need to keep discussing this to reach a conclusion.

1. Details for system "xenodermus" failure at stage subscriptionCheck, snapshot taken 2022-05-31 13:00:04
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=xenodermus&dt=2022-05-31%2013%3A00%3A04

2. Details for system "phycodurus" failure at stage subscriptionCheck, snapshot taken 2022-05-26 17:30:04
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2022-05-26%2017%3A30%3A04

Best Regards,
Takamichi Osumi

#18 Alexander Lakhin <exclusion@gmail.com>
In reply to: osumi.takamichi@fujitsu.com (#17)
1 attachment(s)
Re: Build-farm - intermittent error in 031_column_list.pl

Hello hackers,

01.06.2022 05:06, osumi.takamichi@fujitsu.com wrote:

FYI, I've noticed that after the last report by Peter-san
we've gotten the same errors on the Build Farm again.
We need to keep discussing this to reach a conclusion.

1. Details for system "xenodermus" failure at stage subscriptionCheck, snapshot taken 2022-05-31 13:00:04
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=xenodermus&dt=2022-05-31%2013%3A00%3A04

2. Details for system "phycodurus" failure at stage subscriptionCheck, snapshot taken 2022-05-26 17:30:04
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2022-05-26%2017%3A30%3A04

I think I've discovered what causes that test failure.
When playing with bgwriter while researching [1], I made it more
aggressive with a dirty hack:
-#define LOG_SNAPSHOT_INTERVAL_MS 15000
+#define LOG_SNAPSHOT_INTERVAL_MS 1

          rc = WaitLatch(MyLatch,
                         WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
-                       BgWriterDelay /* ms */ , WAIT_EVENT_BGWRITER_MAIN);
+                       1 /* ms */ , WAIT_EVENT_BGWRITER_MAIN);

With this modification, I ran `make check -C src/test/subscription` in a
loop and observed the same failure of 031_column_list as discussed here.
With log_min_messages = DEBUG2, I see that in a failed case there is no
'"sub1" has now caught up' message (as Amit and Tomas guessed upthread).
(See the full logs from one successful and two failed runs attached; I
added some extra logging and enabled wal_debug to better understand
what's going on here.)

If we look at the failures that occurred in the buildfarm:
The first two are from the past:
1)
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=xenodermus&dt=2022-05-31%2013%3A00%3A04

2022-05-26 20:39:23.828 CEST [276284][postmaster][:0][] LOG: starting PostgreSQL 15beta1 on x86_64-pc-linux-gnu,
compiled by gcc-10 (Debian 10.3.0-15) 10.3.0, 64-bit
...
2022-05-26 20:39:39.768 CEST [277545][walsender][3/0:0][sub1] ERROR:  publication "pub6" does not exist

2)
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2022-05-26%2017%3A30%3A04

2022-05-31 16:33:25.506 CEST [3223685][postmaster][:0][] LOG: starting PostgreSQL 15beta1 on x86_64-pc-linux-gnu,
compiled by clang version 6.0.1 , 64-bit
...
2022-05-31 16:33:41.114 CEST [3224511][walsender][3/0:0][sub1] ERROR:  publication "pub6" does not exist

The other two are from the present:
3)
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=kestrel&dt=2023-12-14%2009%3A14%3A52&stg=subscription-check

2023-12-14 09:26:12.523 UTC [1144979][postmaster][:0] LOG:  starting PostgreSQL 16.1 on x86_64-pc-linux-gnu, compiled by
Debian clang version 13.0.1-11+b2, 64-bit
...
2023-12-14 09:26:28.663 UTC [1157936][walsender][3/0:0] ERROR: publication "pub6" does not exist

4)
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2023-11-17%2018%3A28%3A24

2023-11-17 18:31:13.594 UTC [200939] LOG:  starting PostgreSQL 17devel on x86_64-linux, compiled by clang-14.0.6, 64-bit
...
2023-11-17 18:31:29.292 UTC [222103] sub1 ERROR:  publication "pub6" does not exist

we can see that all the failures occurred approximately 16 seconds
after server start. And that is very close to the predefined
LOG_SNAPSHOT_INTERVAL_MS.

[1]: /messages/by-id/6f85667e-5754-5d35-dbf1-c83fe08c1e48@gmail.com

Best regards,
Alexander

Attachments:

031_column_list-runs.7z (application/x-7z-compressed)