Why is subscription/t/031_column_list.pl failing so much?
If you look at the buildfarm's failures page and filter down to
just subscriptionCheck failures, what you find is that all of the
last 6 such failures are in 031_column_list.pl:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-02-02%2019%3A33%3A16
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-02-02%2011%3A21%3A44
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-02-01%2020%3A34%3A29
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-02-01%2016%3A57%3A14
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-01-31%2022%3A18%3A24
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-01-30%2011%3A29%3A23
There are some further back too:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2023-11-17%2018%3A28%3A24
but this definitely got way more common in the last few days.
Digging down into the logs, these all look pretty similar. Somehow
things get into a state where replication connections fail with
the publisher reporting "publication does not exist":
2024-02-02 19:42:23.187 UTC [1631708][not initialized][:0] LOG: connection received: host=[local]
2024-02-02 19:42:23.189 UTC [1631708][walsender][4/287:0] LOG: connection authenticated: user="bf" method=trust (/home/bf/bf-build/tamandua/HEAD/pgsql.build/testrun/subscription/031_column_list/data/t_031_column_list_publisher_data/pgdata/pg_hba.conf:117)
2024-02-02 19:42:23.189 UTC [1631708][walsender][4/287:0] LOG: replication connection authorized: user=bf application_name=sub1
2024-02-02 19:42:23.214 UTC [1631708][walsender][4/288:0] LOG: statement: SELECT pg_catalog.set_config('search_path', '', false);
2024-02-02 19:42:23.226 UTC [1631708][walsender][4/0:0] LOG: received replication command: IDENTIFY_SYSTEM
2024-02-02 19:42:23.226 UTC [1631708][walsender][4/0:0] STATEMENT: IDENTIFY_SYSTEM
2024-02-02 19:42:23.226 UTC [1631708][walsender][4/0:0] LOG: received replication command: START_REPLICATION SLOT "sub1" LOGICAL 0/15BCDD0 (proto_version '4', origin 'any', publication_names '"pub7"')
2024-02-02 19:42:23.226 UTC [1631708][walsender][4/0:0] STATEMENT: START_REPLICATION SLOT "sub1" LOGICAL 0/15BCDD0 (proto_version '4', origin 'any', publication_names '"pub7"')
2024-02-02 19:42:23.226 UTC [1631708][walsender][4/0:0] LOG: acquired logical replication slot "sub1"
2024-02-02 19:42:23.226 UTC [1631708][walsender][4/0:0] STATEMENT: START_REPLICATION SLOT "sub1" LOGICAL 0/15BCDD0 (proto_version '4', origin 'any', publication_names '"pub7"')
2024-02-02 19:42:23.242 UTC [1631708][walsender][4/0:0] LOG: starting logical decoding for slot "sub1"
2024-02-02 19:42:23.242 UTC [1631708][walsender][4/0:0] DETAIL: Streaming transactions committing after 0/15BCDD0, reading WAL from 0/15BCDD0.
2024-02-02 19:42:23.242 UTC [1631708][walsender][4/0:0] STATEMENT: START_REPLICATION SLOT "sub1" LOGICAL 0/15BCDD0 (proto_version '4', origin 'any', publication_names '"pub7"')
2024-02-02 19:42:23.243 UTC [1631708][walsender][4/0:0] LOG: logical decoding found consistent point at 0/15BCDD0
2024-02-02 19:42:23.243 UTC [1631708][walsender][4/0:0] DETAIL: There are no running transactions.
2024-02-02 19:42:23.243 UTC [1631708][walsender][4/0:0] STATEMENT: START_REPLICATION SLOT "sub1" LOGICAL 0/15BCDD0 (proto_version '4', origin 'any', publication_names '"pub7"')
2024-02-02 19:42:23.244 UTC [1631708][walsender][4/0:0] ERROR: publication "pub7" does not exist
2024-02-02 19:42:23.244 UTC [1631708][walsender][4/0:0] CONTEXT: slot "sub1", output plugin "pgoutput", in the change callback, associated LSN 0/15C7698
2024-02-02 19:42:23.244 UTC [1631708][walsender][4/0:0] STATEMENT: START_REPLICATION SLOT "sub1" LOGICAL 0/15BCDD0 (proto_version '4', origin 'any', publication_names '"pub7"')
2024-02-02 19:42:23.244 UTC [1631708][walsender][4/0:0] LOG: released logical replication slot "sub1"
2024-02-02 19:42:23.834 UTC [1631708][walsender][:0] LOG: disconnection: session time: 0:00:00.647 user=bf database=postgres host=[local]
and then we just repeat that until the test times out. It fails at
different points in the test script (hence, different publication
names), but the pattern looks about the same.
I don't see anything that 031_column_list.pl is doing that is much
different from other subscription tests, so why is it the only one
failing? And more to the point, what's going wrong exactly?
I am suspicious that this somehow represents a failure of the
historical catalog decoding logic, but I don't see how that theory
explains this only breaking in one test script.
regards, tom lane
On Fri, Feb 02, 2024 at 05:07:14PM -0500, Tom Lane wrote:
If you look at the buildfarm's failures page and filter down to
just subscriptionCheck failures, what you find is that all of the
last 6 such failures are in 031_column_list.pl:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-02-02%2019%3A33%3A16
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-02-02%2011%3A21%3A44
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-02-01%2020%3A34%3A29
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-02-01%2016%3A57%3A14
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-01-31%2022%3A18%3A24
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-01-30%2011%3A29%3A23
There are some further back too:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2023-11-17%2018%3A28%3A24
but this definitely got way more common in the last few days.
I don't see anything that 031_column_list.pl is doing that is much
different from other subscription tests, so why is it the only one
failing? And more to the point, what's going wrong exactly?
I don't know, but
/messages/by-id/16d6d9cc-f97d-0b34-be65-425183ed3721@gmail.com
reported a replacement BgWriterDelay value reproducing it. That hasn't
reproduced it in ~10 runs on my machine, though.
I am suspicious that this somehow represents a failure of the
historical catalog decoding logic, but I don't see how that theory
explains this only breaking in one test script.
On Fri, Feb 02, 2024 at 02:30:03PM -0800, Noah Misch wrote:
On Fri, Feb 02, 2024 at 05:07:14PM -0500, Tom Lane wrote:
If you look at the buildfarm's failures page and filter down to
just subscriptionCheck failures, what you find is that all of the
last 6 such failures are in 031_column_list.pl:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-02-02%2019%3A33%3A16
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-02-02%2011%3A21%3A44
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-02-01%2020%3A34%3A29
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-02-01%2016%3A57%3A14
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-01-31%2022%3A18%3A24
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-01-30%2011%3A29%3A23
There are some further back too:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2023-11-17%2018%3A28%3A24
but this definitely got way more common in the last few days.
I don't see anything that 031_column_list.pl is doing that is much
different from other subscription tests, so why is it the only one
failing? And more to the point, what's going wrong exactly?
I don't know, but
/messages/by-id/16d6d9cc-f97d-0b34-be65-425183ed3721@gmail.com
reported a replacement BgWriterDelay value reproducing it.
Correction: the recipe changes LOG_SNAPSHOT_INTERVAL_MS in addition to
BgWriterDelay.
That hasn't reproduced it in ~10 runs on my machine, though.
After 207 successes, it did fail once for me.
I am suspicious that this somehow represents a failure of the
historical catalog decoding logic, but I don't see how that theory
explains this only breaking in one test script.
Noah Misch <noah@leadboat.com> writes:
On Fri, Feb 02, 2024 at 05:07:14PM -0500, Tom Lane wrote:
If you look at the buildfarm's failures page and filter down to
just subscriptionCheck failures, what you find is that all of the
last 6 such failures are in 031_column_list.pl:
...
I don't see anything that 031_column_list.pl is doing that is much
different from other subscription tests, so why is it the only one
failing? And more to the point, what's going wrong exactly?
I don't know, but
/messages/by-id/16d6d9cc-f97d-0b34-be65-425183ed3721@gmail.com
reported a replacement BgWriterDelay value reproducing it. That hasn't
reproduced it in ~10 runs on my machine, though.
Ah, thanks for that link. I like the theory proposed in that thread
that the walsender is starting up at an LSN somewhere before where
the publication is created. I'm tempted to add some more queries to
the test script to see if that can be proven.
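The suspected sequence can be sketched in the style of the TAP tests themselves (a sketch only; the table name and exact statements are hypothetical, not copied from 031_column_list.pl):

```perl
# Publisher creates the publication and immediately makes a change.
$node_publisher->safe_psql('postgres',
	"CREATE PUBLICATION pub7 FOR TABLE test_tab");
$node_publisher->safe_psql('postgres',
	"INSERT INTO test_tab VALUES (1)");

# The subscription is repointed right away.  The restarted walsender
# resumes from the slot's restart_lsn, which may lie *before* the
# CREATE PUBLICATION above; when decoding reaches the INSERT, the
# historic catalog snapshot predates pub7, producing
#   ERROR: publication "pub7" does not exist
$node_subscriber->safe_psql('postgres',
	"ALTER SUBSCRIPTION sub1 SET PUBLICATION pub7");
```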
I'm still wondering how come the failure seems to have suddenly gotten
way more common. The only changes that are in vaguely-related places
and fit the time frame are Amit's 732924043 and 776621a5e, but I sure
don't see a connection.
regards, tom lane
Hello Tom and Noah,
03.02.2024 04:24, Tom Lane wrote:
I'm still wondering how come the failure seems to have suddenly gotten
way more common. The only changes that are in vaguely-related places
and fit the time frame are Amit's 732924043 and 776621a5e, but I sure
don't see a connection.
I think the failure rate increased because tamandua, calliphoridae,
flaviventris, and kestrel were switched from make to meson recently.
The last `make` builds for them:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-01-31%2016%3A51%3A31
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-01-31%2016%3A51%3A38
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-01-31%2016%3A52%3A37
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-01-31%2016%3A51%3A53
and since that switch the 031_column_list duration increased significantly,
e.g., on the last tamandua `make` run it executed for 7 seconds, but
successful `meson` runs give much longer duration:
280/283 postgresql:subscription / subscription/031_column_list OK 38.27s 36 subtests passed
280/283 postgresql:subscription / subscription/031_column_list OK 126.13s 36 subtests passed
280/283 postgresql:subscription / subscription/031_column_list OK 31.93s 36 subtests passed
279/283 postgresql:subscription / subscription/031_column_list OK 99.76s 36 subtests passed
So, looking at the tamandua's failure log, I see:
2024-02-02 19:41:19.750 UTC [1579219][postmaster][:0] LOG: starting PostgreSQL 17devel on x86_64-linux, compiled by
gcc-12.3.0, 64-bit
...
2024-02-02 19:42:19.973 UTC [1629333][client backend][4/213:0] LOG: statement: ALTER SUBSCRIPTION sub1 SET PUBLICATION pub7
2024-02-02 19:42:20.131 UTC [1625765][logical replication apply worker][3/122:0] LOG: logical replication worker for
subscription "sub1" will restart because of a parameter change
2024-02-02 19:42:20.137 UTC [1629333][client backend][:0] LOG: disconnection: session time: 0:00:00.212 user=bf
database=postgres host=[local]
2024-02-02 19:42:20.191 UTC [1629535][logical replication apply worker][3/124:0] LOG: logical replication apply worker
for subscription "sub1" has started
...
2024-02-02 19:42:20.445 UTC [1629535][logical replication apply worker][3/0:0] ERROR: could not receive data from WAL
stream: ERROR: publication "pub7" does not exist
CONTEXT: slot "sub1", output plugin "pgoutput", in the change callback, associated LSN 0/15C7698
(The interval between subscriber start and the error is ~ 4 * 15 seconds.)
Thus it still may be explained by bgwriter activity, though perhaps
autovacuum/checkpointer can add something as well.
Best regards,
Alexander
On Sat, Feb 3, 2024 at 12:30 PM Alexander Lakhin <exclusion@gmail.com> wrote:
03.02.2024 04:24, Tom Lane wrote:
I'm still wondering how come the failure seems to have suddenly gotten
way more common. The only changes that are in vaguely-related places
and fit the time frame are Amit's 732924043 and 776621a5e, but I sure
don't see a connection.
I think the failure rate increased because tamandua, calliphoridae,
flaviventris, and kestrel were switched from make to meson recently.
If this can be reproduced frequently then we can even try to test the
patch in that thread by Osumi-San [1] (I haven't tested that it
applies cleanly but shouldn't be difficult to make it work) based on
the theory that walsender is starting up at an LSN somewhere before
where the publication is created.
[1]: /messages/by-id/TYCPR01MB83737A68CD5D554EA82BD7B9EDD39@TYCPR01MB8373.jpnprd01.prod.outlook.com
--
With Regards,
Amit Kapila.
Hello Amit,
05.02.2024 15:20, Amit Kapila wrote:
If this can be reproduced frequently then we can even try to test the
patch in that thread by Osumi-San [1] (I haven't tested that it
applies cleanly but shouldn't be difficult to make it work) based on
the theory that walsender is starting up at an LSN somewhere before
where the publication is created.
[1] - /messages/by-id/TYCPR01MB83737A68CD5D554EA82BD7B9EDD39@TYCPR01MB8373.jpnprd01.prod.outlook.com
Yes, with the aforementioned modification of bgwriter.c and when running
20 tests in parallel, I got failures on iterations 20, 3, 21, ..., but with the
updated Osumi-San's patch (which adds wait_for_catchup('sub1') before every
ALTER SUBSCRIPTION sub1 SET PUBLICATION ...) applied, 300 iterations ran
with no failures.
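For reference, the stabilization pattern in the updated patch amounts to something like this around each publication switch (a sketch; the actual call sites are in 031_column_list.pl, and the table definition here is hypothetical):

```perl
$node_publisher->safe_psql('postgres',
	"CREATE PUBLICATION pub7 FOR TABLE test_tab");

# Wait until sub1's walsender has replayed all WAL written so far, so
# its decoding position is past the publication's creation.
$node_publisher->wait_for_catchup('sub1');

$node_subscriber->safe_psql('postgres',
	"ALTER SUBSCRIPTION sub1 SET PUBLICATION pub7");
```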
Best regards,
Alexander
Attachments:
v2-0001-Fix-the-intermittent-Build-farm-failures-in-031_c.patch (text/x-patch)
On Tue, Feb 6, 2024 at 8:30 AM Alexander Lakhin <exclusion@gmail.com> wrote:
05.02.2024 15:20, Amit Kapila wrote:
If this can be reproduced frequently then we can even try to test the
patch in that thread by Osumi-San [1] (I haven't tested that it
applies cleanly but shouldn't be difficult to make it work) based on
the theory that walsender is starting up at an LSN somewhere before
where the publication is created.
[1] - /messages/by-id/TYCPR01MB83737A68CD5D554EA82BD7B9EDD39@TYCPR01MB8373.jpnprd01.prod.outlook.com
Yes, with the aforementioned modification of bgwriter.c and when running
20 tests in parallel, I got failures on iterations 20, 3, 21, ..., but with the
updated Osumi-San's patch (which adds wait_for_catchup('sub1') before every
ALTER SUBSCRIPTION sub1 SET PUBLICATION ...) applied, 300 iterations ran
with no failures.
cool, is it possible to see whether this patch changes the runtime of
this test in any noticeable way?
--
With Regards,
Amit Kapila.
06.02.2024 09:48, Amit Kapila wrote:
cool, is it possible to see whether this patch changes the runtime of
this test in any noticeable way?
Yes, unfortunately it does.
I've measured duration of 100 tests runs without the patch (with pristine
bgwriter and with NO_TEMP_INSTALL):
real 6m46,031s
real 6m52,406s
real 6m51,014s
But with the patched test, I've got:
real 9m39,872s
real 9m40,044s
real 9m38,236s
(nearly 2 seconds increase per one test run)
Under Valgrind, the original test run takes:
Files=1, Tests=36, 334 wallclock secs ( 0.02 usr 0.00 sys + 163.14 cusr 7.98 csys = 171.14 CPU)
But the patched one:
Files=1, Tests=36, 368 wallclock secs ( 0.02 usr 0.00 sys + 182.16 cusr 8.90 csys = 191.08 CPU)
(30 seconds increase)
Maybe the more CPU-efficient solution would be disabling bgwriter, as was
proposed in another discussion of test instability:
/messages/by-id/ZaTxhjnPygOdosJ4@ip-10-97-1-34.eu-west-3.compute.internal
Though I think that devising a way to control bgwriter may take more time
than we can afford given the current 031 failure rate on the buildfarm
(17 failures for the last 3 days).
Best regards,
Alexander
On Tue, Feb 6, 2024 at 2:30 PM Alexander Lakhin <exclusion@gmail.com> wrote:
06.02.2024 09:48, Amit Kapila wrote:
cool, is it possible to see whether this patch changes the runtime of
this test in any noticeable way?
Yes, unfortunately it does.
I've measured duration of 100 tests runs without the patch (with pristine
bgwriter and with NO_TEMP_INSTALL):
real 6m46,031s
real 6m52,406s
real 6m51,014s
But with the patched test, I've got:
real 9m39,872s
real 9m40,044s
real 9m38,236s
(nearly 2 seconds increase per one test run)
Under Valgrind, the original test run takes:
Files=1, Tests=36, 334 wallclock secs ( 0.02 usr 0.00 sys + 163.14 cusr 7.98 csys = 171.14 CPU)
But the patched one:
Files=1, Tests=36, 368 wallclock secs ( 0.02 usr 0.00 sys + 182.16 cusr 8.90 csys = 191.08 CPU)
(30 seconds increase)
Yeah, I was worried about that. The other idea I had previously
considered was to change Alter Subscription to Drop+Create Subscription.
That should also help in bringing stability without losing any
functionality.
--
With Regards,
Amit Kapila.
Amit Kapila <amit.kapila16@gmail.com> writes:
Yeah, I was worried about that. The other idea I have previously
thought was to change Alter Subscription to Drop+Create Subscription.
That should also help in bringing stability without losing any
functionality.
Hm, why would that fix it?
More to the point, aren't these proposals just band-aids that
would stabilize the test without fixing the actual problem?
The same thing is likely to happen to people in the field,
unless we do something drastic like removing ALTER SUBSCRIPTION.
regards, tom lane
I wrote:
More to the point, aren't these proposals just band-aids that
would stabilize the test without fixing the actual problem?
The same thing is likely to happen to people in the field,
unless we do something drastic like removing ALTER SUBSCRIPTION.
I've been able to make the 031_column_list.pl failure pretty
reproducible by adding a delay in walsender, as attached.
While I'm not too familiar with this code, it definitely does appear
that the new walsender is told to start up at an LSN before the
creation of the publication, and then if it needs to decide whether
to stream a particular data change before it's reached that creation,
kaboom!
I read and understood the upthread worries about it not being
a great idea to ignore publication lookup failures, but I really
don't see that we have much choice. As an example, if a subscriber
is humming along reading publication pub1, and then someone
drops and then recreates pub1 on the publisher, I don't think that
the subscriber will be able to advance through that gap if there
are any operations within it that require deciding if they should
be streamed. (That is, contrary to Amit's expectation that
DROP/CREATE would mask the problem, I suspect it will instead turn
it into a hard failure. I've not experimented though.)
BTW, this same change breaks two other subscription tests:
015_stream.pl and 022_twophase_cascade.pl.
The symptoms are different (no "publication does not exist" errors),
so maybe these are just test problems not fundamental weaknesses.
But "replication falls over if the walsender is slow" isn't
something I'd call acceptable.
regards, tom lane
Attachments:
hack-add-delay-in-walsender-loop.patch (text/x-diff)
On Tue, 6 Feb 2024 at 08:30, Alexander Lakhin <exclusion@gmail.com> wrote:
Hello Amit,
05.02.2024 15:20, Amit Kapila wrote:
If this can be reproduced frequently then we can even try to test the
patch in that thread by Osumi-San [1] (I haven't tested that it
applies cleanly but shouldn't be difficult to make it work) based on
the theory that walsender is starting up at an LSN somewhere before
where the publication is created.
[1] - /messages/by-id/TYCPR01MB83737A68CD5D554EA82BD7B9EDD39@TYCPR01MB8373.jpnprd01.prod.outlook.com
Yes, with the aforementioned modification of bgwriter.c and when running
20 tests in parallel, I got failures on iterations 20, 3, 21, ..., but with the
updated Osumi-San's patch (which adds wait_for_catchup('sub1') before every
ALTER SUBSCRIPTION sub1 SET PUBLICATION ...) applied, 300 iterations ran
with no failures.
I was able to reproduce the issue with the patch changes suggested for
bgwriter, but for me it failed only on the 287th run.
Then I ran the test 1000 times with the test changes shared at [1],
and the test passed all 1000 times.
I measured the test execution time (averaged over 10 runs) and found
that it takes about 1.2 seconds longer with the changes:
Without patch: 8.454 seconds
With test change patch: 9.672 seconds
For the test execution comparison I used a machine with a total memory
of 755.536 GB, 120 CPUs, and the RHEL 7 operating system.
[1]: /messages/by-id/e6ce3cf7-4025-f129-e3ac-0f778469f720@gmail.com
Regards,
Vignesh
On Tue, Feb 6, 2024 at 8:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Amit Kapila <amit.kapila16@gmail.com> writes:
Yeah, I was worried about that. The other idea I have previously
thought was to change Alter Subscription to Drop+Create Subscription.
That should also help in bringing stability without losing any
functionality.
Hm, why would that fix it?
Because for new subscriptions, we will start reading WAL from the
latest WAL insert pointer on the publisher which will be after the
point where publication is created.
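In test terms, the idea would look roughly like this (a sketch, not a committed change; $publisher_connstr stands in for the publisher's connection string):

```perl
# Recreating the subscription makes the publisher build a fresh slot
# whose start point is the current WAL insert position, i.e. after the
# just-executed CREATE PUBLICATION, so decoding never uses a catalog
# snapshot that predates the publication.
$node_subscriber->safe_psql('postgres',
	"DROP SUBSCRIPTION sub1");
$node_subscriber->safe_psql('postgres',
	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' "
	. "PUBLICATION pub7");
```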
More to the point, aren't these proposals just band-aids that
would stabilize the test without fixing the actual problem?
Yes, but OTOH, this behavior has existed since the beginning of logical
replication. This particular test has just exposed it, so keeping the BF
failing on this particular test doesn't sound like the best way to
remember it.
--
With Regards,
Amit Kapila.
On Wed, Feb 7, 2024 at 2:06 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wrote:
More to the point, aren't these proposals just band-aids that
would stabilize the test without fixing the actual problem?
The same thing is likely to happen to people in the field,
unless we do something drastic like removing ALTER SUBSCRIPTION.
I've been able to make the 031_column_list.pl failure pretty
reproducible by adding a delay in walsender, as attached.
While I'm not too familiar with this code, it definitely does appear
that the new walsender is told to start up at an LSN before the
creation of the publication, and then if it needs to decide whether
to stream a particular data change before it's reached that creation,
kaboom!
I read and understood the upthread worries about it not being
a great idea to ignore publication lookup failures, but I really
don't see that we have much choice. As an example, if a subscriber
is humming along reading publication pub1, and then someone
drops and then recreates pub1 on the publisher, I don't think that
the subscriber will be able to advance through that gap if there
are any operations within it that require deciding if they should
be streamed.
Right. One idea to address those worries was to have a new
subscription option like ignore_nonexistant_pubs (or some better name
for such an option). The 'true' value of this new option means that we
will ignore the publication lookup failures and continue replication,
the 'false' means give an error as we are doing now. If we agree that
such an option is useful or at least saves us in some cases as
discussed in another thread [1], we can keep the default value as true
so that users don't face such errors by default and also have a way to
go back to current behavior.
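If such an option existed, usage might look like this (purely hypothetical syntax; only the option name comes from the proposal above, and no such option exists today):

```perl
# Opting back into today's strict behavior under the proposed knob:
$node_subscriber->safe_psql('postgres',
	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' "
	. "PUBLICATION pub7 WITH (ignore_nonexistant_pubs = false)");
```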
(That is, contrary to Amit's expectation that
DROP/CREATE would mask the problem, I suspect it will instead turn
it into a hard failure. I've not experimented though.)
This is not contrary because I was suggesting to DROP/CREATE
Subscription whereas you are talking of drop and recreate of
Publication.
BTW, this same change breaks two other subscription tests:
015_stream.pl and 022_twophase_cascade.pl.
The symptoms are different (no "publication does not exist" errors),
so maybe these are just test problems not fundamental weaknesses.
As per the initial analysis, this is because those cases have somewhat
larger transactions (more than 64kB) under test so it just times out
waiting for all the data to be replicated. We will do further analysis
and share the findings.
--
With Regards,
Amit Kapila.
On Wed, 7 Feb 2024 at 15:21, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Feb 6, 2024 at 8:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Amit Kapila <amit.kapila16@gmail.com> writes:
Yeah, I was worried about that. The other idea I have previously
thought was to change Alter Subscription to Drop+Create Subscription.
That should also help in bringing stability without losing any
functionality.
Hm, why would that fix it?
Because for new subscriptions, we will start reading WAL from the
latest WAL insert pointer on the publisher which will be after the
point where publication is created.
I was able to reproduce the issue consistently with the changes shared
by Tom Lane at [1].
I have made changes to change ALTER SUBSCRIPTION to DROP+CREATE
SUBSCRIPTION and verified that the test passed consistently across the
50 runs that I did. Also, the increase in test execution time for this
change is negligible:
Without patch: 7.991 seconds
With test change patch: 8.121 seconds
The test changes for the same are attached.
[1]: /messages/by-id/631312.1707251789@sss.pgh.pa.us
Regards,
Vignesh
Attachments:
v3-0001-Fix-the-intermittent-Build-farm-failures-in-031_c.patch (text/x-patch)
On Wed, 7 Feb 2024 at 15:26, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Feb 7, 2024 at 2:06 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wrote:
More to the point, aren't these proposals just band-aids that
would stabilize the test without fixing the actual problem?
The same thing is likely to happen to people in the field,
unless we do something drastic like removing ALTER SUBSCRIPTION.
I've been able to make the 031_column_list.pl failure pretty
reproducible by adding a delay in walsender, as attached.
While I'm not too familiar with this code, it definitely does appear
that the new walsender is told to start up at an LSN before the
creation of the publication, and then if it needs to decide whether
to stream a particular data change before it's reached that creation,
kaboom!
I read and understood the upthread worries about it not being
a great idea to ignore publication lookup failures, but I really
don't see that we have much choice. As an example, if a subscriber
is humming along reading publication pub1, and then someone
drops and then recreates pub1 on the publisher, I don't think that
the subscriber will be able to advance through that gap if there
are any operations within it that require deciding if they should
be streamed.
Right. One idea to address those worries was to have a new
subscription option like ignore_nonexistant_pubs (or some better name
for such an option). The 'true' value of this new option means that we
will ignore the publication lookup failures and continue replication,
the 'false' means give an error as we are doing now. If we agree that
such an option is useful or at least saves us in some cases as
discussed in another thread [1], we can keep the default value as true
so that users don't face such errors by default and also have a way to
go back to current behavior.
(That is, contrary to Amit's expectation that
DROP/CREATE would mask the problem, I suspect it will instead turn
it into a hard failure. I've not experimented though.)
This is not contrary because I was suggesting to DROP/CREATE
Subscription whereas you are talking of drop and recreate of
Publication.
BTW, this same change breaks two other subscription tests:
015_stream.pl and 022_twophase_cascade.pl.
The symptoms are different (no "publication does not exist" errors),
so maybe these are just test problems not fundamental weaknesses.
As per the initial analysis, this is because those cases have somewhat
larger transactions (more than 64kB) under test so it just times out
waiting for all the data to be replicated. We will do further analysis
and share the findings.
Yes, these tests fail while waiting for the larger transactions to be
replicated within 180 seconds, as the transactions need more time to
replicate because of the added sleep. To verify this I tried a couple
of things: a) I increased the timeout to a higher value and verified
that both tests run successfully with an 1800-second timeout; b) I
reduced the sleep to 1000 microseconds and verified that both tests run
successfully.
So I feel that 015_stream.pl and 022_twophase_cascade.pl failing after
the sleep is added can be ignored.
Regards,
Vignesh
Coincidentally, one of my buildfarm animals hung several weeks ago in a
different test, 035_standby_logical_decoding.pl. A LOG_SNAPSHOT_INTERVAL_MS
reduction was part of making it reproducible:
On Fri, Feb 02, 2024 at 04:01:45PM -0800, Noah Misch wrote:
On Fri, Feb 02, 2024 at 02:30:03PM -0800, Noah Misch wrote:
On Fri, Feb 02, 2024 at 05:07:14PM -0500, Tom Lane wrote:
If you look at the buildfarm's failures page and filter down to
just subscriptionCheck failures, what you find is that all of the
last 6 such failures are in 031_column_list.pl:
/messages/by-id/16d6d9cc-f97d-0b34-be65-425183ed3721@gmail.com
reported a replacement BgWriterDelay value reproducing it.
Correction: the recipe changes LOG_SNAPSHOT_INTERVAL_MS in addition to
BgWriterDelay.
I'm reusing this thread just in case there's overlap with the
031_column_list.pl cause and fix. The 035_standby_logical_decoding.pl hang is
a race condition arising from an event sequence like this:
- Test script sends CREATE SUBSCRIPTION to subscriber, which loses the CPU.
- Test script calls pg_log_standby_snapshot() on primary. Emits XLOG_RUNNING_XACTS.
- checkpoint_timeout makes a primary checkpoint finish. Emits XLOG_RUNNING_XACTS.
- bgwriter executes LOG_SNAPSHOT_INTERVAL_MS logic. Emits XLOG_RUNNING_XACTS.
- CREATE SUBSCRIPTION wakes up and sends CREATE_REPLICATION_SLOT to standby.
Other test code already has a solution for this, so the attached patches add a
timeout and copy the existing solution. I'm also attaching the hack that
makes it 100% reproducible.
On Wed, 7 Feb 2024 at 16:27, vignesh C <vignesh21@gmail.com> wrote:
On Wed, 7 Feb 2024 at 15:21, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Feb 6, 2024 at 8:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Amit Kapila <amit.kapila16@gmail.com> writes:
Yeah, I was worried about that. The other idea I have previously
thought was to change Alter Subscription to Drop+Create Subscription.
That should also help in bringing stability without losing any
functionality.
Hm, why would that fix it?
Because for new subscriptions, we will start reading WAL from the
latest WAL insert pointer on the publisher which will be after the
point where publication is created.
I was able to reproduce the issue consistently with the changes shared
by Tom Lane at [1].
I have made changes to change ALTER SUBSCRIPTION to DROP+CREATE
SUBSCRIPTION and verified that the test has passed consistently for
50 runs that I ran. Also the test execution time increased for this
case is very negligibly:
Without patch: 7.991 seconds
With test change patch: 8.121 seconds
The test changes for the same are attached.
Alternatively, this could also be fixed with the changes proposed by
Amit at [1]. In this case we ignore publications that are not found
for the purpose of computing RelSyncEntry attributes. We won't mark
such an entry as valid till all the publications are loaded without
anything missing. This means we won't publish operations on tables
corresponding to that publication till we find such a publication, and
that seems okay.
At [2], Tomas had raised a performance issue with that approach: it
forces us to reload the publications for every replicated change/row
while they are invalid. How about keeping the default behavior as it
is and providing a new option, skip_not_exist_publication, while
creating/altering a subscription? In this case, if
skip_not_exist_publication is specified, we will ignore the case where
a publication is not present; such publications will be kept invalid
and get validated later.
The attached patch has the changes for the same. Thoughts?
[1]: /messages/by-id/CAA4eK1+T-ETXeRM4DHWzGxBpKafLCp__5bPA_QZfFQp7-0wj4Q@mail.gmail.com
[2]: /messages/by-id/dc08add3-10a8-738b-983a-191c7406707b@enterprisedb.com
Regards,
Vignesh
Attachments:
Column_list_failure_fix_by_skip_not_exist_publication_option.patch (text/x-patch)
Hi,
On Sat, Feb 10, 2024 at 05:02:27PM -0800, Noah Misch wrote:
Coincidentally, one of my buildfarm animals hanged several weeks in a
different test, 035_standby_logical_decoding.pl. A LOG_SNAPSHOT_INTERVAL_MS
reduction was part of making it reproducible:
On Fri, Feb 02, 2024 at 04:01:45PM -0800, Noah Misch wrote:
On Fri, Feb 02, 2024 at 02:30:03PM -0800, Noah Misch wrote:
On Fri, Feb 02, 2024 at 05:07:14PM -0500, Tom Lane wrote:
If you look at the buildfarm's failures page and filter down to
just subscriptionCheck failures, what you find is that all of the
last 6 such failures are in 031_column_list.pl:
/messages/by-id/16d6d9cc-f97d-0b34-be65-425183ed3721@gmail.com
reported a replacement BgWriterDelay value reproducing it.
Correction: the recipe changes LOG_SNAPSHOT_INTERVAL_MS in addition to
BgWriterDelay.
I'm reusing this thread just in case there's overlap with the
031_column_list.pl cause and fix. The 035_standby_logical_decoding.pl hang is
a race condition arising from an event sequence like this:
- Test script sends CREATE SUBSCRIPTION to subscriber, which loses the CPU.
- Test script calls pg_log_standby_snapshot() on primary. Emits XLOG_RUNNING_XACTS.
- checkpoint_timeout makes a primary checkpoint finish. Emits XLOG_RUNNING_XACTS.
- bgwriter executes LOG_SNAPSHOT_INTERVAL_MS logic. Emits XLOG_RUNNING_XACTS.
- CREATE SUBSCRIPTION wakes up and sends CREATE_REPLICATION_SLOT to standby.
Other test code already has a solution for this, so the attached patches add a
timeout and copy the existing solution. I'm also attaching the hack that
makes it 100% reproducible.
Thanks!
I did a few tests and confirm that the proposed solution fixes the corner case.
standby-slot-test-1-timeout-v1.patch LGTM.
Regarding standby-slot-test-2-race-v1.patch:
+# See corresponding create_logical_slot_on_standby() code.
+$node_standby->poll_query_until(
+	'postgres', qq[
+	SELECT restart_lsn IS NOT NULL
+	FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub'
+	]) or die "timed out waiting for logical slot to calculate its restart_lsn";
+
What about creating a sub, say wait_for_restart_lsn_calculation(), in Cluster.pm,
and then making use of it in create_logical_slot_on_standby() and above? (something
like wait_for_restart_lsn_calculation-v1.patch attached).
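Based on the polling code quoted above, such a helper might look roughly like this (a sketch of the suggestion, not the attached patch itself):

```perl
# Proposed Cluster.pm helper: block until the named replication slot
# has computed its restart_lsn, so callers can't outrun slot setup.
sub wait_for_restart_lsn_calculation
{
	my ($self, $slot_name) = @_;
	$self->poll_query_until(
		'postgres', qq[
		SELECT restart_lsn IS NOT NULL
		FROM pg_catalog.pg_replication_slots
		WHERE slot_name = '$slot_name'
		])
	  or die "timed out waiting for slot \"$slot_name\" to calculate its restart_lsn";
	return;
}
```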
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com