Why is subscription/t/031_column_list.pl failing so much?
If you look at the buildfarm's failures page and filter down to
just subscriptionCheck failures, what you find is that all of the
last 6 such failures are in 031_column_list.pl:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-02-02%2019%3A33%3A16
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-02-02%2011%3A21%3A44
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-02-01%2020%3A34%3A29
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-02-01%2016%3A57%3A14
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-01-31%2022%3A18%3A24
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-01-30%2011%3A29%3A23
There are some further back too:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2023-11-17%2018%3A28%3A24
but this definitely got way more common in the last few days.
Digging down into the logs, these all look pretty similar. Somehow
things get into a state where replication connections fail with
the publisher reporting "publication does not exist":
2024-02-02 19:42:23.187 UTC [1631708][not initialized][:0] LOG: connection received: host=[local]
2024-02-02 19:42:23.189 UTC [1631708][walsender][4/287:0] LOG: connection authenticated: user="bf" method=trust (/home/bf/bf-build/tamandua/HEAD/pgsql.build/testrun/subscription/031_column_list/data/t_031_column_list_publisher_data/pgdata/pg_hba.conf:117)
2024-02-02 19:42:23.189 UTC [1631708][walsender][4/287:0] LOG: replication connection authorized: user=bf application_name=sub1
2024-02-02 19:42:23.214 UTC [1631708][walsender][4/288:0] LOG: statement: SELECT pg_catalog.set_config('search_path', '', false);
2024-02-02 19:42:23.226 UTC [1631708][walsender][4/0:0] LOG: received replication command: IDENTIFY_SYSTEM
2024-02-02 19:42:23.226 UTC [1631708][walsender][4/0:0] STATEMENT: IDENTIFY_SYSTEM
2024-02-02 19:42:23.226 UTC [1631708][walsender][4/0:0] LOG: received replication command: START_REPLICATION SLOT "sub1" LOGICAL 0/15BCDD0 (proto_version '4', origin 'any', publication_names '"pub7"')
2024-02-02 19:42:23.226 UTC [1631708][walsender][4/0:0] STATEMENT: START_REPLICATION SLOT "sub1" LOGICAL 0/15BCDD0 (proto_version '4', origin 'any', publication_names '"pub7"')
2024-02-02 19:42:23.226 UTC [1631708][walsender][4/0:0] LOG: acquired logical replication slot "sub1"
2024-02-02 19:42:23.226 UTC [1631708][walsender][4/0:0] STATEMENT: START_REPLICATION SLOT "sub1" LOGICAL 0/15BCDD0 (proto_version '4', origin 'any', publication_names '"pub7"')
2024-02-02 19:42:23.242 UTC [1631708][walsender][4/0:0] LOG: starting logical decoding for slot "sub1"
2024-02-02 19:42:23.242 UTC [1631708][walsender][4/0:0] DETAIL: Streaming transactions committing after 0/15BCDD0, reading WAL from 0/15BCDD0.
2024-02-02 19:42:23.242 UTC [1631708][walsender][4/0:0] STATEMENT: START_REPLICATION SLOT "sub1" LOGICAL 0/15BCDD0 (proto_version '4', origin 'any', publication_names '"pub7"')
2024-02-02 19:42:23.243 UTC [1631708][walsender][4/0:0] LOG: logical decoding found consistent point at 0/15BCDD0
2024-02-02 19:42:23.243 UTC [1631708][walsender][4/0:0] DETAIL: There are no running transactions.
2024-02-02 19:42:23.243 UTC [1631708][walsender][4/0:0] STATEMENT: START_REPLICATION SLOT "sub1" LOGICAL 0/15BCDD0 (proto_version '4', origin 'any', publication_names '"pub7"')
2024-02-02 19:42:23.244 UTC [1631708][walsender][4/0:0] ERROR: publication "pub7" does not exist
2024-02-02 19:42:23.244 UTC [1631708][walsender][4/0:0] CONTEXT: slot "sub1", output plugin "pgoutput", in the change callback, associated LSN 0/15C7698
2024-02-02 19:42:23.244 UTC [1631708][walsender][4/0:0] STATEMENT: START_REPLICATION SLOT "sub1" LOGICAL 0/15BCDD0 (proto_version '4', origin 'any', publication_names '"pub7"')
2024-02-02 19:42:23.244 UTC [1631708][walsender][4/0:0] LOG: released logical replication slot "sub1"
2024-02-02 19:42:23.834 UTC [1631708][walsender][:0] LOG: disconnection: session time: 0:00:00.647 user=bf database=postgres host=[local]
and then we just repeat that until the test times out. It fails at
different points in the test script (hence, different publication
names), but the pattern looks about the same.
I don't see anything that 031_column_list.pl is doing that is much
different from other subscription tests, so why is it the only one
failing? And more to the point, what's going wrong exactly?
I am suspicious that this somehow represents a failure of the
historical catalog decoding logic, but I don't see how that theory
explains this only breaking in one test script.
regards, tom lane
On Fri, Feb 02, 2024 at 05:07:14PM -0500, Tom Lane wrote:
If you look at the buildfarm's failures page and filter down to
just subscriptionCheck failures, what you find is that all of the
last 6 such failures are in 031_column_list.pl:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-02-02%2019%3A33%3A16
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-02-02%2011%3A21%3A44
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-02-01%2020%3A34%3A29
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-02-01%2016%3A57%3A14
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-01-31%2022%3A18%3A24
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-01-30%2011%3A29%3A23
There are some further back too:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2023-11-17%2018%3A28%3A24
but this definitely got way more common in the last few days.
I don't see anything that 031_column_list.pl is doing that is much
different from other subscription tests, so why is it the only one
failing? And more to the point, what's going wrong exactly?
I don't know, but
/messages/by-id/16d6d9cc-f97d-0b34-be65-425183ed3721@gmail.com
reported a replacement BgWriterDelay value reproducing it. That hasn't
reproduced it in ~10 runs on my machine, though.
I am suspicious that this somehow represents a failure of the
historical catalog decoding logic, but I don't see how that theory
explains this only breaking in one test script.
On Fri, Feb 02, 2024 at 02:30:03PM -0800, Noah Misch wrote:
On Fri, Feb 02, 2024 at 05:07:14PM -0500, Tom Lane wrote:
If you look at the buildfarm's failures page and filter down to
just subscriptionCheck failures, what you find is that all of the
last 6 such failures are in 031_column_list.pl:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-02-02%2019%3A33%3A16
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-02-02%2011%3A21%3A44
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-02-01%2020%3A34%3A29
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-02-01%2016%3A57%3A14
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-01-31%2022%3A18%3A24
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-01-30%2011%3A29%3A23
There are some further back too:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2023-11-17%2018%3A28%3A24
but this definitely got way more common in the last few days.
I don't see anything that 031_column_list.pl is doing that is much
different from other subscription tests, so why is it the only one
failing? And more to the point, what's going wrong exactly?
I don't know, but
/messages/by-id/16d6d9cc-f97d-0b34-be65-425183ed3721@gmail.com
reported a replacement BgWriterDelay value reproducing it.
Correction: the recipe changes LOG_SNAPSHOT_INTERVAL_MS in addition to
BgWriterDelay.
That hasn't reproduced it in ~10 runs on my machine, though.
After 207 successes, it did fail once for me.
I am suspicious that this somehow represents a failure of the
historical catalog decoding logic, but I don't see how that theory
explains this only breaking in one test script.
Noah Misch <noah@leadboat.com> writes:
On Fri, Feb 02, 2024 at 05:07:14PM -0500, Tom Lane wrote:
If you look at the buildfarm's failures page and filter down to
just subscriptionCheck failures, what you find is that all of the
last 6 such failures are in 031_column_list.pl:
...
I don't see anything that 031_column_list.pl is doing that is much
different from other subscription tests, so why is it the only one
failing? And more to the point, what's going wrong exactly?
I don't know, but
/messages/by-id/16d6d9cc-f97d-0b34-be65-425183ed3721@gmail.com
reported a replacement BgWriterDelay value reproducing it. That hasn't
reproduced it in ~10 runs on my machine, though.
Ah, thanks for that link. I like the theory proposed in that thread
that the walsender is starting up at an LSN somewhere before where
the publication is created. I'm tempted to add some more queries to
the test script to see if that can be proven.
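The suspected sequence can be sketched in the style of the TAP tests themselves (a sketch only; the table name and exact statements are hypothetical, not copied from 031_column_list.pl):

```perl
# Publisher creates the publication and immediately makes a change.
$node_publisher->safe_psql('postgres',
	"CREATE PUBLICATION pub7 FOR TABLE test_tab");
$node_publisher->safe_psql('postgres',
	"INSERT INTO test_tab VALUES (1)");

# The subscription is repointed right away.  The restarted walsender
# resumes from the slot's restart_lsn, which may lie *before* the
# CREATE PUBLICATION above; when decoding reaches the INSERT, the
# historic catalog snapshot predates pub7, producing
#   ERROR: publication "pub7" does not exist
$node_subscriber->safe_psql('postgres',
	"ALTER SUBSCRIPTION sub1 SET PUBLICATION pub7");
```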
I'm still wondering how come the failure seems to have suddenly gotten
way more common. The only changes that are in vaguely-related places
and fit the time frame are Amit's 732924043 and 776621a5e, but I sure
don't see a connection.
regards, tom lane
Hello Tom and Noah,
03.02.2024 04:24, Tom Lane wrote:
I'm still wondering how come the failure seems to have suddenly gotten
way more common. The only changes that are in vaguely-related places
and fit the time frame are Amit's 732924043 and 776621a5e, but I sure
don't see a connection.
I think the failure rate increased because tamandua, calliphoridae,
flaviventris, and kestrel were switched from make to meson recently.
The last `make` builds for them:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-01-31%2016%3A51%3A31
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-01-31%2016%3A51%3A38
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-01-31%2016%3A52%3A37
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-01-31%2016%3A51%3A53
and since that switch the 031_column_list duration increased significantly,
e.g., on the last tamandua `make` run it executed for 7 seconds, but
successful `meson` runs give much longer duration:
280/283 postgresql:subscription / subscription/031_column_list OK 38.27s 36 subtests passed
280/283 postgresql:subscription / subscription/031_column_list OK 126.13s 36 subtests passed
280/283 postgresql:subscription / subscription/031_column_list OK 31.93s 36 subtests passed
279/283 postgresql:subscription / subscription/031_column_list OK 99.76s 36 subtests passed
So, looking at the tamandua's failure log, I see:
2024-02-02 19:41:19.750 UTC [1579219][postmaster][:0] LOG: starting PostgreSQL 17devel on x86_64-linux, compiled by
gcc-12.3.0, 64-bit
...
2024-02-02 19:42:19.973 UTC [1629333][client backend][4/213:0] LOG: statement: ALTER SUBSCRIPTION sub1 SET PUBLICATION pub7
2024-02-02 19:42:20.131 UTC [1625765][logical replication apply worker][3/122:0] LOG: logical replication worker for
subscription "sub1" will restart because of a parameter change
2024-02-02 19:42:20.137 UTC [1629333][client backend][:0] LOG: disconnection: session time: 0:00:00.212 user=bf
database=postgres host=[local]
2024-02-02 19:42:20.191 UTC [1629535][logical replication apply worker][3/124:0] LOG: logical replication apply worker
for subscription "sub1" has started
...
2024-02-02 19:42:20.445 UTC [1629535][logical replication apply worker][3/0:0] ERROR: could not receive data from WAL
stream: ERROR: publication "pub7" does not exist
CONTEXT: slot "sub1", output plugin "pgoutput", in the change callback, associated LSN 0/15C7698
(The interval between subscriber start and the error is ~ 4 * 15 seconds.)
Thus it still may be explained by bgwriter activity, though perhaps
autovacuum/checkpointer can add something as well.
Best regards,
Alexander
On Sat, Feb 3, 2024 at 12:30 PM Alexander Lakhin <exclusion@gmail.com> wrote:
03.02.2024 04:24, Tom Lane wrote:
I'm still wondering how come the failure seems to have suddenly gotten
way more common. The only changes that are in vaguely-related places
and fit the time frame are Amit's 732924043 and 776621a5e, but I sure
don't see a connection.
I think the failure rate increased because tamandua, calliphoridae,
flaviventris, and kestrel were switched from make to meson recently.
If this can be reproduced frequently then we can even try to test the
patch in that thread by Osumi-San [1] (I haven't tested that it
applies cleanly but shouldn't be difficult to make it work) based on
the theory that walsender is starting up at an LSN somewhere before
where the publication is created.
[1]: /messages/by-id/TYCPR01MB83737A68CD5D554EA82BD7B9EDD39@TYCPR01MB8373.jpnprd01.prod.outlook.com
--
With Regards,
Amit Kapila.
Hello Amit,
05.02.2024 15:20, Amit Kapila wrote:
If this can be reproduced frequently then we can even try to test the
patch in that thread by Osumi-San [1] (I haven't tested that it
applies cleanly but shouldn't be difficult to make it work) based on
the theory that walsender is starting up at an LSN somewhere before
where the publication is created.
[1] - /messages/by-id/TYCPR01MB83737A68CD5D554EA82BD7B9EDD39@TYCPR01MB8373.jpnprd01.prod.outlook.com
Yes, with the aforementioned modification of bgwriter.c and when running
20 tests in parallel, I got failures on iterations 20, 3, 21, ..., but with the
updated Osumi-San's patch (which adds wait_for_catchup('sub1') before every
ALTER SUBSCRIPTION sub1 SET PUBLICATION ...) applied, 300 iterations ran
with no failures.
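For reference, the stabilization pattern in the updated patch amounts to something like this around each publication switch (a sketch; the actual call sites are in 031_column_list.pl, and the table definition here is hypothetical):

```perl
$node_publisher->safe_psql('postgres',
	"CREATE PUBLICATION pub7 FOR TABLE test_tab");

# Wait until sub1's walsender has replayed all WAL written so far, so
# its decoding position is past the publication's creation.
$node_publisher->wait_for_catchup('sub1');

$node_subscriber->safe_psql('postgres',
	"ALTER SUBSCRIPTION sub1 SET PUBLICATION pub7");
```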
Best regards,
Alexander
Attachments:
v2-0001-Fix-the-intermittent-Build-farm-failures-in-031_c.patch (text/x-patch)
On Tue, Feb 6, 2024 at 8:30 AM Alexander Lakhin <exclusion@gmail.com> wrote:
05.02.2024 15:20, Amit Kapila wrote:
If this can be reproduced frequently then we can even try to test the
patch in that thread by Osumi-San [1] (I haven't tested that it
applies cleanly but shouldn't be difficult to make it work) based on
the theory that walsender is starting up at an LSN somewhere before
where the publication is created.
[1] - /messages/by-id/TYCPR01MB83737A68CD5D554EA82BD7B9EDD39@TYCPR01MB8373.jpnprd01.prod.outlook.com
Yes, with the aforementioned modification of bgwriter.c and when running
20 tests in parallel, I got failures on iterations 20, 3, 21, ..., but with the
updated Osumi-San's patch (which adds wait_for_catchup('sub1') before every
ALTER SUBSCRIPTION sub1 SET PUBLICATION ...) applied, 300 iterations ran
with no failures.
cool, is it possible to see whether this patch changes the runtime of
this test in any noticeable way?
--
With Regards,
Amit Kapila.
06.02.2024 09:48, Amit Kapila wrote:
cool, is it possible to see whether this patch changes the runtime of
this test in any noticeable way?
Yes, unfortunately it does.
I've measured duration of 100 tests runs without the patch (with pristine
bgwriter and with NO_TEMP_INSTALL):
real 6m46,031s
real 6m52,406s
real 6m51,014s
But with the patched test, I've got:
real 9m39,872s
real 9m40,044s
real 9m38,236s
(nearly 2 seconds increase per one test run)
Under Valgrind, the original test run takes:
Files=1, Tests=36, 334 wallclock secs ( 0.02 usr 0.00 sys + 163.14 cusr 7.98 csys = 171.14 CPU)
But the patched one:
Files=1, Tests=36, 368 wallclock secs ( 0.02 usr 0.00 sys + 182.16 cusr 8.90 csys = 191.08 CPU)
(30 seconds increase)
Maybe the more CPU-efficient solution would be disabling bgwriter, as was
proposed in another discussion of test instability:
/messages/by-id/ZaTxhjnPygOdosJ4@ip-10-97-1-34.eu-west-3.compute.internal
Though I think that devising a way to control bgwriter may take more time
than we can afford given the current 031 failure rate on the buildfarm
(17 failures for the last 3 days).
Best regards,
Alexander
On Tue, Feb 6, 2024 at 2:30 PM Alexander Lakhin <exclusion@gmail.com> wrote:
06.02.2024 09:48, Amit Kapila wrote:
cool, is it possible to see whether this patch changes the runtime of
this test in any noticeable way?
Yes, unfortunately it does.
I've measured duration of 100 tests runs without the patch (with pristine
bgwriter and with NO_TEMP_INSTALL):
real 6m46,031s
real 6m52,406s
real 6m51,014s
But with the patched test, I've got:
real 9m39,872s
real 9m40,044s
real 9m38,236s
(nearly 2 seconds increase per one test run)
Under Valgrind, the original test run takes:
Files=1, Tests=36, 334 wallclock secs ( 0.02 usr 0.00 sys + 163.14 cusr 7.98 csys = 171.14 CPU)
But the patched one:
Files=1, Tests=36, 368 wallclock secs ( 0.02 usr 0.00 sys + 182.16 cusr 8.90 csys = 191.08 CPU)
(30 seconds increase)
Yeah, I was worried about that. The other idea I had previously
considered was to change Alter Subscription to Drop+Create Subscription.
That should also help in bringing stability without losing any
functionality.
--
With Regards,
Amit Kapila.
Amit Kapila <amit.kapila16@gmail.com> writes:
Yeah, I was worried about that. The other idea I have previously
thought was to change Alter Subscription to Drop+Create Subscription.
That should also help in bringing stability without losing any
functionality.
Hm, why would that fix it?
More to the point, aren't these proposals just band-aids that
would stabilize the test without fixing the actual problem?
The same thing is likely to happen to people in the field,
unless we do something drastic like removing ALTER SUBSCRIPTION.
regards, tom lane
I wrote:
More to the point, aren't these proposals just band-aids that
would stabilize the test without fixing the actual problem?
The same thing is likely to happen to people in the field,
unless we do something drastic like removing ALTER SUBSCRIPTION.
I've been able to make the 031_column_list.pl failure pretty
reproducible by adding a delay in walsender, as attached.
While I'm not too familiar with this code, it definitely does appear
that the new walsender is told to start up at an LSN before the
creation of the publication, and then if it needs to decide whether
to stream a particular data change before it's reached that creation,
kaboom!
I read and understood the upthread worries about it not being
a great idea to ignore publication lookup failures, but I really
don't see that we have much choice. As an example, if a subscriber
is humming along reading publication pub1, and then someone
drops and then recreates pub1 on the publisher, I don't think that
the subscriber will be able to advance through that gap if there
are any operations within it that require deciding if they should
be streamed. (That is, contrary to Amit's expectation that
DROP/CREATE would mask the problem, I suspect it will instead turn
it into a hard failure. I've not experimented though.)
BTW, this same change breaks two other subscription tests:
015_stream.pl and 022_twophase_cascade.pl.
The symptoms are different (no "publication does not exist" errors),
so maybe these are just test problems not fundamental weaknesses.
But "replication falls over if the walsender is slow" isn't
something I'd call acceptable.
regards, tom lane
Attachments:
hack-add-delay-in-walsender-loop.patch (text/x-diff)
On Tue, 6 Feb 2024 at 08:30, Alexander Lakhin <exclusion@gmail.com> wrote:
Hello Amit,
05.02.2024 15:20, Amit Kapila wrote:
If this can be reproduced frequently then we can even try to test the
patch in that thread by Osumi-San [1] (I haven't tested that it
applies cleanly but shouldn't be difficult to make it work) based on
the theory that walsender is starting up at an LSN somewhere before
where the publication is created.
[1] - /messages/by-id/TYCPR01MB83737A68CD5D554EA82BD7B9EDD39@TYCPR01MB8373.jpnprd01.prod.outlook.com
Yes, with the aforementioned modification of bgwriter.c and when running
20 tests in parallel, I got failures on iterations 20, 3, 21, ..., but with the
updated Osumi-San's patch (which adds wait_for_catchup('sub1') before every
ALTER SUBSCRIPTION sub1 SET PUBLICATION ...) applied, 300 iterations ran
with no failures.
I was able to reproduce the issue with the patch changes suggested for
bgwriter, but for me it failed only on the 287th run.
Then I ran the test 1000 times with the test changes shared at [1],
and the test passed all 1000 times.
I measured the test execution time (averaged over 10 runs) and found
that it takes about 1.2 seconds longer with the changes:
Without patch: 8.454 seconds
With test change patch: 9.672 seconds
For the test execution comparison I used a machine with a total memory
of 755.536 GB, 120 CPUs, and the RHEL 7 operating system.
[1]: /messages/by-id/e6ce3cf7-4025-f129-e3ac-0f778469f720@gmail.com
Regards,
Vignesh
On Tue, Feb 6, 2024 at 8:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Amit Kapila <amit.kapila16@gmail.com> writes:
Yeah, I was worried about that. The other idea I have previously
thought was to change Alter Subscription to Drop+Create Subscription.
That should also help in bringing stability without losing any
functionality.
Hm, why would that fix it?
Because for new subscriptions, we will start reading WAL from the
latest WAL insert pointer on the publisher which will be after the
point where publication is created.
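In test terms, the idea would look roughly like this (a sketch, not a committed change; $publisher_connstr stands in for the publisher's connection string):

```perl
# Recreating the subscription makes the publisher build a fresh slot
# whose start point is the current WAL insert position, i.e. after the
# just-executed CREATE PUBLICATION, so decoding never uses a catalog
# snapshot that predates the publication.
$node_subscriber->safe_psql('postgres',
	"DROP SUBSCRIPTION sub1");
$node_subscriber->safe_psql('postgres',
	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' "
	. "PUBLICATION pub7");
```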
More to the point, aren't these proposals just band-aids that
would stabilize the test without fixing the actual problem?
Yes, but OTOH, this behavior has existed since the beginning of logical
replication. This particular test has just exposed it, so keeping the BF
failing on this particular test doesn't sound like the best way to
remember it.
--
With Regards,
Amit Kapila.
On Wed, Feb 7, 2024 at 2:06 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wrote:
More to the point, aren't these proposals just band-aids that
would stabilize the test without fixing the actual problem?
The same thing is likely to happen to people in the field,
unless we do something drastic like removing ALTER SUBSCRIPTION.
I've been able to make the 031_column_list.pl failure pretty
reproducible by adding a delay in walsender, as attached.
While I'm not too familiar with this code, it definitely does appear
that the new walsender is told to start up at an LSN before the
creation of the publication, and then if it needs to decide whether
to stream a particular data change before it's reached that creation,
kaboom!
I read and understood the upthread worries about it not being
a great idea to ignore publication lookup failures, but I really
don't see that we have much choice. As an example, if a subscriber
is humming along reading publication pub1, and then someone
drops and then recreates pub1 on the publisher, I don't think that
the subscriber will be able to advance through that gap if there
are any operations within it that require deciding if they should
be streamed.
Right. One idea to address those worries was to have a new
subscription option like ignore_nonexistant_pubs (or some better name
for such an option). The 'true' value of this new option means that we
will ignore the publication lookup failures and continue replication,
the 'false' means give an error as we are doing now. If we agree that
such an option is useful or at least saves us in some cases as
discussed in another thread [1], we can keep the default value as true
so that users don't face such errors by default and also have a way to
go back to current behavior.
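If such an option existed, usage might look like this (purely hypothetical syntax; only the option name comes from the proposal above, and no such option exists today):

```perl
# Opting back into today's strict behavior under the proposed knob:
$node_subscriber->safe_psql('postgres',
	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' "
	. "PUBLICATION pub7 WITH (ignore_nonexistant_pubs = false)");
```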
(That is, contrary to Amit's expectation that
DROP/CREATE would mask the problem, I suspect it will instead turn
it into a hard failure. I've not experimented though.)
This is not contrary because I was suggesting to DROP/CREATE
Subscription whereas you are talking of drop and recreate of
Publication.
BTW, this same change breaks two other subscription tests:
015_stream.pl and 022_twophase_cascade.pl.
The symptoms are different (no "publication does not exist" errors),
so maybe these are just test problems not fundamental weaknesses.
As per the initial analysis, this is because those cases have somewhat
larger transactions (more than 64kB) under test so it just times out
waiting for all the data to be replicated. We will do further analysis
and share the findings.
--
With Regards,
Amit Kapila.
On Wed, 7 Feb 2024 at 15:21, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Feb 6, 2024 at 8:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Amit Kapila <amit.kapila16@gmail.com> writes:
Yeah, I was worried about that. The other idea I have previously
thought was to change Alter Subscription to Drop+Create Subscription.
That should also help in bringing stability without losing any
functionality.
Hm, why would that fix it?
Because for new subscriptions, we will start reading WAL from the
latest WAL insert pointer on the publisher which will be after the
point where publication is created.
I was able to reproduce the issue consistently with the changes shared
by Tom Lane at [1].
I have made changes to change ALTER SUBSCRIPTION to DROP+CREATE
SUBSCRIPTION and verified that the test passed consistently across the
50 runs that I did. Also, the increase in test execution time for this
change is negligible:
Without patch: 7.991 seconds
With test change patch: 8.121 seconds
The test changes for the same are attached.
[1]: /messages/by-id/631312.1707251789@sss.pgh.pa.us
Regards,
Vignesh
Attachments:
v3-0001-Fix-the-intermittent-Build-farm-failures-in-031_c.patch (text/x-patch)
On Wed, 7 Feb 2024 at 15:26, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Feb 7, 2024 at 2:06 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wrote:
More to the point, aren't these proposals just band-aids that
would stabilize the test without fixing the actual problem?
The same thing is likely to happen to people in the field,
unless we do something drastic like removing ALTER SUBSCRIPTION.
I've been able to make the 031_column_list.pl failure pretty
reproducible by adding a delay in walsender, as attached.
While I'm not too familiar with this code, it definitely does appear
that the new walsender is told to start up at an LSN before the
creation of the publication, and then if it needs to decide whether
to stream a particular data change before it's reached that creation,
kaboom!
I read and understood the upthread worries about it not being
a great idea to ignore publication lookup failures, but I really
don't see that we have much choice. As an example, if a subscriber
is humming along reading publication pub1, and then someone
drops and then recreates pub1 on the publisher, I don't think that
the subscriber will be able to advance through that gap if there
are any operations within it that require deciding if they should
be streamed.
Right. One idea to address those worries was to have a new
subscription option like ignore_nonexistant_pubs (or some better name
for such an option). The 'true' value of this new option means that we
will ignore the publication lookup failures and continue replication,
the 'false' means give an error as we are doing now. If we agree that
such an option is useful or at least saves us in some cases as
discussed in another thread [1], we can keep the default value as true
so that users don't face such errors by default and also have a way to
go back to current behavior.
(That is, contrary to Amit's expectation that
DROP/CREATE would mask the problem, I suspect it will instead turn
it into a hard failure. I've not experimented though.)
This is not contrary because I was suggesting to DROP/CREATE
Subscription whereas you are talking of drop and recreate of
Publication.
BTW, this same change breaks two other subscription tests:
015_stream.pl and 022_twophase_cascade.pl.
The symptoms are different (no "publication does not exist" errors),
so maybe these are just test problems not fundamental weaknesses.
As per the initial analysis, this is because those cases have somewhat
larger transactions (more than 64kB) under test so it just times out
waiting for all the data to be replicated. We will do further analysis
and share the findings.
Yes, these tests fail while waiting for the larger transactions to be
replicated within 180 seconds, as the transactions need more time to
replicate because of the added sleep. To verify this I tried a couple
of things: a) I increased the timeout to a higher value and verified
that both tests run successfully with an 1800-second timeout; b) I
reduced the sleep to 1000 microseconds and verified that both tests run
successfully.
So I feel that 015_stream.pl and 022_twophase_cascade.pl failing after
the sleep is added can be ignored.
Regards,
Vignesh
Coincidentally, one of my buildfarm animals hung several weeks ago in a
different test, 035_standby_logical_decoding.pl. A LOG_SNAPSHOT_INTERVAL_MS
reduction was part of making it reproducible:
On Fri, Feb 02, 2024 at 04:01:45PM -0800, Noah Misch wrote:
On Fri, Feb 02, 2024 at 02:30:03PM -0800, Noah Misch wrote:
On Fri, Feb 02, 2024 at 05:07:14PM -0500, Tom Lane wrote:
If you look at the buildfarm's failures page and filter down to
just subscriptionCheck failures, what you find is that all of the
last 6 such failures are in 031_column_list.pl:
/messages/by-id/16d6d9cc-f97d-0b34-be65-425183ed3721@gmail.com
reported a replacement BgWriterDelay value reproducing it.
Correction: the recipe changes LOG_SNAPSHOT_INTERVAL_MS in addition to
BgWriterDelay.
I'm reusing this thread just in case there's overlap with the
031_column_list.pl cause and fix. The 035_standby_logical_decoding.pl hang is
a race condition arising from an event sequence like this:
- Test script sends CREATE SUBSCRIPTION to subscriber, which loses the CPU.
- Test script calls pg_log_standby_snapshot() on primary. Emits XLOG_RUNNING_XACTS.
- checkpoint_timeout makes a primary checkpoint finish. Emits XLOG_RUNNING_XACTS.
- bgwriter executes LOG_SNAPSHOT_INTERVAL_MS logic. Emits XLOG_RUNNING_XACTS.
- CREATE SUBSCRIPTION wakes up and sends CREATE_REPLICATION_SLOT to standby.
Other test code already has a solution for this, so the attached patches add a
timeout and copy the existing solution. I'm also attaching the hack that
makes it 100% reproducible.
On Wed, 7 Feb 2024 at 16:27, vignesh C <vignesh21@gmail.com> wrote:
On Wed, 7 Feb 2024 at 15:21, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Feb 6, 2024 at 8:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Amit Kapila <amit.kapila16@gmail.com> writes:
Yeah, I was worried about that. The other idea I have previously
thought was to change Alter Subscription to Drop+Create Subscription.
That should also help in bringing stability without losing any
functionality.
Hm, why would that fix it?
Because for new subscriptions, we will start reading WAL from the
latest WAL insert pointer on the publisher which will be after the
point where publication is created.
I was able to reproduce the issue consistently with the changes shared
by Tom Lane at [1].
I have made changes to change ALTER SUBSCRIPTION to DROP+CREATE
SUBSCRIPTION and verified that the test has passed consistently for
50 runs that I ran. Also the test execution time increased for this
case is very negligibly:
Without patch: 7.991 seconds
With test change patch: 8.121 seconds
The test changes for the same are attached.
Alternatively, this could also be fixed with the changes proposed by
Amit at [1]. In this case we ignore publications that are not found
for the purpose of computing RelSyncEntry attributes. We won't mark
such an entry as valid till all the publications are loaded without
anything missing. This means we won't publish operations on tables
corresponding to that publication till we find such a publication, and
that seems okay.
At [2], Tomas had raised a performance issue with that approach: it
forces us to reload the publications for every replicated change/row
while they are invalid. How about keeping the default behavior as it
is and providing a new option, skip_not_exist_publication, while
creating/altering a subscription? In this case, if
skip_not_exist_publication is specified, we will ignore the case where
a publication is not present; such publications will be kept invalid
and get validated later.
The attached patch has the changes for the same. Thoughts?
[1]: /messages/by-id/CAA4eK1+T-ETXeRM4DHWzGxBpKafLCp__5bPA_QZfFQp7-0wj4Q@mail.gmail.com
[2]: /messages/by-id/dc08add3-10a8-738b-983a-191c7406707b@enterprisedb.com
Regards,
Vignesh
Attachments:
Column_list_failure_fix_by_skip_not_exist_publication_option.patch (text/x-patch)
Hi,
On Sat, Feb 10, 2024 at 05:02:27PM -0800, Noah Misch wrote:
Coincidentally, one of my buildfarm animals hanged several weeks in a
different test, 035_standby_logical_decoding.pl. A LOG_SNAPSHOT_INTERVAL_MS
reduction was part of making it reproducible:
On Fri, Feb 02, 2024 at 04:01:45PM -0800, Noah Misch wrote:
On Fri, Feb 02, 2024 at 02:30:03PM -0800, Noah Misch wrote:
On Fri, Feb 02, 2024 at 05:07:14PM -0500, Tom Lane wrote:
If you look at the buildfarm's failures page and filter down to
just subscriptionCheck failures, what you find is that all of the
last 6 such failures are in 031_column_list.pl:
/messages/by-id/16d6d9cc-f97d-0b34-be65-425183ed3721@gmail.com
reported a replacement BgWriterDelay value reproducing it.
Correction: the recipe changes LOG_SNAPSHOT_INTERVAL_MS in addition to
BgWriterDelay.
I'm reusing this thread just in case there's overlap with the
031_column_list.pl cause and fix. The 035_standby_logical_decoding.pl hang is
a race condition arising from an event sequence like this:
- Test script sends CREATE SUBSCRIPTION to subscriber, which loses the CPU.
- Test script calls pg_log_standby_snapshot() on primary. Emits XLOG_RUNNING_XACTS.
- checkpoint_timeout makes a primary checkpoint finish. Emits XLOG_RUNNING_XACTS.
- bgwriter executes LOG_SNAPSHOT_INTERVAL_MS logic. Emits XLOG_RUNNING_XACTS.
- CREATE SUBSCRIPTION wakes up and sends CREATE_REPLICATION_SLOT to standby.
Other test code already has a solution for this, so the attached patches add a
timeout and copy the existing solution. I'm also attaching the hack that
makes it 100% reproducible.
Thanks!
I did a few tests and confirm that the proposed solution fixes the corner case.
standby-slot-test-1-timeout-v1.patch LGTM.
Regarding standby-slot-test-2-race-v1.patch:
+# See corresponding create_logical_slot_on_standby() code.
+$node_standby->poll_query_until(
+	'postgres', qq[
+	SELECT restart_lsn IS NOT NULL
+	FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub'
+	]) or die "timed out waiting for logical slot to calculate its restart_lsn";
+
What about creating a sub, say wait_for_restart_lsn_calculation(), in Cluster.pm,
and then making use of it in create_logical_slot_on_standby() and above? (something
like wait_for_restart_lsn_calculation-v1.patch attached).
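Based on the polling code quoted above, such a helper might look roughly like this (a sketch of the suggestion, not the attached patch itself):

```perl
# Proposed Cluster.pm helper: block until the named replication slot
# has computed its restart_lsn, so callers can't outrun slot setup.
sub wait_for_restart_lsn_calculation
{
	my ($self, $slot_name) = @_;
	$self->poll_query_until(
		'postgres', qq[
		SELECT restart_lsn IS NOT NULL
		FROM pg_catalog.pg_replication_slots
		WHERE slot_name = '$slot_name'
		])
	  or die "timed out waiting for slot \"$slot_name\" to calculate its restart_lsn";
	return;
}
```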
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com