subscription/026_stats test is intermittently slow?
I just did a run of the regression test where this test was the last
one to finish by quite a lot. Key log entries:
[13:35:48.583](0.039s) # initializing database system by copying initdb template
...
[13:35:52.397](0.108s) ok 5 - Check reset timestamp for 'test_tab1_sub' is newer after second reset.
#### Begin standard error
psql:<stdin>:1: NOTICE: created replication slot "test_tab2_sub" on publisher
#### End standard error
Waiting for replication conn test_tab2_sub's replay_lsn to pass 0/151E8C8 on publisher
done
[13:38:53.706](181.310s) ok 6 - Check that table 'test_tab2' now has 1 row.
...
[13:38:54.344](0.294s) 1..13
I reran the test and it looks very different:
[13:54:01.703](0.090s) ok 5 - Check reset timestamp for 'test_tab1_sub' is newer after second reset.
...
Waiting for replication conn test_tab2_sub's replay_lsn to pass 0/151E900 on publisher
...
[13:54:03.006](1.303s) ok 6 - Check that table 'test_tab2' now has 1 row.
It looks to me like in the first run it took 3 minutes for the
replay_lsn to catch up to the desired value, and in the second run,
two seconds. I think I have seen previous instances where something
similar happened, although in those cases I did not stop to record any
details. Have others seen this? Is there something we can/should do
about it?
--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, Apr 19, 2024 at 01:57:41PM -0400, Robert Haas wrote:
It looks to me like in the first run it took 3 minutes for the
replay_lsn to catch up to the desired value, and in the second run,
two seconds. I think I have seen previous instances where something
similar happened, although in those cases I did not stop to record any
details. Have others seen this? Is there something we can/should do
about it?
FWIW, I've also seen delays with this test on a few occasions.
Thanks for mentioning it.
--
Michael
Hello Michael and Robert,
20.04.2024 05:57, Michael Paquier wrote:
On Fri, Apr 19, 2024 at 01:57:41PM -0400, Robert Haas wrote:
It looks to me like in the first run it took 3 minutes for the
replay_lsn to catch up to the desired value, and in the second run,
two seconds. I think I have seen previous instances where something
similar happened, although in those cases I did not stop to record any
details. Have others seen this? Is there something we can/should do
about it?
FWIW, I've also seen delays with this test on a few occasions.
Thanks for mentioning it.
It reminds me of
/messages/by-id/858a7622-2c81-1687-d1df-1322dfcb2e72@gmail.com
At least, I could reproduce such a delay with the attached patch applied.
Best regards,
Alexander
Attachments:
026_stats-180.patch (text/x-patch)
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 66070e9131..9368a3a897 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -227,6 +227,7 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
 		if (rc & WL_LATCH_SET)
 		{
+			pg_usleep(300000);
 			ResetLatch(MyLatch);
 			CHECK_FOR_INTERRUPTS();
 		}
diff --git a/src/test/subscription/t/026_stats.pl b/src/test/subscription/t/026_stats.pl
index d1d68fad9a..a82f1c051c 100644
--- a/src/test/subscription/t/026_stats.pl
+++ b/src/test/subscription/t/026_stats.pl
@@ -37,6 +37,7 @@ sub create_sub_pub_w_errors
 		$db,
 		qq[
 	BEGIN;
+	SELECT pg_sleep(0.5);
 	CREATE TABLE $table_name(a int primary key);
 	INSERT INTO $table_name VALUES (1);
 	COMMIT;
On Sat, 20 Apr 2024 at 10:30, Alexander Lakhin <exclusion@gmail.com> wrote:
Hello Michael and Robert,
20.04.2024 05:57, Michael Paquier wrote:
On Fri, Apr 19, 2024 at 01:57:41PM -0400, Robert Haas wrote:
It looks to me like in the first run it took 3 minutes for the
replay_lsn to catch up to the desired value, and in the second run,
two seconds. I think I have seen previous instances where something
similar happened, although in those cases I did not stop to record any
details. Have others seen this? Is there something we can/should do
about it?
FWIW, I've also seen delays with this test on a few occasions.
Thanks for mentioning it.
It reminds me of
/messages/by-id/858a7622-2c81-1687-d1df-1322dfcb2e72@gmail.com
Thanks Alexander for the test, I was able to reproduce the issue with
the test you shared and also verify that the patch at [1] fixes it.
This is the same issue: the apply worker for test_tab2_sub was only
getting started after 180 seconds because the common latch (which is
used to signal worker attach, subscription creation/modification, and
apply worker process exit) was getting reset when the other
subscription test_tab1_sub's worker was started. The same can be seen
from the logs:
2024-04-22 20:47:52.009 IST [323280] 026_stats.pl LOG: statement: BEGIN;
2024-04-22 20:47:52.009 IST [323280] 026_stats.pl LOG: statement: SELECT pg_sleep(0.5);
2024-04-22 20:47:52.426 IST [323281] LOG: logical replication apply worker for subscription "test_tab1_sub" has started
2024-04-22 20:47:52.511 IST [323280] 026_stats.pl LOG: statement: CREATE TABLE test_tab2(a int primary key);
2024-04-22 20:47:52.518 IST [323280] 026_stats.pl LOG: statement: INSERT INTO test_tab2 VALUES (1);
2024-04-22 20:47:52.519 IST [323280] 026_stats.pl LOG: statement: COMMIT;
2024-04-22 20:47:52.540 IST [323286] 026_stats.pl LOG: statement: CREATE SUBSCRIPTION test_tab2_sub CONNECTION 'port=56685 host=/tmp/RwzpQrVMYH dbname=postgres' PUBLICATION test_tab2_pub
2024-04-22 20:50:52.658 IST [326265] LOG: logical replication apply worker for subscription "test_tab2_sub" has started
2024-04-22 20:50:52.668 IST [326267] LOG: logical replication table synchronization worker for subscription "test_tab2_sub", table "test_tab2" has started
[1]: /messages/by-id/CALDaNm10R7L0Dxq+-J=Pp3AfM_yaokpbhECvJ69QiGH8-jQquw@mail.gmail.com
Regards,
Vignesh
Hi,
On 2024-04-19 13:57:41 -0400, Robert Haas wrote:
Have others seen this? Is there something we can/should do about it?
Yes, I've also seen this - but it was never quite reproducible enough to
tackle properly.
The first thing I'd like to do is to make the wait_for_catchup routine
regularly log the current state, so we can in retrospect analyze e.g. whether
there was continual, but slow, replay progress, or whether replay was entirely
stuck. wait_for_catchup() not being debuggable has been a problem in many
different tests, so I think it's high time to fix that.
Greetings,
Andres Freund
On Tue, Apr 23, 2024 at 11:49 AM vignesh C <vignesh21@gmail.com> wrote:
On Sat, 20 Apr 2024 at 10:30, Alexander Lakhin <exclusion@gmail.com> wrote:
Hello Michael and Robert,
20.04.2024 05:57, Michael Paquier wrote:
On Fri, Apr 19, 2024 at 01:57:41PM -0400, Robert Haas wrote:
It looks to me like in the first run it took 3 minutes for the
replay_lsn to catch up to the desired value, and in the second run,
two seconds. I think I have seen previous instances where something
similar happened, although in those cases I did not stop to record any
details. Have others seen this? Is there something we can/should do
about it?
FWIW, I've also seen delays with this test on a few occasions.
Thanks for mentioning it.
It reminds me of
/messages/by-id/858a7622-2c81-1687-d1df-1322dfcb2e72@gmail.com
Thanks Alexander for the test, I was able to reproduce the issue with
the test you shared and also verify that the patch at [1] fixes it.
One of the issues reported in the thread you referred to has the same
symptoms [1]. I'll review and analyze your proposal.
[1]: /messages/by-id/858a7622-2c81-1687-d1df-1322dfcb2e72@gmail.com
--
With Regards,
Amit Kapila.