subscription/026_stats test is intermittently slow?

Started by Robert Haas over 1 year ago, 6 messages
#1 Robert Haas
robertmhaas@gmail.com

I just did a run of the regression test where this test was the last
one to finish by quite a lot. Key log entries:

[13:35:48.583](0.039s) # initializing database system by copying initdb template
...
[13:35:52.397](0.108s) ok 5 - Check reset timestamp for
'test_tab1_sub' is newer after second reset.

#### Begin standard error

psql:<stdin>:1: NOTICE: created replication slot "test_tab2_sub" on publisher

#### End standard error

Waiting for replication conn test_tab2_sub's replay_lsn to pass
0/151E8C8 on publisher

done

[13:38:53.706](181.310s) ok 6 - Check that table 'test_tab2' now has 1 row.
...
[13:38:54.344](0.294s) 1..13

I reran the test and it looks very different:

[13:54:01.703](0.090s) ok 5 - Check reset timestamp for
'test_tab1_sub' is newer after second reset.
...
Waiting for replication conn test_tab2_sub's replay_lsn to pass
0/151E900 on publisher
...
[13:54:03.006](1.303s) ok 6 - Check that table 'test_tab2' now has 1 row.

It looks to me like in the first run it took 3 minutes for the
replay_lsn to catch up to the desired value, and in the second run,
two seconds. I think I have seen previous instances where something
similar happened, although in those cases I did not stop to record any
details. Have others seen this? Is there something we can/should do
about it?

--
Robert Haas
EDB: http://www.enterprisedb.com

#2 Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#1)
Re: subscription/026_stats test is intermittently slow?

On Fri, Apr 19, 2024 at 01:57:41PM -0400, Robert Haas wrote:

It looks to me like in the first run it took 3 minutes for the
replay_lsn to catch up to the desired value, and in the second run,
two seconds. I think I have seen previous instances where something
similar happened, although in those cases I did not stop to record any
details. Have others seen this? Is there something we can/should do
about it?

FWIW, I've also seen delays with this test on a few occasions.
Thanks for mentioning it.
--
Michael

#3 Alexander Lakhin
exclusion@gmail.com
In reply to: Michael Paquier (#2)
1 attachment(s)
Re: subscription/026_stats test is intermittently slow?

Hello Michael and Robert,

20.04.2024 05:57, Michael Paquier wrote:

On Fri, Apr 19, 2024 at 01:57:41PM -0400, Robert Haas wrote:

It looks to me like in the first run it took 3 minutes for the
replay_lsn to catch up to the desired value, and in the second run,
two seconds. I think I have seen previous instances where something
similar happened, although in those cases I did not stop to record any
details. Have others seen this? Is there something we can/should do
about it?

FWIW, I've also seen delays with this test on a few occasions.
Thanks for mentioning it.

It reminds me of
/messages/by-id/858a7622-2c81-1687-d1df-1322dfcb2e72@gmail.com

At least, I could reproduce such a delay with the attached patch applied.

Best regards,
Alexander

Attachments:

026_stats-180.patch (text/x-patch; charset=UTF-8)
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 66070e9131..9368a3a897 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -227,6 +227,7 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
 
 		if (rc & WL_LATCH_SET)
 		{
+			pg_usleep(300000);	/* injected delay: widen the window before ResetLatch() absorbs concurrent sets */
 			ResetLatch(MyLatch);
 			CHECK_FOR_INTERRUPTS();
 		}
diff --git a/src/test/subscription/t/026_stats.pl b/src/test/subscription/t/026_stats.pl
index d1d68fad9a..a82f1c051c 100644
--- a/src/test/subscription/t/026_stats.pl
+++ b/src/test/subscription/t/026_stats.pl
@@ -37,6 +37,7 @@ sub create_sub_pub_w_errors
 		$db,
 		qq[
 	BEGIN;
+	SELECT pg_sleep(0.5); -- delay so the following CREATE SUBSCRIPTION lands while the launcher is still waiting for the previous worker to attach
 	CREATE TABLE $table_name(a int primary key);
 	INSERT INTO $table_name VALUES (1);
 	COMMIT;
#4 vignesh C
vignesh21@gmail.com
In reply to: Alexander Lakhin (#3)
Re: subscription/026_stats test is intermittently slow?

On Sat, 20 Apr 2024 at 10:30, Alexander Lakhin <exclusion@gmail.com> wrote:

Hello Michael and Robert,

20.04.2024 05:57, Michael Paquier wrote:

On Fri, Apr 19, 2024 at 01:57:41PM -0400, Robert Haas wrote:

It looks to me like in the first run it took 3 minutes for the
replay_lsn to catch up to the desired value, and in the second run,
two seconds. I think I have seen previous instances where something
similar happened, although in those cases I did not stop to record any
details. Have others seen this? Is there something we can/should do
about it?

FWIW, I've also seen delays with this test on a few occasions.
Thanks for mentioning it.

It reminds me of
/messages/by-id/858a7622-2c81-1687-d1df-1322dfcb2e72@gmail.com

Thanks Alexander for the test, I was able to reproduce the issue with
the test you shared and also verify that the patch at [1] fixes the
same. This is the same issue where the apply worker for test_tab2_sub
was getting started after 180 seconds because the common latch (which
is used to signal worker attach, subscription creation/modification,
and apply worker process exit) was getting reset when the other
subscription test_tab1_sub's worker got started. The same can be seen
from the logs:
2024-04-22 20:47:52.009 IST [323280] 026_stats.pl LOG: statement: BEGIN;
2024-04-22 20:47:52.009 IST [323280] 026_stats.pl LOG: statement:
SELECT pg_sleep(0.5);
2024-04-22 20:47:52.426 IST [323281] LOG: logical replication apply
worker for subscription "test_tab1_sub" has started
2024-04-22 20:47:52.511 IST [323280] 026_stats.pl LOG: statement:
CREATE TABLE test_tab2(a int primary key);
2024-04-22 20:47:52.518 IST [323280] 026_stats.pl LOG: statement:
INSERT INTO test_tab2 VALUES (1);
2024-04-22 20:47:52.519 IST [323280] 026_stats.pl LOG: statement: COMMIT;
2024-04-22 20:47:52.540 IST [323286] 026_stats.pl LOG: statement:
CREATE SUBSCRIPTION test_tab2_sub CONNECTION 'port=56685
host=/tmp/RwzpQrVMYH dbname=postgres' PUBLICATION test_tab2_pub
2024-04-22 20:50:52.658 IST [326265] LOG: logical replication apply
worker for subscription "test_tab2_sub" has started
2024-04-22 20:50:52.668 IST [326267] LOG: logical replication table
synchronization worker for subscription "test_tab2_sub", table
"test_tab2" has started

[1]: /messages/by-id/CALDaNm10R7L0Dxq+-J=Pp3AfM_yaokpbhECvJ69QiGH8-jQquw@mail.gmail.com
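
To make the race visible in the logs above more concrete, here is a minimal
self-contained sketch (hypothetical pthread code, not PostgreSQL's latch or
launcher implementation; names and timings are made up) of how one latch
shared by several events can have a wakeup absorbed: while the launcher-like
loop waits for the first worker to attach, it resets the latch, and a
"subscription created" signal that arrived in that window is consumed along
with it, leaving the outer loop to sleep its full naptime.

/*
 * latch_race.c: illustration only.  Hypothetical code, not PostgreSQL's
 * latch or launcher implementation.  One shared "latch" (a flag plus a
 * condition variable) signals two different events; resetting it while
 * waiting for event A loses event B, so the outer loop would go on to
 * sleep its full naptime (180s in the real launcher).
 *
 * Build with: cc -pthread latch_race.c
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static bool latch_set = false;			/* the single shared latch */
static bool worker1_attached = false;	/* event A */
static bool sub2_created = false;		/* event B: the one that gets lost */

/* Like SetLatch(): record an event and set the shared latch. */
static void
signal_event(bool *event)
{
	pthread_mutex_lock(&lock);
	*event = true;
	latch_set = true;
	pthread_cond_signal(&cond);
	pthread_mutex_unlock(&lock);
}

/* Like WaitForReplicationWorkerAttach(): wait until one event has happened. */
static void
wait_for_flag(bool *flag)
{
	pthread_mutex_lock(&lock);
	while (!*flag)
	{
		/* WaitLatch(): sleep until the latch is set */
		while (!latch_set)
			pthread_cond_wait(&cond, &lock);
		/* ResetLatch(): this can also consume the other event's wakeup */
		latch_set = false;
	}
	pthread_mutex_unlock(&lock);
}

static void *
backend_creating_sub2(void *arg)
{
	(void) arg;
	sleep(1);					/* CREATE SUBSCRIPTION arrives first... */
	signal_event(&sub2_created);
	return NULL;
}

static void *
worker1_attaching(void *arg)
{
	(void) arg;
	sleep(2);					/* ...and worker 1 attaches a bit later */
	signal_event(&worker1_attached);
	return NULL;
}

int
main(void)
{
	pthread_t	t1, t2;

	pthread_create(&t1, NULL, backend_creating_sub2, NULL);
	pthread_create(&t2, NULL, worker1_attaching, NULL);

	wait_for_flag(&worker1_attached);

	/*
	 * Back in the "main loop": event B has already happened, but the latch
	 * is reset, so a WaitLatch() with a 180s timeout would sleep the whole
	 * time before worker 2 gets started.
	 */
	pthread_mutex_lock(&lock);
	printf("sub2_created=%d latch_set=%d: would nap for the full cycle\n",
		   sub2_created, latch_set);
	pthread_mutex_unlock(&lock);

	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return 0;
}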

Regards,
Vignesh

#5 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#1)
Re: subscription/026_stats test is intermittently slow?

Hi,

On 2024-04-19 13:57:41 -0400, Robert Haas wrote:

Have others seen this? Is there something we can/should do about it?

Yes, I've also seen this - but never quite reproducible enough to properly
tackle it.

The first thing I'd like to do is to make the wait_for_catchup routine
regularly log the current state, so we can in retrospect analyze e.g. whether
there was continual, but slow, replay progress, or whether replay was entirely
stuck. wait_for_catchup() not being debuggable has been a problem in many
different tests, so I think it's high time to fix that.
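
As a rough illustration of that kind of progress reporting, the sketch below
is a hypothetical standalone libpq poller, not a change to the Perl
wait_for_catchup routine itself; the connection string, target LSN, and
application_name are simply the values from the logs quoted earlier in this
thread. It prints the walsender's replay_lsn on every poll, so a stuck wait
and a slowly advancing one would look different in the output.

/*
 * catchup_poll.c: hypothetical sketch only, not the test harness code.
 *
 * Build with: cc catchup_poll.c -I$(pg_config --includedir) \
 *                 -L$(pg_config --libdir) -lpq
 */
#include <libpq-fe.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	PGconn	   *conn =
		PQconnectdb("port=56685 host=/tmp/RwzpQrVMYH dbname=postgres");
	const char *sql =
		"SELECT replay_lsn, replay_lsn >= '0/151E8C8'::pg_lsn AS caught_up "
		"FROM pg_stat_replication WHERE application_name = 'test_tab2_sub'";

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		return 1;
	}

	for (;;)
	{
		PGresult   *res = PQexec(conn, sql);

		if (PQresultStatus(res) == PGRES_TUPLES_OK && PQntuples(res) == 1)
		{
			/* Log the current position on every iteration, not just at the end. */
			printf("replay_lsn = %s, caught_up = %s\n",
				   PQgetvalue(res, 0, 0), PQgetvalue(res, 0, 1));

			if (PQgetvalue(res, 0, 1)[0] == 't')
			{
				PQclear(res);
				break;
			}
		}
		else
			printf("no walsender row for test_tab2_sub yet\n");

		PQclear(res);
		sleep(2);
	}

	PQfinish(conn);
	return 0;
}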

Greetings,

Andres Freund

#6 Amit Kapila
amit.kapila16@gmail.com
In reply to: vignesh C (#4)
Re: subscription/026_stats test is intermittently slow?

On Tue, Apr 23, 2024 at 11:49 AM vignesh C <vignesh21@gmail.com> wrote:

On Sat, 20 Apr 2024 at 10:30, Alexander Lakhin <exclusion@gmail.com> wrote:

Hello Michael and Robert,

20.04.2024 05:57, Michael Paquier wrote:

On Fri, Apr 19, 2024 at 01:57:41PM -0400, Robert Haas wrote:

It looks to me like in the first run it took 3 minutes for the
replay_lsn to catch up to the desired value, and in the second run,
two seconds. I think I have seen previous instances where something
similar happened, although in those cases I did not stop to record any
details. Have others seen this? Is there something we can/should do
about it?

FWIW, I've also seen delays with this test on a few occasions.
Thanks for mentioning it.

It reminds me of
/messages/by-id/858a7622-2c81-1687-d1df-1322dfcb2e72@gmail.com

Thanks Alexander for the test, I was able to reproduce the issue with
the test you shared and also verify that the patch at [1] fixes the
same.

One of the issues reported in the thread you referred to has the same
symptoms [1]. I'll review and analyze your proposal.

[1]: /messages/by-id/858a7622-2c81-1687-d1df-1322dfcb2e72@gmail.com

--
With Regards,
Amit Kapila.