subscriptioncheck failure

Started by Amit Kapilaover 4 years ago13 messages
#1Amit Kapila
amit.kapila16@gmail.com

I noticed $SUBJECT in curculio.
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2021-05-11%2018%3A30%3A23

The failure test report:
t/020_messages.pl (Wstat: 7424 Tests: 1 Failed: 0)
Non-zero exit status: 29
Parse errors: Bad plan. You planned 5 tests but ran 1.
Files=21, Tests=149, 79 wallclock secs ( 0.03 usr 0.03 sys + 22.36
cusr 9.47 csys = 31.89 CPU)
Result: FAIL
Makefile:21: recipe for target 'check' failed
gmake: *** [check] Error 1

As far as I can understand the failure happens due to error:
2021-05-11 20:41:38.356 CEST [31177:4] 020_messages.pl ERROR:
replication slot "tap_sub" is active for PID 7354
2021-05-11 20:41:38.356 CEST [31177:5] 020_messages.pl STATEMENT:
SELECT get_byte(data, 1), encode(substr(data, 11, 8), 'escape')
FROM pg_logical_slot_peek_binary_changes('tap_sub', NULL, NULL,
'proto_version', '1',
'publication_names', 'tap_pub',
'messages', 'true')
OFFSET 1 LIMIT 1

As per my initial analysis, this happens because in the test after
creating a subscription, we are not waiting for the subscriber to
catch up. I think there is a narrow window where it is possible that
after creating a subscription, the apply worker starts replication
even after we have done Alter Subscription .. DISABLE. This is also
visible in buildfarm logs:

2021-05-11 20:41:38.345 CEST [4046:9] 020_messages.pl STATEMENT:
SELECT get_byte(data, 0)
FROM pg_logical_slot_peek_binary_changes('tap_sub', NULL, NULL,
'proto_version', '1',
'publication_names', 'tap_pub',
'messages', 'true')
...

2021-05-11 20:41:38.354 CEST [7354:6] tap_sub LOG: received
replication command: START_REPLICATION SLOT "tap_sub" LOGICAL 0/0
(proto_version '2', publication_names '"tap_pub"')

Notice that we have checked the messages before receiving the start
replication command from the subscriber. Now, if we would have waited
for the subscription to catch up then our check in pg_stat_replication
wouldn't have given a false positive. I have noticed that we do have
such wait in other subscription tests but somehow missed it here.

Thoughts?

--
With Regards,
Amit Kapila.

#2vignesh C
vignesh21@gmail.com
In reply to: Amit Kapila (#1)
1 attachment(s)
Re: subscriptioncheck failure

On Thu, May 13, 2021 at 3:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I noticed $SUBJECT in curculio.
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&amp;dt=2021-05-11%2018%3A30%3A23

The failure test report:
t/020_messages.pl (Wstat: 7424 Tests: 1 Failed: 0)
Non-zero exit status: 29
Parse errors: Bad plan. You planned 5 tests but ran 1.
Files=21, Tests=149, 79 wallclock secs ( 0.03 usr 0.03 sys + 22.36
cusr 9.47 csys = 31.89 CPU)
Result: FAIL
Makefile:21: recipe for target 'check' failed
gmake: *** [check] Error 1

As far as I can understand the failure happens due to error:
2021-05-11 20:41:38.356 CEST [31177:4] 020_messages.pl ERROR:
replication slot "tap_sub" is active for PID 7354
2021-05-11 20:41:38.356 CEST [31177:5] 020_messages.pl STATEMENT:
SELECT get_byte(data, 1), encode(substr(data, 11, 8), 'escape')
FROM pg_logical_slot_peek_binary_changes('tap_sub', NULL, NULL,
'proto_version', '1',
'publication_names', 'tap_pub',
'messages', 'true')
OFFSET 1 LIMIT 1

As per my initial analysis, this happens because in the test after
creating a subscription, we are not waiting for the subscriber to
catch up. I think there is a narrow window where it is possible that
after creating a subscription, the apply worker starts replication
even after we have done Alter Subscription .. DISABLE. This is also
visible in buildfarm logs:

2021-05-11 20:41:38.345 CEST [4046:9] 020_messages.pl STATEMENT:
SELECT get_byte(data, 0)
FROM pg_logical_slot_peek_binary_changes('tap_sub', NULL, NULL,
'proto_version', '1',
'publication_names', 'tap_pub',
'messages', 'true')
...

2021-05-11 20:41:38.354 CEST [7354:6] tap_sub LOG: received
replication command: START_REPLICATION SLOT "tap_sub" LOGICAL 0/0
(proto_version '2', publication_names '"tap_pub"')

Notice that we have checked the messages before receiving the start
replication command from the subscriber. Now, if we would have waited
for the subscription to catch up then our check in pg_stat_replication
wouldn't have given a false positive. I have noticed that we do have
such wait in other subscription tests but somehow missed it here.

I agree with your analysis, apart from this issue there is one other
small problem. Currently after disabling subscription
pg_stat_replication is used to verify if the walsender is exited. The
steps of the walsender exit includes cleaning up of walsender and then
releasing a replication slot. There will be random test failure if
pg_logical_slot_peek_binary_changes is called in this narrow time
window. I think it is better to fix it by checking the active column
in pg_replication_slot instead of pg_stat_replication which is more
reliable.
Attached patch which has the fixes for the same.
Thoughts?

Regards,
Vignesh

Attachments:

v1-0001-Messages-test-failure-fix.patchtext/x-patch; charset=US-ASCII; name=v1-0001-Messages-test-failure-fix.patchDownload
From 5cbfa5614db18925ca92c5e2c7310ff6520b8397 Mon Sep 17 00:00:00 2001
From: vignesh <vignesh21@gmail.com>
Date: Thu, 13 May 2021 16:00:43 +0530
Subject: [PATCH v1] Messages test failure fix.

This test was failing because there was no wait for catch up after
creating subscription. After disabling subscription pg_stat_replication
was used to verify if the walsender is exited. The steps of walsender
exit includes cleaning up of walsendr and then releasing replication
slot. There will be random test failure if pg_logical_slot_peek_binary_changes
is called in the narrow time window. Fixed it by checking
active column in pg_replication_slot instead of pg_stat_replication
which is more reliable.
---
 src/test/subscription/t/020_messages.pl | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/src/test/subscription/t/020_messages.pl b/src/test/subscription/t/020_messages.pl
index e5d48ec8a0..7b385b48e3 100644
--- a/src/test/subscription/t/020_messages.pl
+++ b/src/test/subscription/t/020_messages.pl
@@ -34,12 +34,14 @@ $node_subscriber->safe_psql('postgres',
 	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
 );
 
+$node_publisher->wait_for_catchup('tap_sub');
+
 # Ensure a transactional logical decoding message shows up on the slot
 $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
 
 # wait for the replication connection to drop from the publisher
 $node_publisher->poll_query_until('postgres',
-	'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0);
+	"SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1);
 
 $node_publisher->safe_psql('postgres',
 	"SELECT pg_logical_emit_message(true, 'pgoutput', 'a transactional message')"
@@ -94,7 +96,7 @@ $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
 
 # wait for the replication connection to drop from the publisher
 $node_publisher->poll_query_until('postgres',
-	'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0);
+	"SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1);
 
 $node_publisher->safe_psql('postgres', "INSERT INTO tab_test VALUES (1)");
 
@@ -122,7 +124,7 @@ $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
 
 # wait for the replication connection to drop from the publisher
 $node_publisher->poll_query_until('postgres',
-	'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0);
+	"SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1);
 
 # Ensure a non-transactional logical decoding message shows up on the slot when
 # it is emitted within an aborted transaction. The message won't emit until
-- 
2.25.1

#3Michael Paquier
michael@paquier.xyz
In reply to: vignesh C (#2)
Re: subscriptioncheck failure

On Thu, May 13, 2021 at 04:14:55PM +0530, vignesh C wrote:

+$node_publisher->wait_for_catchup('tap_sub');
+
# Ensure a transactional logical decoding message shows up on the slot
$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
# wait for the replication connection to drop from the publisher
$node_publisher->poll_query_until('postgres',
-	'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0);
+	"SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1);

There are three places in this test where a slot is disabled, followed
by a wait to make sure that the slot is gone. Perhaps it would be
better to wrap that in a small-ish routine?
--
Michael

#4vignesh C
vignesh21@gmail.com
In reply to: Michael Paquier (#3)
Re: subscriptioncheck failure

On Thu, May 13, 2021 at 4:41 PM Michael Paquier <michael@paquier.xyz> wrote:

On Thu, May 13, 2021 at 04:14:55PM +0530, vignesh C wrote:

+$node_publisher->wait_for_catchup('tap_sub');
+
# Ensure a transactional logical decoding message shows up on the slot
$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
# wait for the replication connection to drop from the publisher
$node_publisher->poll_query_until('postgres',
-     'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0);
+     "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1);

There are three places in this test where a slot is disabled, followed
by a wait to make sure that the slot is gone. Perhaps it would be
better to wrap that in a small-ish routine?

Yes that would be better, I will make the changes for this and post a patch.

Regards,
Vignesh

#5vignesh C
vignesh21@gmail.com
In reply to: Michael Paquier (#3)
1 attachment(s)
Re: subscriptioncheck failure

On Thu, May 13, 2021 at 4:41 PM Michael Paquier <michael@paquier.xyz> wrote:

On Thu, May 13, 2021 at 04:14:55PM +0530, vignesh C wrote:

+$node_publisher->wait_for_catchup('tap_sub');
+
# Ensure a transactional logical decoding message shows up on the slot
$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
# wait for the replication connection to drop from the publisher
$node_publisher->poll_query_until('postgres',
-     'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0);
+     "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1);

There are three places in this test where a slot is disabled, followed
by a wait to make sure that the slot is gone. Perhaps it would be
better to wrap that in a small-ish routine?

Thanks for the comments, Please find the attached patch having the changes.

Regards,
Vignesh

Attachments:

v2-0001-Messages-test-failure-fix.patchtext/x-patch; charset=US-ASCII; name=v2-0001-Messages-test-failure-fix.patchDownload
From 495b48981f18a5e05453cefeb80838fb60982e0f Mon Sep 17 00:00:00 2001
From: vignesh <vignesh21@gmail.com>
Date: Thu, 13 May 2021 16:00:43 +0530
Subject: [PATCH v2] Messages test failure fix.

This test was failing because there was no wait for catch up after
creating subscription. After disabling subscription pg_stat_replication
was used to verify if the walsender is exited. The steps of walsender
exit includes cleaning up of walsendr and then releasing replication
slot. There will be random test failure if pg_logical_slot_peek_binary_changes
is called in the narrow time window. Fixed it by checking
active column in pg_replication_slot instead of pg_stat_replication
which is more reliable.
---
 src/test/subscription/t/020_messages.pl | 60 +++++++++++++------------
 1 file changed, 32 insertions(+), 28 deletions(-)

diff --git a/src/test/subscription/t/020_messages.pl b/src/test/subscription/t/020_messages.pl
index 0940d0f45f..f1b2642640 100644
--- a/src/test/subscription/t/020_messages.pl
+++ b/src/test/subscription/t/020_messages.pl
@@ -8,6 +8,35 @@ use PostgresNode;
 use TestLib;
 use Test::More tests => 5;
 
+# Setup the subscription before checking pg_logical_slot_peek_binary_changes
+sub setup_subscription
+{
+	my $node_publisher = shift;
+	my $node_subscriber = shift;
+	my $create_subscription = shift;
+
+	if ($create_subscription == 1)
+	{
+		my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+		$node_subscriber->safe_psql('postgres',
+			"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub");
+	}
+	else
+	{
+		$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub ENABLE");
+	}
+
+	$node_publisher->wait_for_catchup('tap_sub');
+
+	# Ensure a transactional logical decoding message shows up on the slot
+	$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
+
+	# wait for the replication connection to drop from the publisher
+	$node_publisher->poll_query_until('postgres',
+		"SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1);
+	return;
+}
+
 # Create publisher node
 my $node_publisher = get_new_node('publisher');
 $node_publisher->init(allows_streaming => 'logical');
@@ -27,20 +56,10 @@ $node_subscriber->safe_psql('postgres',
 	"CREATE TABLE tab_test (a int primary key)");
 
 # Setup logical replication
-my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR TABLE tab_test");
 
-$node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
-);
-
-# Ensure a transactional logical decoding message shows up on the slot
-$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
-
-# wait for the replication connection to drop from the publisher
-$node_publisher->poll_query_until('postgres',
-	'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0);
+setup_subscription($node_publisher, $node_subscriber, 1);
 
 $node_publisher->safe_psql('postgres',
 	"SELECT pg_logical_emit_message(true, 'pgoutput', 'a transactional message')"
@@ -88,15 +107,7 @@ is( $result, qq(66
 	'option messages defaults to false so message (M) is not available on slot'
 );
 
-$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub ENABLE");
-$node_publisher->wait_for_catchup('tap_sub');
-
-# ensure a non-transactional logical decoding message shows up on the slot
-$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
-
-# wait for the replication connection to drop from the publisher
-$node_publisher->poll_query_until('postgres',
-	'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0);
+setup_subscription($node_publisher, $node_subscriber, 0);
 
 $node_publisher->safe_psql('postgres', "INSERT INTO tab_test VALUES (1)");
 
@@ -118,14 +129,7 @@ $result = $node_publisher->safe_psql(
 
 is($result, qq(77|0), 'non-transactional message on slot is M');
 
-$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub ENABLE");
-$node_publisher->wait_for_catchup('tap_sub');
-
-$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
-
-# wait for the replication connection to drop from the publisher
-$node_publisher->poll_query_until('postgres',
-	'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0);
+setup_subscription($node_publisher, $node_subscriber, 0);
 
 # Ensure a non-transactional logical decoding message shows up on the slot when
 # it is emitted within an aborted transaction. The message won't emit until
-- 
2.25.1

#6Michael Paquier
michael@paquier.xyz
In reply to: vignesh C (#5)
Re: subscriptioncheck failure

On Thu, May 13, 2021 at 07:05:55PM +0530, vignesh C wrote:

Thanks for the comments, Please find the attached patch having the changes.

Cool, thanks for the new version. I have spent some time
understanding the initial report from Amit as well as what you are
proposing here, and refactoring the test so as the set of CREATE/ALTER
SUBSCRIPTION commands are added within this routine is a good idea.

I would have made the comment on top of setup_subscription a bit more
talkative regarding the fact that it may reuse an existing
subscription, but that's a nit. Let's wait for Amit and see what he
thinks about what you are proposing.
--
Michael

#7Amit Kapila
amit.kapila16@gmail.com
In reply to: vignesh C (#5)
Re: subscriptioncheck failure

On Thu, May 13, 2021 at 7:06 PM vignesh C <vignesh21@gmail.com> wrote:

On Thu, May 13, 2021 at 4:41 PM Michael Paquier <michael@paquier.xyz> wrote:

On Thu, May 13, 2021 at 04:14:55PM +0530, vignesh C wrote:

+$node_publisher->wait_for_catchup('tap_sub');
+
# Ensure a transactional logical decoding message shows up on the slot
$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
# wait for the replication connection to drop from the publisher
$node_publisher->poll_query_until('postgres',
-     'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0);
+     "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1);

There are three places in this test where a slot is disabled, followed
by a wait to make sure that the slot is gone. Perhaps it would be
better to wrap that in a small-ish routine?

Thanks for the comments, Please find the attached patch having the changes.

Few comments:
1.
+ # Ensure a transactional logical decoding message shows up on the slot
+ $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");

After you have encapsulated this command in the function, the above
comment doesn't make sense because we do this for both transactional
and non-transactional messages. I suggest we can change it to
something like: "This is done to ensure a logical decoding message is
shown up on the slot".

2.
+# Setup the subscription before checking pg_logical_slot_peek_binary_changes
+sub setup_subscription

I think here the functionality is more for the catchup of
subscription, so it might be better to name the function as
subscription_catchup or catchup_subscription. I think you can expand
the comments atop this function a bit as suggested by Michael.

--
With Regards,
Amit Kapila.

#8Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#7)
Re: subscriptioncheck failure

On Mon, May 17, 2021 at 9:06 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, May 13, 2021 at 7:06 PM vignesh C <vignesh21@gmail.com> wrote:

On Thu, May 13, 2021 at 4:41 PM Michael Paquier <michael@paquier.xyz> wrote:

Few comments:
1.
+ # Ensure a transactional logical decoding message shows up on the slot
+ $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");

After you have encapsulated this command in the function, the above
comment doesn't make sense because we do this for both transactional
and non-transactional messages. I suggest we can change it to
something like: "This is done to ensure a logical decoding message is
shown up on the slot".

2.
+# Setup the subscription before checking pg_logical_slot_peek_binary_changes
+sub setup_subscription

I think here the functionality is more for the catchup of
subscription, so it might be better to name the function as
subscription_catchup or catchup_subscription. I think you can expand
the comments atop this function a bit as suggested by Michael.

One more point:
+ $node_publisher->wait_for_catchup('tap_sub');
+
+ # Ensure a transactional logical decoding message shows up on the slot
+ $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
+
+ # wait for the replication connection to drop from the publisher
+ $node_publisher->poll_query_until('postgres',
+ "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE
slot_name = 'tap_sub' AND active='f'", 1);

In the above sequence, wait_for_catchup will query pg_stat_replication
whereas after disabling subscription we are checking
pg_replication_slots. I understand from your explanation why we can't
rely on pg_stat_replication after DISABLE but it might be better to
check that the slot is active before disabling it. I think currently,
the test assumes that, isn't it better to have an explicit check for
that?

--
With Regards,
Amit Kapila.

#9vignesh C
vignesh21@gmail.com
In reply to: Amit Kapila (#8)
1 attachment(s)
Re: subscriptioncheck failure

On Mon, May 17, 2021 at 10:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, May 17, 2021 at 9:06 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, May 13, 2021 at 7:06 PM vignesh C <vignesh21@gmail.com> wrote:

On Thu, May 13, 2021 at 4:41 PM Michael Paquier <michael@paquier.xyz> wrote:

Few comments:
1.
+ # Ensure a transactional logical decoding message shows up on the slot
+ $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");

After you have encapsulated this command in the function, the above
comment doesn't make sense because we do this for both transactional
and non-transactional messages. I suggest we can change it to
something like: "This is done to ensure a logical decoding message is
shown up on the slot".

2.
+# Setup the subscription before checking pg_logical_slot_peek_binary_changes
+sub setup_subscription

I think here the functionality is more for the catchup of
subscription, so it might be better to name the function as
subscription_catchup or catchup_subscription. I think you can expand
the comments atop this function a bit as suggested by Michael.

One more point:
+ $node_publisher->wait_for_catchup('tap_sub');
+
+ # Ensure a transactional logical decoding message shows up on the slot
+ $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
+
+ # wait for the replication connection to drop from the publisher
+ $node_publisher->poll_query_until('postgres',
+ "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE
slot_name = 'tap_sub' AND active='f'", 1);

In the above sequence, wait_for_catchup will query pg_stat_replication
whereas after disabling subscription we are checking
pg_replication_slots. I understand from your explanation why we can't
rely on pg_stat_replication after DISABLE but it might be better to
check that the slot is active before disabling it. I think currently,
the test assumes that, isn't it better to have an explicit check for
that?

I felt this is not required, wait_for_catchup will poll_query_until
the state = 'streaming', even if START_REPLICATION takes time, state
will be in 'startup' state, this way poll_query_until will take care
of handling this.

On further analysis I found that we need to do "Alter subscription
tap_sub ENABLE" and "ALTER subscription tap_sub DISABLE" multiple
time, Instead we can change pg_logical_slot_peek_binary_changes to
pg_logical_slot_get_binary_changes at appropriate steps. This way the
common function can be removed and the enable/disable multiple times
can be removed. If we are going ahead with this approach the above
comments provided are no more valid.

I have made the changes in similar lines in the attached patch. If you
are ok we can go ahead with the new approach which will simplify the
changes required.
Thoughts?

Regards,
Vignesh

Attachments:

v3-0001-Messages-test-failure-fix.patchapplication/x-patch; name=v3-0001-Messages-test-failure-fix.patchDownload
From 33fdfd312a9b652a40891c09dc9e075d2dd41e44 Mon Sep 17 00:00:00 2001
From: vignesh <vignesh21@gmail.com>
Date: Mon, 17 May 2021 17:23:35 +0530
Subject: [PATCH v3] Messages test failure fix.

This test was failing because there was no wait for catch up after
creating subscription. After disabling subscription pg_stat_replication
was used to verify if the walsender is exited. The steps of walsender
exit includes cleaning up of walsendr and then releasing replication
slot. There will be random test failure if pg_logical_slot_peek_binary_changes
is called in the narrow time window. Fixed it by checking
active column in pg_replication_slot instead of pg_stat_replication
which is more reliable.
---
 src/test/subscription/t/020_messages.pl | 27 +++++--------------------
 1 file changed, 5 insertions(+), 22 deletions(-)

diff --git a/src/test/subscription/t/020_messages.pl b/src/test/subscription/t/020_messages.pl
index 0940d0f45f..6a76d96d12 100644
--- a/src/test/subscription/t/020_messages.pl
+++ b/src/test/subscription/t/020_messages.pl
@@ -35,12 +35,14 @@ $node_subscriber->safe_psql('postgres',
 	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
 );
 
+$node_publisher->wait_for_catchup('tap_sub');
+
 # Ensure a transactional logical decoding message shows up on the slot
 $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
 
 # wait for the replication connection to drop from the publisher
 $node_publisher->poll_query_until('postgres',
-	'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0);
+	"SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1);
 
 $node_publisher->safe_psql('postgres',
 	"SELECT pg_logical_emit_message(true, 'pgoutput', 'a transactional message')"
@@ -77,7 +79,7 @@ is($result, qq(1|pgoutput),
 $result = $node_publisher->safe_psql(
 	'postgres', qq(
 		SELECT get_byte(data, 0)
-		FROM pg_logical_slot_peek_binary_changes('tap_sub', NULL, NULL,
+		FROM pg_logical_slot_get_binary_changes('tap_sub', NULL, NULL,
 			'proto_version', '1',
 			'publication_names', 'tap_pub')
 ));
@@ -88,16 +90,6 @@ is( $result, qq(66
 	'option messages defaults to false so message (M) is not available on slot'
 );
 
-$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub ENABLE");
-$node_publisher->wait_for_catchup('tap_sub');
-
-# ensure a non-transactional logical decoding message shows up on the slot
-$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
-
-# wait for the replication connection to drop from the publisher
-$node_publisher->poll_query_until('postgres',
-	'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0);
-
 $node_publisher->safe_psql('postgres', "INSERT INTO tab_test VALUES (1)");
 
 my $message_lsn = $node_publisher->safe_psql('postgres',
@@ -109,7 +101,7 @@ $node_publisher->safe_psql('postgres', "INSERT INTO tab_test VALUES (2)");
 $result = $node_publisher->safe_psql(
 	'postgres', qq(
 		SELECT get_byte(data, 0), get_byte(data, 1)
-		FROM pg_logical_slot_peek_binary_changes('tap_sub', NULL, NULL,
+		FROM pg_logical_slot_get_binary_changes('tap_sub', NULL, NULL,
 			'proto_version', '1',
 			'publication_names', 'tap_pub',
 			'messages', 'true')
@@ -118,15 +110,6 @@ $result = $node_publisher->safe_psql(
 
 is($result, qq(77|0), 'non-transactional message on slot is M');
 
-$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub ENABLE");
-$node_publisher->wait_for_catchup('tap_sub');
-
-$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
-
-# wait for the replication connection to drop from the publisher
-$node_publisher->poll_query_until('postgres',
-	'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0);
-
 # Ensure a non-transactional logical decoding message shows up on the slot when
 # it is emitted within an aborted transaction. The message won't emit until
 # something advances the LSN, hence, we intentionally forces the server to
-- 
2.25.1

#10Amit Kapila
amit.kapila16@gmail.com
In reply to: vignesh C (#9)
Re: subscriptioncheck failure

On Mon, May 17, 2021 at 5:48 PM vignesh C <vignesh21@gmail.com> wrote:

On Mon, May 17, 2021 at 10:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

One more point:
+ $node_publisher->wait_for_catchup('tap_sub');
+
+ # Ensure a transactional logical decoding message shows up on the slot
+ $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
+
+ # wait for the replication connection to drop from the publisher
+ $node_publisher->poll_query_until('postgres',
+ "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE
slot_name = 'tap_sub' AND active='f'", 1);

In the above sequence, wait_for_catchup will query pg_stat_replication
whereas after disabling subscription we are checking
pg_replication_slots. I understand from your explanation why we can't
rely on pg_stat_replication after DISABLE but it might be better to
check that the slot is active before disabling it. I think currently,
the test assumes that, isn't it better to have an explicit check for
that?

I felt this is not required, wait_for_catchup will poll_query_until
the state = 'streaming', even if START_REPLICATION takes time, state
will be in 'startup' state, this way poll_query_until will take care
of handling this.

makes sense, but let's add some comments to clarify the same.

On further analysis I found that we need to do "Alter subscription
tap_sub ENABLE" and "ALTER subscription tap_sub DISABLE" multiple
time, Instead we can change pg_logical_slot_peek_binary_changes to
pg_logical_slot_get_binary_changes at appropriate steps. This way the
common function can be removed and the enable/disable multiple times
can be removed.

I think that is a valid point. This was probably kept so that we can
peek multiple times for the same message to test various things but
that can be achieved with the way you have changed the test.

One more thing, shouldn't we make auto_vacuum=off for this test by
using 'append_conf' before starting the publisher. That will avoid the
risk of empty transactions.

--
With Regards,
Amit Kapila.

#11vignesh C
vignesh21@gmail.com
In reply to: Amit Kapila (#10)
1 attachment(s)
Re: subscriptioncheck failure

On Tue, May 18, 2021 at 9:09 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, May 17, 2021 at 5:48 PM vignesh C <vignesh21@gmail.com> wrote:

On Mon, May 17, 2021 at 10:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

One more point:
+ $node_publisher->wait_for_catchup('tap_sub');
+
+ # Ensure a transactional logical decoding message shows up on the slot
+ $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
+
+ # wait for the replication connection to drop from the publisher
+ $node_publisher->poll_query_until('postgres',
+ "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE
slot_name = 'tap_sub' AND active='f'", 1);

In the above sequence, wait_for_catchup will query pg_stat_replication
whereas after disabling subscription we are checking
pg_replication_slots. I understand from your explanation why we can't
rely on pg_stat_replication after DISABLE but it might be better to
check that the slot is active before disabling it. I think currently,
the test assumes that, isn't it better to have an explicit check for
that?

I felt this is not required, wait_for_catchup will poll_query_until
the state = 'streaming', even if START_REPLICATION takes time, state
will be in 'startup' state, this way poll_query_until will take care
of handling this.

makes sense, but let's add some comments to clarify the same.

Modified.

On further analysis I found that we need to do "Alter subscription
tap_sub ENABLE" and "ALTER subscription tap_sub DISABLE" multiple
time, Instead we can change pg_logical_slot_peek_binary_changes to
pg_logical_slot_get_binary_changes at appropriate steps. This way the
common function can be removed and the enable/disable multiple times
can be removed.

I think that is a valid point. This was probably kept so that we can
peek multiple times for the same message to test various things but
that can be achieved with the way you have changed the test.

One more thing, shouldn't we make auto_vacuum=off for this test by
using 'append_conf' before starting the publisher. That will avoid the
risk of empty transactions.

I felt that makes sense, added it.

Thanks for the comments, the attached patch has the changes for the same.

Regards,
Vignesh

Attachments:

v4-0001-Messages-test-failure-fix.patchtext/x-patch; charset=US-ASCII; name=v4-0001-Messages-test-failure-fix.patchDownload
From 7617be76996efd17f46cee16f9a9a18a23183f19 Mon Sep 17 00:00:00 2001
From: vignesh <vignesh21@gmail.com>
Date: Mon, 17 May 2021 17:23:35 +0530
Subject: [PATCH v4] Messages test failure fix.

This test was failing because there was no wait for catch up after
creating subscription. After disabling subscription pg_stat_replication
was used to verify if the walsender is exited. The steps of walsender
exit includes cleaning up of walsendr and then releasing replication
slot. There will be random test failure if pg_logical_slot_peek_binary_changes
is called in the narrow time window. Fixed it by checking
active column in pg_replication_slot instead of pg_stat_replication
which is more reliable.
---
 src/test/subscription/t/020_messages.pl | 32 ++++++++-----------------
 1 file changed, 10 insertions(+), 22 deletions(-)

diff --git a/src/test/subscription/t/020_messages.pl b/src/test/subscription/t/020_messages.pl
index 0940d0f45f..aaafc9f11d 100644
--- a/src/test/subscription/t/020_messages.pl
+++ b/src/test/subscription/t/020_messages.pl
@@ -11,6 +11,8 @@ use Test::More tests => 5;
 # Create publisher node
 my $node_publisher = get_new_node('publisher');
 $node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	'autovacuum = off');
 $node_publisher->start;
 
 # Create subscriber node
@@ -35,12 +37,17 @@ $node_subscriber->safe_psql('postgres',
 	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
 );
 
+# Here wait_for_catchup is used intentionally instead of pg_replication_slots
+# as wait_for_catchup waits on poll_query_until until the state becomes
+# 'streaming', which will make sure that the replication slot is active.
+$node_publisher->wait_for_catchup('tap_sub');
+
 # Ensure a transactional logical decoding message shows up on the slot
 $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
 
 # wait for the replication connection to drop from the publisher
 $node_publisher->poll_query_until('postgres',
-	'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0);
+	"SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1);
 
 $node_publisher->safe_psql('postgres',
 	"SELECT pg_logical_emit_message(true, 'pgoutput', 'a transactional message')"
@@ -77,7 +84,7 @@ is($result, qq(1|pgoutput),
 $result = $node_publisher->safe_psql(
 	'postgres', qq(
 		SELECT get_byte(data, 0)
-		FROM pg_logical_slot_peek_binary_changes('tap_sub', NULL, NULL,
+		FROM pg_logical_slot_get_binary_changes('tap_sub', NULL, NULL,
 			'proto_version', '1',
 			'publication_names', 'tap_pub')
 ));
@@ -88,16 +95,6 @@ is( $result, qq(66
 	'option messages defaults to false so message (M) is not available on slot'
 );
 
-$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub ENABLE");
-$node_publisher->wait_for_catchup('tap_sub');
-
-# ensure a non-transactional logical decoding message shows up on the slot
-$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
-
-# wait for the replication connection to drop from the publisher
-$node_publisher->poll_query_until('postgres',
-	'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0);
-
 $node_publisher->safe_psql('postgres', "INSERT INTO tab_test VALUES (1)");
 
 my $message_lsn = $node_publisher->safe_psql('postgres',
@@ -109,7 +106,7 @@ $node_publisher->safe_psql('postgres', "INSERT INTO tab_test VALUES (2)");
 $result = $node_publisher->safe_psql(
 	'postgres', qq(
 		SELECT get_byte(data, 0), get_byte(data, 1)
-		FROM pg_logical_slot_peek_binary_changes('tap_sub', NULL, NULL,
+		FROM pg_logical_slot_get_binary_changes('tap_sub', NULL, NULL,
 			'proto_version', '1',
 			'publication_names', 'tap_pub',
 			'messages', 'true')
@@ -118,15 +115,6 @@ $result = $node_publisher->safe_psql(
 
 is($result, qq(77|0), 'non-transactional message on slot is M');
 
-$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub ENABLE");
-$node_publisher->wait_for_catchup('tap_sub');
-
-$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
-
-# wait for the replication connection to drop from the publisher
-$node_publisher->poll_query_until('postgres',
-	'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0);
-
 # Ensure a non-transactional logical decoding message shows up on the slot when
 # it is emitted within an aborted transaction. The message won't emit until
 # something advances the LSN, hence, we intentionally forces the server to
-- 
2.25.1

#12Amit Kapila
amit.kapila16@gmail.com
In reply to: vignesh C (#11)
Re: subscriptioncheck failure

On Tue, May 18, 2021 at 11:25 AM vignesh C <vignesh21@gmail.com> wrote:

Thanks for the comments, the attached patch has the changes for the same.

Thanks, I have pushed your patch after making minor changes in the comments.

--
With Regards,
Amit Kapila.

#13vignesh C
vignesh21@gmail.com
In reply to: Amit Kapila (#12)
Re: subscriptioncheck failure

On Wed, May 19, 2021 at 10:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, May 18, 2021 at 11:25 AM vignesh C <vignesh21@gmail.com> wrote:

Thanks for the comments, the attached patch has the changes for the same.

Thanks, I have pushed your patch after making minor changes in the comments.

Thanks for pushing this patch.

Regards,
Vignesh