Add two missing tests in 035_standby_logical_decoding.pl

Started by Drouvot, Bertrand, almost 3 years ago. 57 messages.
#1 Drouvot, Bertrand
bertranddrouvot.pg@gmail.com
1 attachment(s)

hi hackers,

In the logical decoding on standby thread [1], Andres proposed 2 new tests (that I did
not find the time to complete before the finish line):

- Test that we can subscribe to the standby (with the publication created on the primary)
- Verify that invalidated logical slots do not lead to retaining WAL

Please find those 2 missing tests in the patch proposal attached.

A few words about them:

1) Regarding the subscription test:

It modifies wait_for_catchup() to take into account the case where the requesting
node is in recovery mode. Indeed, without that change, wait_for_subscription_sync() was
failing with:

"
error running SQL: 'psql:<stdin>:1: ERROR: recovery is in progress
HINT: WAL control functions cannot be executed during recovery.'
while running 'psql -XAtq -d port=61441 host=/tmp/45dt3wqs2p dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'SELECT pg_current_wal_lsn()'
"

2) Regarding the WAL file not retained test:

As it's not possible to execute pg_switch_wal() and friends on a standby, this is
done on the primary. Also, checking that the WAL file (linked to the restart_lsn of an
invalidated slot) has been removed is done directly at the OS/directory level.
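
Concretely, the check boils down to something like this (a condensed sketch of what
the attached test does, assuming $restart_lsn was read from pg_replication_slots on
the standby):

"
# Map the invalidated slot's restart_lsn to a WAL segment name...
my $walfile_name = $node_primary->safe_psql('postgres',
	"SELECT pg_walfile_name('$restart_lsn')");
chomp($walfile_name);
# ...and, after enough pg_switch_wal() calls and a checkpoint, verify the
# segment is gone at the directory level
ok(!-f $node_primary->data_dir . '/pg_wal/' . $walfile_name,
	"WAL file not on the primary anymore");
"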

The attached patch also removes:

"
-log_min_messages = 'debug2'
-log_error_verbosity = verbose
"

as also discussed in [1].

I'm not sure if adding those 2 tests should be considered as an open item. I can add this open item
if we think that makes sense. I'd be happy to do so but it looks like I don't have the privileges
to edit https://wiki.postgresql.org/wiki/PostgreSQL_16_Open_Items

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

[1]: /messages/by-id/6d801661-e21b-7326-be1b-f90d904da66a@gmail.com

Attachments:

v1-0001-add-missing-tests.patch (text/plain)
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 6f7f4e5de4..819667d42a 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2644,7 +2644,16 @@ sub wait_for_catchup
 	}
 	if (!defined($target_lsn))
 	{
-		$target_lsn = $self->lsn('write');
+		my $isrecovery = $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
+		chomp($isrecovery);
+		if ($isrecovery eq 't')
+		{
+			$target_lsn = $self->lsn('replay');
+		}
+		else
+		{
+			$target_lsn = $self->lsn('write');
+		}
 	}
 	print "Waiting for replication conn "
 	  . $standby_name . "'s "
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 94a8384c31..09bc417356 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -9,13 +9,19 @@ use warnings;
 use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
+use Time::HiRes qw(usleep);
 
-my ($stdin, $stdout, $stderr, $cascading_stdout, $cascading_stderr, $ret, $handle, $slot);
+my ($stdin,             $stdout,            $stderr,
+	$cascading_stdout,  $cascading_stderr,  $subscriber_stdin,
+	$subscriber_stdout, $subscriber_stderr, $ret,
+	$handle,            $slot);
 
 my $node_primary = PostgreSQL::Test::Cluster->new('primary');
 my $node_standby = PostgreSQL::Test::Cluster->new('standby');
 my $node_cascading_standby = PostgreSQL::Test::Cluster->new('cascading_standby');
+my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
 my $default_timeout = $PostgreSQL::Test::Utils::timeout_default;
+my $psql_timeout    = IPC::Run::timer(2 * $default_timeout);
 my $res;
 
 # Name for the physical slot on primary
@@ -235,8 +241,6 @@ $node_primary->append_conf('postgresql.conf', q{
 wal_level = 'logical'
 max_replication_slots = 4
 max_wal_senders = 4
-log_min_messages = 'debug2'
-log_error_verbosity = verbose
 });
 $node_primary->dump_info;
 $node_primary->start;
@@ -269,7 +273,8 @@ $node_standby->init_from_backup(
 	has_streaming => 1,
 	has_restoring => 1);
 $node_standby->append_conf('postgresql.conf',
-	qq[primary_slot_name = '$primary_slotname']);
+	qq[primary_slot_name = '$primary_slotname'
+       max_replication_slots = 5]);
 $node_standby->start;
 $node_primary->wait_for_replay_catchup($node_standby);
 $node_standby->safe_psql('testdb', qq[SELECT * FROM pg_create_physical_replication_slot('$standby_physical_slotname');]);
@@ -287,6 +292,27 @@ $node_cascading_standby->append_conf('postgresql.conf',
 $node_cascading_standby->start;
 $node_standby->wait_for_replay_catchup($node_cascading_standby, $node_primary);
 
+#######################
+# Initialize subscriber node
+#######################
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', 'max_replication_slots = 4');
+$node_subscriber->start;
+
+my %psql_subscriber = (
+	'subscriber_stdin'  => '',
+	'subscriber_stdout' => '',
+	'subscriber_stderr' => '');
+$psql_subscriber{run} = IPC::Run::start(
+	[ 'psql', '-XA', '-f', '-', '-d', $node_subscriber->connstr('postgres') ],
+	'<',
+	\$psql_subscriber{subscriber_stdin},
+	'>',
+	\$psql_subscriber{subscriber_stdout},
+	'2>',
+	\$psql_subscriber{subscriber_stderr},
+	$psql_timeout);
+
 ##################################################
 # Test that logical decoding on the standby
 # behaves correctly.
@@ -367,6 +393,58 @@ is( $node_primary->psql(
     3,
     'replaying logical slot from another database fails');
 
+##################################################
+# Test that we can subscribe on the standby with the publication
+# created on the primary.
+##################################################
+
+# Create a table on the primary
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE tab_rep (a int primary key)");
+
+# Create a table (same structure) on the subscriber node
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_rep (a int primary key)");
+
+# Create a publication on the primary
+$node_primary->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub for table tab_rep");
+
+# Subscribe on the standby
+my $standby_connstr = $node_standby->connstr . ' dbname=postgres';
+$psql_subscriber{subscriber_stdin} .=
+  qq[CREATE SUBSCRIPTION tap_sub CONNECTION '$standby_connstr' PUBLICATION tap_pub;];
+$psql_subscriber{subscriber_stdin} .= "\n";
+
+$psql_subscriber{run}->pump_nb();
+
+# Speed up the subscription creation
+$node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()");
+
+# Explicitly shut down psql instance gracefully - to avoid hangs
+# or worse on windows
+$psql_subscriber{subscriber_stdin} .= "\\q\n";
+$psql_subscriber{run}->finish;
+
+# Insert some rows on the primary
+$node_primary->safe_psql('postgres',
+	qq[INSERT INTO tab_rep select generate_series(1,10);]);
+
+$node_primary->wait_for_replay_catchup($node_standby);
+
+# To speed up the wait_for_subscription_sync
+$node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()");
+$node_subscriber->wait_for_subscription_sync($node_standby, 'tap_sub');
+
+# Check that the subscriber can see the rows inserted in the primary
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_rep");
+is($result, qq(10), 'check replicated inserts after subscription on standby');
+
+# We do not need the subscriber anymore
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+$node_subscriber->stop;
+
 ##################################################
 # Recovery conflict: Invalidate conflicting slots, including in-use slots
 # Scenario 1: hot_standby_feedback off and vacuum FULL
@@ -410,9 +488,82 @@ $node_standby->restart;
 check_slots_conflicting_status(1);
 
 ##################################################
-# Verify that invalidated logical slots do not lead to retaining WAL
+# Verify that invalidated logical slots do not lead to retaining WAL.
 ##################################################
-# XXXXX TODO
+
+# Get the restart_lsn from an invalidated slot
+my $restart_lsn = $node_standby->safe_psql('postgres',
+	"SELECT restart_lsn from pg_replication_slots WHERE slot_name = 'vacuum_full_activeslot' and conflicting is true;"
+);
+
+chomp($restart_lsn);
+
+# Get the WAL file name associated to this lsn on the primary
+my $walfile_name = $node_primary->safe_psql('postgres',
+	"SELECT pg_walfile_name('$restart_lsn')");
+
+chomp($walfile_name);
+
+# Check the WAL file is still on the primary
+ok(-f $node_primary->data_dir . '/pg_wal/' . $walfile_name,
+	"WAL file still on the primary");
+
+# Get the number of WAL files on the standby
+my $nb_standby_files = $node_standby->safe_psql('postgres',
+	"SELECT COUNT(*) FROM pg_ls_dir('pg_wal')");
+
+chomp($nb_standby_files);
+
+# Switch WAL files on the primary
+my @c = (1 .. $nb_standby_files);
+
+$node_primary->safe_psql('postgres', "create table retain_test(a int)");
+
+for (@c)
+{
+	$node_primary->safe_psql(
+		'postgres', "SELECT pg_switch_wal();
+                                          insert into retain_test values("
+		  . $_ . ");");
+}
+
+# Ask for a checkpoint
+$node_primary->safe_psql('postgres', 'checkpoint;');
+
+# Check that the WAL file has not been retained on the primary
+ok(!-f $node_primary->data_dir . '/pg_wal/' . $walfile_name,
+	"WAL file not on the primary anymore");
+
+# Wait for the standby to catch up
+$node_primary->wait_for_catchup($node_standby);
+
+# Generate another WAL switch, more activity and a checkpoint
+$node_primary->safe_psql(
+	'postgres', "SELECT pg_switch_wal();
+                                         insert into retain_test values(1);");
+$node_primary->safe_psql('postgres', 'checkpoint;');
+
+# Wait for the standby to catch up
+$node_primary->wait_for_catchup($node_standby);
+
+# Verify that the wal file has not been retained on the standby
+my $standby_walfile = $node_standby->data_dir . '/pg_wal/' . $walfile_name;
+
+# We can not test if the WAL file still exists immediately.
+# We need to let some time to the standby to actually "remove" it.
+my $i = 0;
+while (1)
+{
+	last if !-f $standby_walfile;
+	if ($i++ == 10 * $default_timeout)
+	{
+		die
+		  "could not determine if WAL file has been retained or not, can't continue";
+	}
+	usleep(100_000);
+}
+
+ok(1, "invalidated logical slots do not lead to retaining WAL");
 
 ##################################################
 # Recovery conflict: Invalidate conflicting slots, including in-use slots
#2 Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Drouvot, Bertrand (#1)
Re: Add two missing tests in 035_standby_logical_decoding.pl

On 2023-Apr-12, Drouvot, Bertrand wrote:

I'm not sure if adding those 2 tests should be considered as an open
item. I can add this open item if we think that makes sense. I'd be
happy to do so but it looks like I don't have the privileges to edit
https://wiki.postgresql.org/wiki/PostgreSQL_16_Open_Items

I think adding extra tests for new code can definitely be considered an
open item, since those tests might help to discover issues in said new
code.

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/

#3 Drouvot, Bertrand
bertranddrouvot.pg@gmail.com
In reply to: Alvaro Herrera (#2)
Re: Add two missing tests in 035_standby_logical_decoding.pl

Hi,

On 4/17/23 11:55 AM, Alvaro Herrera wrote:

On 2023-Apr-12, Drouvot, Bertrand wrote:

I'm not sure if adding those 2 tests should be considered as an open
item. I can add this open item if we think that makes sense. I'd be
happy to do so but it looks like I don't have the privileges to edit
https://wiki.postgresql.org/wiki/PostgreSQL_16_Open_Items

I think adding extra tests for new code can definitely be considered an
open item, since those tests might help to discover issues in said new
code.

Thanks for the feedback! Added as an open item.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#4 vignesh C
vignesh21@gmail.com
In reply to: Drouvot, Bertrand (#1)
Re: Add two missing tests in 035_standby_logical_decoding.pl

On Wed, 12 Apr 2023 at 21:45, Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

hi hackers,

In the logical decoding on standby thread [1], Andres proposed 2 new tests (that I did
not find the time to complete before the finish line):

- Test that we can subscribe to the standby (with the publication created on the primary)
- Verify that invalidated logical slots do not lead to retaining WAL

Please find those 2 missing tests in the patch proposal attached.

Few comments:
1) Should this change be committed as a separate patch instead of
mixing it with the new test addition patch? I feel it would be better
to split it into 0001 and 0002 patches.
# Name for the physical slot on primary
@@ -235,8 +241,6 @@ $node_primary->append_conf('postgresql.conf', q{
wal_level = 'logical'
max_replication_slots = 4
max_wal_senders = 4
-log_min_messages = 'debug2'
-log_error_verbosity = verbose
});
$node_primary->dump_info;
$node_primary->start;

2) We could add a commitfest entry for this, which will help in
checking cfbot results across platforms.

3) Should the comment say subscription instead of subscriber here?
+# We do not need the subscriber anymore
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+$node_subscriber->stop;

4) we could add a commit message for the patch

Regards,
Vignesh

#5 Drouvot, Bertrand
bertranddrouvot.pg@gmail.com
In reply to: vignesh C (#4)
2 attachment(s)
Re: Add two missing tests in 035_standby_logical_decoding.pl

Hi,

On 4/24/23 6:04 AM, vignesh C wrote:

On Wed, 12 Apr 2023 at 21:45, Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

hi hackers,

In the logical decoding on standby thread [1], Andres proposed 2 new tests (that I did
not find the time to complete before the finish line):

- Test that we can subscribe to the standby (with the publication created on the primary)
- Verify that invalidated logical slots do not lead to retaining WAL

Please find those 2 missing tests in the patch proposal attached.

Few comments:

Thanks for looking at it!

1) Should this change be committed as a separate patch instead of
mixing it with the new test addition patch? I feel it would be better
to split it into 0001 and 0002 patches.

Agree, done in V2 attached.

2) We could add a commitfest entry for this, which will help in
checking cfbot results across platforms.

Good point, done in [1].

3) Should the comment say subscription instead of subscriber here?
+# We do not need the subscriber anymore
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+$node_subscriber->stop;

Comment was due to the node_subscriber being stopped. Changed to
"We do not need the subscription and the subscriber anymore"
in V2.

4) we could add a commit message for the patch

Good point, done.

[1]: https://commitfest.postgresql.org/43/4295/

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v2-0002-Add-two-tests-in-035_standby_logical_decoding.pl.patch (text/plain)
From 9fbeb44eef53234000cccd975240a313d3956cc4 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Mon, 24 Apr 2023 05:13:23 +0000
Subject: [PATCH v2 2/2] Add two tests in 035_standby_logical_decoding.pl

Adding two tests, to verify that:

- subscription to the standby is possible
- invalidated logical slots do not lead to retaining WAL
---
 src/test/perl/PostgreSQL/Test/Cluster.pm      |  11 +-
 .../t/035_standby_logical_decoding.pl         | 161 +++++++++++++++++-
 2 files changed, 167 insertions(+), 5 deletions(-)
   4.6% src/test/perl/PostgreSQL/Test/
  95.3% src/test/recovery/t/

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 6f7f4e5de4..819667d42a 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2644,7 +2644,16 @@ sub wait_for_catchup
 	}
 	if (!defined($target_lsn))
 	{
-		$target_lsn = $self->lsn('write');
+		my $isrecovery = $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
+		chomp($isrecovery);
+		if ($isrecovery eq 't')
+		{
+			$target_lsn = $self->lsn('replay');
+		}
+		else
+		{
+			$target_lsn = $self->lsn('write');
+		}
 	}
 	print "Waiting for replication conn "
 	  . $standby_name . "'s "
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index b8f5311fe9..151ed3f496 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -9,13 +9,19 @@ use warnings;
 use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
+use Time::HiRes qw(usleep);
 
-my ($stdin, $stdout, $stderr, $cascading_stdout, $cascading_stderr, $ret, $handle, $slot);
+my ($stdin,             $stdout,            $stderr,
+	$cascading_stdout,  $cascading_stderr,  $subscriber_stdin,
+	$subscriber_stdout, $subscriber_stderr, $ret,
+	$handle,            $slot);
 
 my $node_primary = PostgreSQL::Test::Cluster->new('primary');
 my $node_standby = PostgreSQL::Test::Cluster->new('standby');
 my $node_cascading_standby = PostgreSQL::Test::Cluster->new('cascading_standby');
+my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
 my $default_timeout = $PostgreSQL::Test::Utils::timeout_default;
+my $psql_timeout    = IPC::Run::timer(2 * $default_timeout);
 my $res;
 
 # Name for the physical slot on primary
@@ -267,7 +273,8 @@ $node_standby->init_from_backup(
 	has_streaming => 1,
 	has_restoring => 1);
 $node_standby->append_conf('postgresql.conf',
-	qq[primary_slot_name = '$primary_slotname']);
+	qq[primary_slot_name = '$primary_slotname'
+       max_replication_slots = 5]);
 $node_standby->start;
 $node_primary->wait_for_replay_catchup($node_standby);
 $node_standby->safe_psql('testdb', qq[SELECT * FROM pg_create_physical_replication_slot('$standby_physical_slotname');]);
@@ -285,6 +292,27 @@ $node_cascading_standby->append_conf('postgresql.conf',
 $node_cascading_standby->start;
 $node_standby->wait_for_replay_catchup($node_cascading_standby, $node_primary);
 
+#######################
+# Initialize subscriber node
+#######################
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', 'max_replication_slots = 4');
+$node_subscriber->start;
+
+my %psql_subscriber = (
+	'subscriber_stdin'  => '',
+	'subscriber_stdout' => '',
+	'subscriber_stderr' => '');
+$psql_subscriber{run} = IPC::Run::start(
+	[ 'psql', '-XA', '-f', '-', '-d', $node_subscriber->connstr('postgres') ],
+	'<',
+	\$psql_subscriber{subscriber_stdin},
+	'>',
+	\$psql_subscriber{subscriber_stdout},
+	'2>',
+	\$psql_subscriber{subscriber_stderr},
+	$psql_timeout);
+
 ##################################################
 # Test that logical decoding on the standby
 # behaves correctly.
@@ -365,6 +393,58 @@ is( $node_primary->psql(
     3,
     'replaying logical slot from another database fails');
 
+##################################################
+# Test that we can subscribe on the standby with the publication
+# created on the primary.
+##################################################
+
+# Create a table on the primary
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE tab_rep (a int primary key)");
+
+# Create a table (same structure) on the subscriber node
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_rep (a int primary key)");
+
+# Create a publication on the primary
+$node_primary->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub for table tab_rep");
+
+# Subscribe on the standby
+my $standby_connstr = $node_standby->connstr . ' dbname=postgres';
+$psql_subscriber{subscriber_stdin} .=
+  qq[CREATE SUBSCRIPTION tap_sub CONNECTION '$standby_connstr' PUBLICATION tap_pub;];
+$psql_subscriber{subscriber_stdin} .= "\n";
+
+$psql_subscriber{run}->pump_nb();
+
+# Speed up the subscription creation
+$node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()");
+
+# Explicitly shut down psql instance gracefully - to avoid hangs
+# or worse on windows
+$psql_subscriber{subscriber_stdin} .= "\\q\n";
+$psql_subscriber{run}->finish;
+
+# Insert some rows on the primary
+$node_primary->safe_psql('postgres',
+	qq[INSERT INTO tab_rep select generate_series(1,10);]);
+
+$node_primary->wait_for_replay_catchup($node_standby);
+
+# To speed up the wait_for_subscription_sync
+$node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()");
+$node_subscriber->wait_for_subscription_sync($node_standby, 'tap_sub');
+
+# Check that the subscriber can see the rows inserted in the primary
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_rep");
+is($result, qq(10), 'check replicated inserts after subscription on standby');
+
+# We do not need the subscription and the subscriber anymore
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+$node_subscriber->stop;
+
 ##################################################
 # Recovery conflict: Invalidate conflicting slots, including in-use slots
 # Scenario 1: hot_standby_feedback off and vacuum FULL
@@ -408,9 +488,82 @@ $node_standby->restart;
 check_slots_conflicting_status(1);
 
 ##################################################
-# Verify that invalidated logical slots do not lead to retaining WAL
+# Verify that invalidated logical slots do not lead to retaining WAL.
 ##################################################
-# XXXXX TODO
+
+# Get the restart_lsn from an invalidated slot
+my $restart_lsn = $node_standby->safe_psql('postgres',
+	"SELECT restart_lsn from pg_replication_slots WHERE slot_name = 'vacuum_full_activeslot' and conflicting is true;"
+);
+
+chomp($restart_lsn);
+
+# Get the WAL file name associated to this lsn on the primary
+my $walfile_name = $node_primary->safe_psql('postgres',
+	"SELECT pg_walfile_name('$restart_lsn')");
+
+chomp($walfile_name);
+
+# Check the WAL file is still on the primary
+ok(-f $node_primary->data_dir . '/pg_wal/' . $walfile_name,
+	"WAL file still on the primary");
+
+# Get the number of WAL files on the standby
+my $nb_standby_files = $node_standby->safe_psql('postgres',
+	"SELECT COUNT(*) FROM pg_ls_dir('pg_wal')");
+
+chomp($nb_standby_files);
+
+# Switch WAL files on the primary
+my @c = (1 .. $nb_standby_files);
+
+$node_primary->safe_psql('postgres', "create table retain_test(a int)");
+
+for (@c)
+{
+	$node_primary->safe_psql(
+		'postgres', "SELECT pg_switch_wal();
+                                          insert into retain_test values("
+		  . $_ . ");");
+}
+
+# Ask for a checkpoint
+$node_primary->safe_psql('postgres', 'checkpoint;');
+
+# Check that the WAL file has not been retained on the primary
+ok(!-f $node_primary->data_dir . '/pg_wal/' . $walfile_name,
+	"WAL file not on the primary anymore");
+
+# Wait for the standby to catch up
+$node_primary->wait_for_catchup($node_standby);
+
+# Generate another WAL switch, more activity and a checkpoint
+$node_primary->safe_psql(
+	'postgres', "SELECT pg_switch_wal();
+                                         insert into retain_test values(1);");
+$node_primary->safe_psql('postgres', 'checkpoint;');
+
+# Wait for the standby to catch up
+$node_primary->wait_for_catchup($node_standby);
+
+# Verify that the wal file has not been retained on the standby
+my $standby_walfile = $node_standby->data_dir . '/pg_wal/' . $walfile_name;
+
+# We can not test if the WAL file still exists immediately.
+# We need to let some time to the standby to actually "remove" it.
+my $i = 0;
+while (1)
+{
+	last if !-f $standby_walfile;
+	if ($i++ == 10 * $default_timeout)
+	{
+		die
+		  "could not determine if WAL file has been retained or not, can't continue";
+	}
+	usleep(100_000);
+}
+
+ok(1, "invalidated logical slots do not lead to retaining WAL");
 
 ##################################################
 # Recovery conflict: Invalidate conflicting slots, including in-use slots
-- 
2.34.1

v2-0001-Lower-log-level-in-035_standby_logical_decoding.p.patch (text/plain)
From d7ea3bddda655de8f5c91ca9bd83859e8cf2cba6 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Mon, 24 Apr 2023 05:10:53 +0000
Subject: [PATCH v2 1/2] Lower log level in 035_standby_logical_decoding.pl

Lower log level in 035_standby_logical_decoding.pl to shrink the output size
and speed up the test a bit.
---
 src/test/recovery/t/035_standby_logical_decoding.pl | 2 --
 1 file changed, 2 deletions(-)
 100.0% src/test/recovery/t/

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 20838d49b5..b8f5311fe9 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -235,8 +235,6 @@ $node_primary->append_conf('postgresql.conf', q{
 wal_level = 'logical'
 max_replication_slots = 4
 max_wal_senders = 4
-log_min_messages = 'debug2'
-log_error_verbosity = verbose
 });
 $node_primary->dump_info;
 $node_primary->start;
-- 
2.34.1

#6 Amit Kapila
amit.kapila16@gmail.com
In reply to: Drouvot, Bertrand (#5)
Re: Add two missing tests in 035_standby_logical_decoding.pl

On Mon, Apr 24, 2023 at 11:24 AM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

Few comments:
============
1.
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', 'max_replication_slots = 4');

Why do we need slots on the subscriber?

2.
+# Speed up the subscription creation
+$node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()");
+
+# Explicitly shut down psql instance gracefully - to avoid hangs
+# or worse on windows
+$psql_subscriber{subscriber_stdin} .= "\\q\n";
+$psql_subscriber{run}->finish;
+
+# Insert some rows on the primary
+$node_primary->safe_psql('postgres',
+ qq[INSERT INTO tab_rep select generate_series(1,10);]);
+
+$node_primary->wait_for_replay_catchup($node_standby);
+
+# To speed up the wait_for_subscription_sync
+$node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()");
+$node_subscriber->wait_for_subscription_sync($node_standby, 'tap_sub');

It is not clear to me why you need to do pg_log_standby_snapshot() twice.

3. Why do you need $psql_subscriber to be used in a different way
instead of using safe_psql as is used for node_primary?

--
With Regards,
Amit Kapila.

#7 Amit Kapila
amit.kapila16@gmail.com
In reply to: Drouvot, Bertrand (#1)
Re: Add two missing tests in 035_standby_logical_decoding.pl

On Wed, Apr 12, 2023 at 9:45 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

The attached patch also removes:

"
-log_min_messages = 'debug2'
-log_error_verbosity = verbose
"

as also discussed in [1].

I agree that we should reduce the log level here. It is discussed in
an email [1]. I'll push this part tomorrow unless Andres or someone
else thinks that we still need this.

[1]: /messages/by-id/523315.1681245505@sss.pgh.pa.us

--
With Regards,
Amit Kapila.

#8 Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#6)
Re: Add two missing tests in 035_standby_logical_decoding.pl

On Mon, Apr 24, 2023 at 11:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Apr 24, 2023 at 11:24 AM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

Few comments:
============

+# We can not test if the WAL file still exists immediately.
+# We need to let some time to the standby to actually "remove" it.
+my $i = 0;
+while (1)
+{
+ last if !-f $standby_walfile;
+ if ($i++ == 10 * $default_timeout)
+ {
+ die
+   "could not determine if WAL file has been retained or not, can't continue";
+ }
+ usleep(100_000);
+}

Is this adhoc wait required because we can't guarantee that the
checkpoint is complete on standby even after using wait_for_catchup?
Is there a guarantee that it can never fail on some slower machines?

BTW, for the second test is it necessary that we first ensure that the
WAL file has not been retained on the primary?

--
With Regards,
Amit Kapila.

#9 Drouvot, Bertrand
bertranddrouvot.pg@gmail.com
In reply to: Amit Kapila (#6)
2 attachment(s)
Re: Add two missing tests in 035_standby_logical_decoding.pl

Hi,

On 4/24/23 8:24 AM, Amit Kapila wrote:

On Mon, Apr 24, 2023 at 11:24 AM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

Few comments:
============

Thanks for looking at it!

1.
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', 'max_replication_slots = 4');

Why do we need slots on the subscriber?

Good point, it's not needed. I guess it was missed during my initial patch cleanup.

Fixed in V3 attached.

2.
+# Speed up the subscription creation
+$node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()");
+
+# Explicitly shut down psql instance gracefully - to avoid hangs
+# or worse on windows
+$psql_subscriber{subscriber_stdin} .= "\\q\n";
+$psql_subscriber{run}->finish;
+
+# Insert some rows on the primary
+$node_primary->safe_psql('postgres',
+ qq[INSERT INTO tab_rep select generate_series(1,10);]);
+
+$node_primary->wait_for_replay_catchup($node_standby);
+
+# To speed up the wait_for_subscription_sync
+$node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()");
+$node_subscriber->wait_for_subscription_sync($node_standby, 'tap_sub');

It is not clear to me why you need to do pg_log_standby_snapshot() twice.

That's because there are two logical slot creations that have to be done on the standby.

The one for the subscription:

"
CREATE_REPLICATION_SLOT "tap_sub" LOGICAL pgoutput (SNAPSHOT 'nothing')
"

And the one for the data sync:

"
CREATE_REPLICATION_SLOT "pg_16389_sync_16384_7225540800768250444" LOGICAL pgoutput (SNAPSHOT 'use')
"

Without the second "pg_log_standby_snapshot()" then wait_for_subscription_sync() would be waiting
some time on the poll for "SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');"

Adding a comment in V3 to explain the need for the second pg_log_standby_snapshot().
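
In other words, each of the two slot creations performed on the standby waits for a
standby snapshot (running-xacts record) from the primary, and pg_log_standby_snapshot()
provides one without having to wait for the next periodic one. Condensed, the V3
ordering is (a sketch only, names as in the patch):

"
# 1st call: lets the standby finish creating the subscription's own
# slot (tap_sub) requested by the background CREATE SUBSCRIPTION
$node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()");
# (rows are then inserted and the standby catches up)
# 2nd call: lets the standby finish creating the tablesync slot
# (pg_%sync%), so wait_for_subscription_sync() can complete
$node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()");
$node_subscriber->wait_for_subscription_sync($node_standby, 'tap_sub');
"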

3. Why do you need $psql_subscriber to be used in a different way
instead of using safe_psql as is used for node_primary?

Because safe_psql() would wait for activity on the primary without being able to launch
pg_log_standby_snapshot() on the primary in the meantime. psql_subscriber() makes it
possible not to wait synchronously.

Also adding a comment in V3 to explain why safe_psql() is not being used here.
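
For reference, the non-blocking part relies on a long-lived psql session driven through
IPC::Run (as in the patch): the CREATE SUBSCRIPTION is appended to its stdin and pumped
with pump_nb(), which returns immediately and leaves the test free to run
pg_log_standby_snapshot() on the primary:

"
$psql_subscriber{subscriber_stdin} .=
  qq[CREATE SUBSCRIPTION tap_sub CONNECTION '$standby_connstr' PUBLICATION tap_pub;\n];
# pump_nb() feeds the command to psql without waiting for it to finish
$psql_subscriber{run}->pump_nb();
# the test can now unblock the slot creation happening on the standby
$node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()");
"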

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v3-0002-Add-two-tests-in-035_standby_logical_decoding.pl.patch (text/plain)
From f74fb99ff67e47bb78bdff62bdf1ac824a8fb24b Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Mon, 24 Apr 2023 05:13:23 +0000
Subject: [PATCH v3 2/2] Add two tests in 035_standby_logical_decoding.pl

Adding two tests, to verify that:

- subscription to the standby is possible
- invalidated logical slots do not lead to retaining WAL
---
 src/test/perl/PostgreSQL/Test/Cluster.pm      |  11 +-
 .../t/035_standby_logical_decoding.pl         | 167 +++++++++++++++++-
 2 files changed, 173 insertions(+), 5 deletions(-)
   4.4% src/test/perl/PostgreSQL/Test/
  95.5% src/test/recovery/t/

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 6f7f4e5de4..819667d42a 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2644,7 +2644,16 @@ sub wait_for_catchup
 	}
 	if (!defined($target_lsn))
 	{
-		$target_lsn = $self->lsn('write');
+		my $isrecovery = $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
+		chomp($isrecovery);
+		if ($isrecovery eq 't')
+		{
+			$target_lsn = $self->lsn('replay');
+		}
+		else
+		{
+			$target_lsn = $self->lsn('write');
+		}
 	}
 	print "Waiting for replication conn "
 	  . $standby_name . "'s "
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index b8f5311fe9..647e203b0f 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -9,13 +9,19 @@ use warnings;
 use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
+use Time::HiRes qw(usleep);
 
-my ($stdin, $stdout, $stderr, $cascading_stdout, $cascading_stderr, $ret, $handle, $slot);
+my ($stdin,             $stdout,            $stderr,
+	$cascading_stdout,  $cascading_stderr,  $subscriber_stdin,
+	$subscriber_stdout, $subscriber_stderr, $ret,
+	$handle,            $slot);
 
 my $node_primary = PostgreSQL::Test::Cluster->new('primary');
 my $node_standby = PostgreSQL::Test::Cluster->new('standby');
 my $node_cascading_standby = PostgreSQL::Test::Cluster->new('cascading_standby');
+my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
 my $default_timeout = $PostgreSQL::Test::Utils::timeout_default;
+my $psql_timeout    = IPC::Run::timer(2 * $default_timeout);
 my $res;
 
 # Name for the physical slot on primary
@@ -267,7 +273,8 @@ $node_standby->init_from_backup(
 	has_streaming => 1,
 	has_restoring => 1);
 $node_standby->append_conf('postgresql.conf',
-	qq[primary_slot_name = '$primary_slotname']);
+	qq[primary_slot_name = '$primary_slotname'
+       max_replication_slots = 5]);
 $node_standby->start;
 $node_primary->wait_for_replay_catchup($node_standby);
 $node_standby->safe_psql('testdb', qq[SELECT * FROM pg_create_physical_replication_slot('$standby_physical_slotname');]);
@@ -285,6 +292,26 @@ $node_cascading_standby->append_conf('postgresql.conf',
 $node_cascading_standby->start;
 $node_standby->wait_for_replay_catchup($node_cascading_standby, $node_primary);
 
+#######################
+# Initialize subscriber node
+#######################
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+my %psql_subscriber = (
+	'subscriber_stdin'  => '',
+	'subscriber_stdout' => '',
+	'subscriber_stderr' => '');
+$psql_subscriber{run} = IPC::Run::start(
+	[ 'psql', '-XA', '-f', '-', '-d', $node_subscriber->connstr('postgres') ],
+	'<',
+	\$psql_subscriber{subscriber_stdin},
+	'>',
+	\$psql_subscriber{subscriber_stdout},
+	'2>',
+	\$psql_subscriber{subscriber_stderr},
+	$psql_timeout);
+
 ##################################################
 # Test that logical decoding on the standby
 # behaves correctly.
@@ -365,6 +392,65 @@ is( $node_primary->psql(
     3,
     'replaying logical slot from another database fails');
 
+##################################################
+# Test that we can subscribe on the standby with the publication
+# created on the primary.
+##################################################
+
+# Create a table on the primary
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE tab_rep (a int primary key)");
+
+# Create a table (same structure) on the subscriber node
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_rep (a int primary key)");
+
+# Create a publication on the primary
+$node_primary->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub for table tab_rep");
+
+# Subscribe on the standby
+my $standby_connstr = $node_standby->connstr . ' dbname=postgres';
+
+# Not using safe_psql() here as it would wait for activity on the primary
+# without being able to launch pg_log_standby_snapshot() on the primary while
+# waiting.
+# psql_subscriber() allows to not wait synchronously.
+$psql_subscriber{subscriber_stdin} .=
+  qq[CREATE SUBSCRIPTION tap_sub CONNECTION '$standby_connstr' PUBLICATION tap_pub;];
+$psql_subscriber{subscriber_stdin} .= "\n";
+
+$psql_subscriber{run}->pump_nb();
+
+# Speed up the subscription creation
+$node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()");
+
+# Explicitly shut down psql instance gracefully - to avoid hangs
+# or worse on windows
+$psql_subscriber{subscriber_stdin} .= "\\q\n";
+$psql_subscriber{run}->finish;
+
+# Insert some rows on the primary
+$node_primary->safe_psql('postgres',
+	qq[INSERT INTO tab_rep select generate_series(1,10);]);
+
+$node_primary->wait_for_replay_catchup($node_standby);
+
+# To speed up the wait_for_subscription_sync
+# Indeed a second logical slot (pg_%sync%: exact name coming from ReplicationSlotNameForTablesync())
+# is created under the hood on the standby for the data sync
+$node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()");
+$node_subscriber->wait_for_subscription_sync($node_standby, 'tap_sub');
+
+# Check that the subscriber can see the rows inserted in the primary
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_rep");
+is($result, qq(10), 'check replicated inserts after subscription on standby');
+
+# We do not need the subscription and the subscriber anymore
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+$node_subscriber->stop;
+
 ##################################################
 # Recovery conflict: Invalidate conflicting slots, including in-use slots
 # Scenario 1: hot_standby_feedback off and vacuum FULL
@@ -408,9 +494,82 @@ $node_standby->restart;
 check_slots_conflicting_status(1);
 
 ##################################################
-# Verify that invalidated logical slots do not lead to retaining WAL
+# Verify that invalidated logical slots do not lead to retaining WAL.
 ##################################################
-# XXXXX TODO
+
+# Get the restart_lsn from an invalidated slot
+my $restart_lsn = $node_standby->safe_psql('postgres',
+	"SELECT restart_lsn from pg_replication_slots WHERE slot_name = 'vacuum_full_activeslot' and conflicting is true;"
+);
+
+chomp($restart_lsn);
+
+# Get the WAL file name associated to this lsn on the primary
+my $walfile_name = $node_primary->safe_psql('postgres',
+	"SELECT pg_walfile_name('$restart_lsn')");
+
+chomp($walfile_name);
+
+# Check the WAL file is still on the primary
+ok(-f $node_primary->data_dir . '/pg_wal/' . $walfile_name,
+	"WAL file still on the primary");
+
+# Get the number of WAL files on the standby
+my $nb_standby_files = $node_standby->safe_psql('postgres',
+	"SELECT COUNT(*) FROM pg_ls_dir('pg_wal')");
+
+chomp($nb_standby_files);
+
+# Switch WAL files on the primary
+my @c = (1 .. $nb_standby_files);
+
+$node_primary->safe_psql('postgres', "create table retain_test(a int)");
+
+for (@c)
+{
+	$node_primary->safe_psql(
+		'postgres', "SELECT pg_switch_wal();
+                                          insert into retain_test values("
+		  . $_ . ");");
+}
+
+# Ask for a checkpoint
+$node_primary->safe_psql('postgres', 'checkpoint;');
+
+# Check that the WAL file has not been retained on the primary
+ok(!-f $node_primary->data_dir . '/pg_wal/' . $walfile_name,
+	"WAL file not on the primary anymore");
+
+# Wait for the standby to catch up
+$node_primary->wait_for_catchup($node_standby);
+
+# Generate another WAL switch, more activity and a checkpoint
+$node_primary->safe_psql(
+	'postgres', "SELECT pg_switch_wal();
+                                         insert into retain_test values(1);");
+$node_primary->safe_psql('postgres', 'checkpoint;');
+
+# Wait for the standby to catch up
+$node_primary->wait_for_catchup($node_standby);
+
+# Verify that the wal file has not been retained on the standby
+my $standby_walfile = $node_standby->data_dir . '/pg_wal/' . $walfile_name;
+
+# We can not test if the WAL file still exists immediately.
+# We need to let some time to the standby to actually "remove" it.
+my $i = 0;
+while (1)
+{
+	last if !-f $standby_walfile;
+	if ($i++ == 10 * $default_timeout)
+	{
+		die
+		  "could not determine if WAL file has been retained or not, can't continue";
+	}
+	usleep(100_000);
+}
+
+ok(1, "invalidated logical slots do not lead to retaining WAL");
 
 ##################################################
 # Recovery conflict: Invalidate conflicting slots, including in-use slots
-- 
2.34.1

v3-0001-Lower-log-level-in-035_standby_logical_decoding.p.patch (text/plain)
From dbbe19e8bec4e664f8e2d7093208bed1e6f57e31 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Mon, 24 Apr 2023 05:10:53 +0000
Subject: [PATCH v3 1/2] Lower log level in 035_standby_logical_decoding.pl

Lower log level in 035_standby_logical_decoding.pl to shrink the output size
and speed up the test a bit.
---
 src/test/recovery/t/035_standby_logical_decoding.pl | 2 --
 1 file changed, 2 deletions(-)
 100.0% src/test/recovery/t/

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 20838d49b5..b8f5311fe9 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -235,8 +235,6 @@ $node_primary->append_conf('postgresql.conf', q{
 wal_level = 'logical'
 max_replication_slots = 4
 max_wal_senders = 4
-log_min_messages = 'debug2'
-log_error_verbosity = verbose
 });
 $node_primary->dump_info;
 $node_primary->start;
-- 
2.34.1

#10 Drouvot, Bertrand
bertranddrouvot.pg@gmail.com
In reply to: Amit Kapila (#8)
Re: Add two missing tests in 035_standby_logical_decoding.pl

Hi,

On 4/24/23 11:45 AM, Amit Kapila wrote:

On Mon, Apr 24, 2023 at 11:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Apr 24, 2023 at 11:24 AM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

Few comments:
============

+# We can not test if the WAL file still exists immediately.
+# We need to let some time to the standby to actually "remove" it.
+my $i = 0;
+while (1)
+{
+ last if !-f $standby_walfile;
+ if ($i++ == 10 * $default_timeout)
+ {
+ die
+   "could not determine if WAL file has been retained or not, can't continue";
+ }
+ usleep(100_000);
+}

Is this adhoc wait required because we can't guarantee that the
checkpoint is complete on standby even after using wait_for_catchup?

Yes, the restart point on the standby is not necessarily completed even after wait_for_catchup is done.

Is there a guarantee that it can never fail on some slower machines?

We are waiting here at a maximum for 10 * $default_timeout (means 3 minutes) before
we time out. Would you prefer to wait more than 3 minutes at a maximum?

BTW, for the second test is it necessary that we first ensure that the
WAL file has not been retained on the primary?

I was not sure it was worth it either. The idea was more that it's useless to verify the file is removed on
the standby if we are not 100% sure it has been removed on the primary first. But yeah, we can get
rid of this check if you prefer.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#11 Amit Kapila
amit.kapila16@gmail.com
In reply to: Drouvot, Bertrand (#9)
Re: Add two missing tests in 035_standby_logical_decoding.pl

On Mon, Apr 24, 2023 at 3:36 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

On 4/24/23 8:24 AM, Amit Kapila wrote:

2.
+# Speed up the subscription creation
+$node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()");
+
+# Explicitly shut down psql instance gracefully - to avoid hangs
+# or worse on windows
+$psql_subscriber{subscriber_stdin} .= "\\q\n";
+$psql_subscriber{run}->finish;
+
+# Insert some rows on the primary
+$node_primary->safe_psql('postgres',
+ qq[INSERT INTO tab_rep select generate_series(1,10);]);
+
+$node_primary->wait_for_replay_catchup($node_standby);
+
+# To speed up the wait_for_subscription_sync
+$node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()");
+$node_subscriber->wait_for_subscription_sync($node_standby, 'tap_sub');

It is not clear to me why you need to do pg_log_standby_snapshot() twice.

That's because there are two logical slot creations that have to be done on the standby.

The one for the subscription:

"
CREATE_REPLICATION_SLOT "tap_sub" LOGICAL pgoutput (SNAPSHOT 'nothing')
"

And the one for the data sync:

"
CREATE_REPLICATION_SLOT "pg_16389_sync_16384_7225540800768250444" LOGICAL pgoutput (SNAPSHOT 'use')
"

Without the second "pg_log_standby_snapshot()" then wait_for_subscription_sync() would be waiting
some time on the poll for "SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');"

Adding a comment in V3 to explain the need for the second pg_log_standby_snapshot().

Won't this still be unpredictable because it is possible that the
tablesync worker may take more time to get launched or create a
replication slot? If that happens after your second
pg_log_standby_snapshot() then wait_for_subscription_sync() will be
hanging. Wouldn't it be better to create a subscription with
(copy_data = false) to make it predictable and then we won't need
pg_log_standby_snapshot() to be performed twice?

If you agree with the above suggestion then you probably need to move
wait_for_subscription_sync() before Insert.

--
With Regards,
Amit Kapila.

#12 Amit Kapila
amit.kapila16@gmail.com
In reply to: Drouvot, Bertrand (#10)
Re: Add two missing tests in 035_standby_logical_decoding.pl

On Mon, Apr 24, 2023 at 5:38 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

On 4/24/23 11:45 AM, Amit Kapila wrote:

On Mon, Apr 24, 2023 at 11:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Apr 24, 2023 at 11:24 AM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

Few comments:
============

+# We can not test if the WAL file still exists immediately.
+# We need to let some time to the standby to actually "remove" it.
+my $i = 0;
+while (1)
+{
+ last if !-f $standby_walfile;
+ if ($i++ == 10 * $default_timeout)
+ {
+ die
+   "could not determine if WAL file has been retained or not, can't continue";
+ }
+ usleep(100_000);
+}

Is this adhoc wait required because we can't guarantee that the
checkpoint is complete on standby even after using wait_for_catchup?

Yes, the restart point on the standby is not necessarily completed even after wait_for_catchup is done.

Is there a guarantee that it can never fail on some slower machines?

We are waiting here at a maximum for 10 * $default_timeout (means 3 minutes) before
we time out. Would you prefer to wait more than 3 minutes at a maximum?

No, because I don't know what would be a suitable timeout here. At
this stage, I don't have a good idea on how to implement this test in
a better way. Can we split this into a separate patch as the first
test is a bit straightforward, we can push that one and then
brainstorm on if there is a better way to test this functionality.

--
With Regards,
Amit Kapila.

#13 Drouvot, Bertrand
bertranddrouvot.pg@gmail.com
In reply to: Amit Kapila (#11)
2 attachment(s)
Re: Add two missing tests in 035_standby_logical_decoding.pl

Hi,

On 4/25/23 6:23 AM, Amit Kapila wrote:

On Mon, Apr 24, 2023 at 3:36 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

Without the second "pg_log_standby_snapshot()" then wait_for_subscription_sync() would be waiting
some time on the poll for "SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');"

Adding a comment in V3 to explain the need for the second pg_log_standby_snapshot().

Won't this still be unpredictable because it is possible that the
tablesync worker may take more time to get launched or create a
replication slot? If that happens after your second
pg_log_standby_snapshot() then wait_for_subscription_sync() will be
hanging.

Oh right, that looks like a possible scenario.

Wouldn't it be better to create a subscription with
(copy_data = false) to make it predictable and then we won't need
pg_log_standby_snapshot() to be performed twice?

If you agree with the above suggestion then you probably need to move
wait_for_subscription_sync() before Insert.

I like that idea, thanks! Done in V4 attached.
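
Concretely, the subscription in V4 is now created with copy_data disabled and the sync
is awaited before the rows are inserted (excerpt from the attached patch):

"
$psql_subscriber{subscriber_stdin} .=
  qq[CREATE SUBSCRIPTION tap_sub
     CONNECTION '$standby_connstr'
     PUBLICATION tap_pub
     WITH (copy_data = off);];
[...]
$node_subscriber->wait_for_subscription_sync($node_standby, 'tap_sub');

# Insert some rows on the primary
$node_primary->safe_psql('postgres',
	qq[INSERT INTO tab_rep select generate_series(1,10);]);
"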

Not related to the above corner case, but while re-reading the patch I also added:

"
$node_primary->wait_for_replay_catchup($node_standby);
"

between the publication creation on the primary and the subscription to the standby
(to ensure the publication gets replicated before we request the subscription creation).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v4-0002-Add-retained-WAL-test-in-035_standby_logical_deco.patch (text/plain)
From ed92ea9bf3385d2fe9e4ef0d8a04b87b1c7e6b3d Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Tue, 25 Apr 2023 06:02:17 +0000
Subject: [PATCH v4 2/2] Add retained WAL test in
 035_standby_logical_decoding.pl

Adding one test, to verify that invalidated logical slots do not lead to
retaining WAL.
---
 .../t/035_standby_logical_decoding.pl         | 78 ++++++++++++++++++-
 1 file changed, 76 insertions(+), 2 deletions(-)
 100.0% src/test/recovery/t/

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 03346f44f2..6ae4fc1e02 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -9,6 +9,7 @@ use warnings;
 use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
+use Time::HiRes qw(usleep);
 
 my ($stdin,             $stdout,            $stderr,
 	$cascading_stdout,  $cascading_stderr,  $subscriber_stdin,
@@ -495,9 +496,82 @@ $node_standby->restart;
 check_slots_conflicting_status(1);
 
 ##################################################
-# Verify that invalidated logical slots do not lead to retaining WAL
+# Verify that invalidated logical slots do not lead to retaining WAL.
 ##################################################
-# XXXXX TODO
+
+# Get the restart_lsn from an invalidated slot
+my $restart_lsn = $node_standby->safe_psql('postgres',
+	"SELECT restart_lsn from pg_replication_slots WHERE slot_name = 'vacuum_full_activeslot' and conflicting is true;"
+);
+
+chomp($restart_lsn);
+
+# Get the WAL file name associated to this lsn on the primary
+my $walfile_name = $node_primary->safe_psql('postgres',
+	"SELECT pg_walfile_name('$restart_lsn')");
+
+chomp($walfile_name);
+
+# Check the WAL file is still on the primary
+ok(-f $node_primary->data_dir . '/pg_wal/' . $walfile_name,
+	"WAL file still on the primary");
+
+# Get the number of WAL files on the standby
+my $nb_standby_files = $node_standby->safe_psql('postgres',
+	"SELECT COUNT(*) FROM pg_ls_dir('pg_wal')");
+
+chomp($nb_standby_files);
+
+# Switch WAL files on the primary
+my @c = (1 .. $nb_standby_files);
+
+$node_primary->safe_psql('postgres', "create table retain_test(a int)");
+
+for (@c)
+{
+	$node_primary->safe_psql(
+		'postgres', "SELECT pg_switch_wal();
+                                          insert into retain_test values("
+		  . $_ . ");");
+}
+
+# Ask for a checkpoint
+$node_primary->safe_psql('postgres', 'checkpoint;');
+
+# Check that the WAL file has not been retained on the primary
+ok(!-f $node_primary->data_dir . '/pg_wal/' . $walfile_name,
+	"WAL file not on the primary anymore");
+
+# Wait for the standby to catch up
+$node_primary->wait_for_catchup($node_standby);
+
+# Generate another WAL switch, more activity and a checkpoint
+$node_primary->safe_psql(
+	'postgres', "SELECT pg_switch_wal();
+                                         insert into retain_test values(1);");
+$node_primary->safe_psql('postgres', 'checkpoint;');
+
+# Wait for the standby to catch up
+$node_primary->wait_for_catchup($node_standby);
+
+# Verify that the wal file has not been retained on the standby
+my $standby_walfile = $node_standby->data_dir . '/pg_wal/' . $walfile_name;
+
+# We can not test if the WAL file still exists immediately.
+# We need to let some time to the standby to actually "remove" it.
+my $i = 0;
+while (1)
+{
+	last if !-f $standby_walfile;
+	if ($i++ == 10 * $default_timeout)
+	{
+		die
+		  "could not determine if WAL file has been retained or not, can't continue";
+	}
+	usleep(100_000);
+}
+
+ok(1, "invalidated logical slots do not lead to retaining WAL");
 
 ##################################################
 # Recovery conflict: Invalidate conflicting slots, including in-use slots
-- 
2.34.1

v4-0001-Add-subscribtion-to-the-standby-test-in-035_stand.patch (text/plain)
From 53507b322cd0af79b15cb62eff8e39422a8f98c6 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Mon, 24 Apr 2023 05:13:23 +0000
Subject: [PATCH v4 1/2] Add subscription to the standby test in
 035_standby_logical_decoding.pl

Adding one test, to verify that subscription to the standby is possible.
---
 src/test/perl/PostgreSQL/Test/Cluster.pm      | 11 ++-
 .../t/035_standby_logical_decoding.pl         | 91 ++++++++++++++++++-
 2 files changed, 99 insertions(+), 3 deletions(-)
   7.5% src/test/perl/PostgreSQL/Test/
  92.4% src/test/recovery/t/

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 6f7f4e5de4..819667d42a 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2644,7 +2644,16 @@ sub wait_for_catchup
 	}
 	if (!defined($target_lsn))
 	{
-		$target_lsn = $self->lsn('write');
+		my $isrecovery = $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
+		chomp($isrecovery);
+		if ($isrecovery eq 't')
+		{
+			$target_lsn = $self->lsn('replay');
+		}
+		else
+		{
+			$target_lsn = $self->lsn('write');
+		}
 	}
 	print "Waiting for replication conn "
 	  . $standby_name . "'s "
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index b8f5311fe9..03346f44f2 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -10,12 +10,17 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 
-my ($stdin, $stdout, $stderr, $cascading_stdout, $cascading_stderr, $ret, $handle, $slot);
+my ($stdin,             $stdout,            $stderr,
+	$cascading_stdout,  $cascading_stderr,  $subscriber_stdin,
+	$subscriber_stdout, $subscriber_stderr, $ret,
+	$handle,            $slot);
 
 my $node_primary = PostgreSQL::Test::Cluster->new('primary');
 my $node_standby = PostgreSQL::Test::Cluster->new('standby');
 my $node_cascading_standby = PostgreSQL::Test::Cluster->new('cascading_standby');
+my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
 my $default_timeout = $PostgreSQL::Test::Utils::timeout_default;
+my $psql_timeout    = IPC::Run::timer(2 * $default_timeout);
 my $res;
 
 # Name for the physical slot on primary
@@ -267,7 +272,8 @@ $node_standby->init_from_backup(
 	has_streaming => 1,
 	has_restoring => 1);
 $node_standby->append_conf('postgresql.conf',
-	qq[primary_slot_name = '$primary_slotname']);
+	qq[primary_slot_name = '$primary_slotname'
+       max_replication_slots = 5]);
 $node_standby->start;
 $node_primary->wait_for_replay_catchup($node_standby);
 $node_standby->safe_psql('testdb', qq[SELECT * FROM pg_create_physical_replication_slot('$standby_physical_slotname');]);
@@ -285,6 +291,26 @@ $node_cascading_standby->append_conf('postgresql.conf',
 $node_cascading_standby->start;
 $node_standby->wait_for_replay_catchup($node_cascading_standby, $node_primary);
 
+#######################
+# Initialize subscriber node
+#######################
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+my %psql_subscriber = (
+	'subscriber_stdin'  => '',
+	'subscriber_stdout' => '',
+	'subscriber_stderr' => '');
+$psql_subscriber{run} = IPC::Run::start(
+	[ 'psql', '-XA', '-f', '-', '-d', $node_subscriber->connstr('postgres') ],
+	'<',
+	\$psql_subscriber{subscriber_stdin},
+	'>',
+	\$psql_subscriber{subscriber_stdout},
+	'2>',
+	\$psql_subscriber{subscriber_stderr},
+	$psql_timeout);
+
 ##################################################
 # Test that logical decoding on the standby
 # behaves correctly.
@@ -365,6 +391,67 @@ is( $node_primary->psql(
     3,
     'replaying logical slot from another database fails');
 
+##################################################
+# Test that we can subscribe on the standby with the publication
+# created on the primary.
+##################################################
+
+# Create a table on the primary
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE tab_rep (a int primary key)");
+
+# Create a table (same structure) on the subscriber node
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_rep (a int primary key)");
+
+# Create a publication on the primary
+$node_primary->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub for table tab_rep");
+
+$node_primary->wait_for_replay_catchup($node_standby);
+
+# Subscribe on the standby
+my $standby_connstr = $node_standby->connstr . ' dbname=postgres';
+
+# Not using safe_psql() here as it would wait for activity on the primary
+# and we wouldn't be able to launch pg_log_standby_snapshot() on the primary
+# while waiting.
+# psql_subscriber() allows to not wait synchronously.
+$psql_subscriber{subscriber_stdin} .=
+  qq[CREATE SUBSCRIPTION tap_sub
+     CONNECTION '$standby_connstr'
+     PUBLICATION tap_pub
+     WITH (copy_data = off);];
+$psql_subscriber{subscriber_stdin} .= "\n";
+
+$psql_subscriber{run}->pump_nb();
+
+# Speed up the subscription creation
+$node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()");
+
+# Explicitly shut down psql instance gracefully - to avoid hangs
+# or worse on windows
+$psql_subscriber{subscriber_stdin} .= "\\q\n";
+$psql_subscriber{run}->finish;
+
+$node_subscriber->wait_for_subscription_sync($node_standby, 'tap_sub');
+
+# Insert some rows on the primary
+$node_primary->safe_psql('postgres',
+	qq[INSERT INTO tab_rep select generate_series(1,10);]);
+
+$node_primary->wait_for_replay_catchup($node_standby);
+$node_standby->wait_for_catchup('tap_sub');
+
+# Check that the subscriber can see the rows inserted in the primary
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_rep");
+is($result, qq(10), 'check replicated inserts after subscription on standby');
+
+# We do not need the subscription and the subscriber anymore
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+$node_subscriber->stop;
+
 ##################################################
 # Recovery conflict: Invalidate conflicting slots, including in-use slots
 # Scenario 1: hot_standby_feedback off and vacuum FULL
-- 
2.34.1

#14Drouvot, Bertrand
bertranddrouvot.pg@gmail.com
In reply to: Amit Kapila (#12)
Re: Add two missing tests in 035_standby_logical_decoding.pl

Hi,

On 4/25/23 6:43 AM, Amit Kapila wrote:

On Mon, Apr 24, 2023 at 5:38 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

We are waiting here at a maximum for 10 * $default_timeout (means 3 minutes) before
we time out. Would you prefer to wait more than 3 minutes at a maximum?

No, because I don't know what would be a suitable timeout here.

Yeah, I understand that. On the other hand, there are other places that
rely on a timeout, for example:

- wait_for_catchup(), wait_for_slot_catchup() and wait_for_subscription_sync(),
by making use of poll_query_until().
- wait_for_log(), by setting a max_attempts.

Couldn't we have the same concern for those (i.e., whether the timeout is
suitable on slower machines)?
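
Just to make the comparison concrete, the kind of bounded wait we are
discussing here is no more than the following (a sketch only, reusing the
$standby_walfile and $default_timeout variables from the patch, not the
final test code):

use Time::HiRes qw(usleep);

# Poll every 100ms until the file is gone; give up after roughly
# $default_timeout seconds.
my $attempts = 0;
while (-f $standby_walfile)
{
	die "timed out waiting for the WAL file to be removed"
	  if ($attempts++ >= 10 * $default_timeout);
	usleep(100_000);
}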

At
this stage, I don't have a good idea on how to implement this test in
a better way. Can we split this into a separate patch as the first
test is a bit straightforward, we can push that one and then
brainstorm on if there is a better way to test this functionality.

I created a dedicated v4-0002-Add-retained-WAL-test-in-035_standby_logical_deco.patch
just shared up-thread.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#15vignesh C
vignesh21@gmail.com
In reply to: Drouvot, Bertrand (#13)
Re: Add two missing tests in 035_standby_logical_decoding.pl

On Tue, 25 Apr 2023 at 12:51, Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

Hi,

On 4/25/23 6:23 AM, Amit Kapila wrote:

On Mon, Apr 24, 2023 at 3:36 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

Without the second "pg_log_standby_snapshot()" then wait_for_subscription_sync() would be waiting
some time on the poll for "SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');"

Adding a comment in V3 to explain the need for the second pg_log_standby_snapshot().

Won't this still be unpredictable because it is possible that the
tablesync worker may take more time to get launched or create a
replication slot? If that happens after your second
pg_log_standby_snapshot() then wait_for_subscription_sync() will be
hanging.

Oh right, that looks like a possible scenario.

Wouldn't it be better to create a subscription with
(copy_data = false) to make it predictable and then we won't need
pg_log_standby_snapshot() to be performed twice?

If you agree with the above suggestion then you probably need to move
wait_for_subscription_sync() before Insert.

I like that idea, thanks! Done in V4 attached.

Not related to the above corner case, but while re-reading the patch I also added:

"
$node_primary->wait_for_replay_catchup($node_standby);
"

between the publication creation on the primary and the subscription to the standby
(to ensure the publication gets replicated before we request for the subscription creation).

Thanks for the updated patch.
Few comments:
1) subscriber_stdout and subscriber_stderr are not required for this
test case; we could remove them. I was able to remove those variables
and run the test successfully:
+$node_subscriber->start;
+
+my %psql_subscriber = (
+       'subscriber_stdin'  => '',
+       'subscriber_stdout' => '',
+       'subscriber_stderr' => '');
+$psql_subscriber{run} = IPC::Run::start(
+       [ 'psql', '-XA', '-f', '-', '-d',
$node_subscriber->connstr('postgres') ],
+       '<',
+       \$psql_subscriber{subscriber_stdin},
+       '>',
+       \$psql_subscriber{subscriber_stdout},
+       '2>',
+       \$psql_subscriber{subscriber_stderr},
+       $psql_timeout);

I ran it like:
my %psql_subscriber = (
'subscriber_stdin' => '');
$psql_subscriber{run} = IPC::Run::start(
[ 'psql', '-XA', '-f', '-', '-d', $node_subscriber->connstr('postgres') ],
'<',
\$psql_subscriber{subscriber_stdin},
$psql_timeout);

2) Also we have changed the default timeout here, why is this change required:
 my $node_cascading_standby =
PostgreSQL::Test::Cluster->new('cascading_standby');
+my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
 my $default_timeout = $PostgreSQL::Test::Utils::timeout_default;
+my $psql_timeout    = IPC::Run::timer(2 * $default_timeout);

Regards,
Vignesh

#16Drouvot, Bertrand
bertranddrouvot.pg@gmail.com
In reply to: vignesh C (#15)
1 attachment(s)
Re: Add two missing tests in 035_standby_logical_decoding.pl

Hi,

On 4/26/23 6:06 AM, vignesh C wrote:

On Tue, 25 Apr 2023 at 12:51, Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
Thanks for the updated patch.
Few comments:

Thanks for looking at it!

1) subscriber_stdout and subscriber_stderr are not required for this
test case; we could remove them. I was able to remove those variables
and run the test successfully:
+$node_subscriber->start;
+
+my %psql_subscriber = (
+       'subscriber_stdin'  => '',
+       'subscriber_stdout' => '',
+       'subscriber_stderr' => '');
+$psql_subscriber{run} = IPC::Run::start(
+       [ 'psql', '-XA', '-f', '-', '-d',
$node_subscriber->connstr('postgres') ],
+       '<',
+       \$psql_subscriber{subscriber_stdin},
+       '>',
+       \$psql_subscriber{subscriber_stdout},
+       '2>',
+       \$psql_subscriber{subscriber_stderr},
+       $psql_timeout);

I ran it like:
my %psql_subscriber = (
'subscriber_stdin' => '');
$psql_subscriber{run} = IPC::Run::start(
[ 'psql', '-XA', '-f', '-', '-d', $node_subscriber->connstr('postgres') ],
'<',
\$psql_subscriber{subscriber_stdin},
$psql_timeout);

Not using all 3 std* handles is also the case, for example, in 021_row_visibility.pl
and 032_relfilenode_reuse.pl, where "stderr" is set but does not seem to be used.

I don't think it's a problem to keep them all, and I think it's better to have
them redirected to dedicated places.

2) Also we have changed the default timeout here, why is this change required:
my $node_cascading_standby =
PostgreSQL::Test::Cluster->new('cascading_standby');
+my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
my $default_timeout = $PostgreSQL::Test::Utils::timeout_default;
+my $psql_timeout    = IPC::Run::timer(2 * $default_timeout);

I think I used 021_row_visibility.pl as an example. But I agree there are
other .pl files that use timeout_default as the psql_timeout, and that
the default is enough in our case. So the default is used in V5 attached.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v5-0001-Add-subscribtion-to-the-standby-test-in-035_stand.patch (text/plain)
From 7d06ba14cce372b60e859b2274933b7c1a25b26a Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Mon, 24 Apr 2023 05:13:23 +0000
Subject: [PATCH v5] Add subscribtion to the standby test in
 035_standby_logical_decoding.pl

Adding one test, to verify that subscribtion to the standby is possible.
---
 src/test/perl/PostgreSQL/Test/Cluster.pm      | 11 ++-
 .../t/035_standby_logical_decoding.pl         | 91 ++++++++++++++++++-
 2 files changed, 99 insertions(+), 3 deletions(-)
   7.5% src/test/perl/PostgreSQL/Test/
  92.4% src/test/recovery/t/

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 6f7f4e5de4..819667d42a 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2644,7 +2644,16 @@ sub wait_for_catchup
 	}
 	if (!defined($target_lsn))
 	{
-		$target_lsn = $self->lsn('write');
+		my $isrecovery = $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
+		chomp($isrecovery);
+		if ($isrecovery eq 't')
+		{
+			$target_lsn = $self->lsn('replay');
+		}
+		else
+		{
+			$target_lsn = $self->lsn('write');
+		}
 	}
 	print "Waiting for replication conn "
 	  . $standby_name . "'s "
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index b8f5311fe9..f6d6447412 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -10,12 +10,17 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 
-my ($stdin, $stdout, $stderr, $cascading_stdout, $cascading_stderr, $ret, $handle, $slot);
+my ($stdin,             $stdout,            $stderr,
+	$cascading_stdout,  $cascading_stderr,  $subscriber_stdin,
+	$subscriber_stdout, $subscriber_stderr, $ret,
+	$handle,            $slot);
 
 my $node_primary = PostgreSQL::Test::Cluster->new('primary');
 my $node_standby = PostgreSQL::Test::Cluster->new('standby');
 my $node_cascading_standby = PostgreSQL::Test::Cluster->new('cascading_standby');
+my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
 my $default_timeout = $PostgreSQL::Test::Utils::timeout_default;
+my $psql_timeout    = IPC::Run::timer($default_timeout);
 my $res;
 
 # Name for the physical slot on primary
@@ -267,7 +272,8 @@ $node_standby->init_from_backup(
 	has_streaming => 1,
 	has_restoring => 1);
 $node_standby->append_conf('postgresql.conf',
-	qq[primary_slot_name = '$primary_slotname']);
+	qq[primary_slot_name = '$primary_slotname'
+       max_replication_slots = 5]);
 $node_standby->start;
 $node_primary->wait_for_replay_catchup($node_standby);
 $node_standby->safe_psql('testdb', qq[SELECT * FROM pg_create_physical_replication_slot('$standby_physical_slotname');]);
@@ -285,6 +291,26 @@ $node_cascading_standby->append_conf('postgresql.conf',
 $node_cascading_standby->start;
 $node_standby->wait_for_replay_catchup($node_cascading_standby, $node_primary);
 
+#######################
+# Initialize subscriber node
+#######################
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+my %psql_subscriber = (
+	'subscriber_stdin'  => '',
+	'subscriber_stdout' => '',
+	'subscriber_stderr' => '');
+$psql_subscriber{run} = IPC::Run::start(
+	[ 'psql', '-XA', '-f', '-', '-d', $node_subscriber->connstr('postgres') ],
+	'<',
+	\$psql_subscriber{subscriber_stdin},
+	'>',
+	\$psql_subscriber{subscriber_stdout},
+	'2>',
+	\$psql_subscriber{subscriber_stderr},
+	$psql_timeout);
+
 ##################################################
 # Test that logical decoding on the standby
 # behaves correctly.
@@ -365,6 +391,67 @@ is( $node_primary->psql(
     3,
     'replaying logical slot from another database fails');
 
+##################################################
+# Test that we can subscribe on the standby with the publication
+# created on the primary.
+##################################################
+
+# Create a table on the primary
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE tab_rep (a int primary key)");
+
+# Create a table (same structure) on the subscriber node
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_rep (a int primary key)");
+
+# Create a publication on the primary
+$node_primary->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub for table tab_rep");
+
+$node_primary->wait_for_replay_catchup($node_standby);
+
+# Subscribe on the standby
+my $standby_connstr = $node_standby->connstr . ' dbname=postgres';
+
+# Not using safe_psql() here as it would wait for activity on the primary
+# and we wouldn't be able to launch pg_log_standby_snapshot() on the primary
+# while waiting.
+# psql_subscriber() allows to not wait synchronously.
+$psql_subscriber{subscriber_stdin} .=
+  qq[CREATE SUBSCRIPTION tap_sub
+     CONNECTION '$standby_connstr'
+     PUBLICATION tap_pub
+     WITH (copy_data = off);];
+$psql_subscriber{subscriber_stdin} .= "\n";
+
+$psql_subscriber{run}->pump_nb();
+
+# Speed up the subscription creation
+$node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()");
+
+# Explicitly shut down psql instance gracefully - to avoid hangs
+# or worse on windows
+$psql_subscriber{subscriber_stdin} .= "\\q\n";
+$psql_subscriber{run}->finish;
+
+$node_subscriber->wait_for_subscription_sync($node_standby, 'tap_sub');
+
+# Insert some rows on the primary
+$node_primary->safe_psql('postgres',
+	qq[INSERT INTO tab_rep select generate_series(1,10);]);
+
+$node_primary->wait_for_replay_catchup($node_standby);
+$node_standby->wait_for_catchup('tap_sub');
+
+# Check that the subscriber can see the rows inserted in the primary
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_rep");
+is($result, qq(10), 'check replicated inserts after subscription on standby');
+
+# We do not need the subscription and the subscriber anymore
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+$node_subscriber->stop;
+
 ##################################################
 # Recovery conflict: Invalidate conflicting slots, including in-use slots
 # Scenario 1: hot_standby_feedback off and vacuum FULL
-- 
2.34.1

#17vignesh C
vignesh21@gmail.com
In reply to: Drouvot, Bertrand (#16)
Re: Add two missing tests in 035_standby_logical_decoding.pl

On Wed, 26 Apr 2023 at 13:45, Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

Hi,

On 4/26/23 6:06 AM, vignesh C wrote:

On Tue, 25 Apr 2023 at 12:51, Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
Thanks for the updated patch.
Few comments:

Thanks for looking at it!

1) subscriber_stdout and subscriber_stderr are not required for this
test case; we could remove them. I was able to remove those variables
and run the test successfully:
+$node_subscriber->start;
+
+my %psql_subscriber = (
+       'subscriber_stdin'  => '',
+       'subscriber_stdout' => '',
+       'subscriber_stderr' => '');
+$psql_subscriber{run} = IPC::Run::start(
+       [ 'psql', '-XA', '-f', '-', '-d',
$node_subscriber->connstr('postgres') ],
+       '<',
+       \$psql_subscriber{subscriber_stdin},
+       '>',
+       \$psql_subscriber{subscriber_stdout},
+       '2>',
+       \$psql_subscriber{subscriber_stderr},
+       $psql_timeout);

I ran it like:
my %psql_subscriber = (
'subscriber_stdin' => '');
$psql_subscriber{run} = IPC::Run::start(
[ 'psql', '-XA', '-f', '-', '-d', $node_subscriber->connstr('postgres') ],
'<',
\$psql_subscriber{subscriber_stdin},
$psql_timeout);

Not using all 3 std* handles is also the case, for example, in 021_row_visibility.pl
and 032_relfilenode_reuse.pl, where "stderr" is set but does not seem to be used.

I don't think it's a problem to keep them all, and I think it's better to have
them redirected to dedicated places.

Ok, that way it will be consistent with the other tests too.

2) Also we have changed the default timeout here, why is this change required:
my $node_cascading_standby =
PostgreSQL::Test::Cluster->new('cascading_standby');
+my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
my $default_timeout = $PostgreSQL::Test::Utils::timeout_default;
+my $psql_timeout    = IPC::Run::timer(2 * $default_timeout);

I think I used 021_row_visibility.pl as an example. But I agree there are
other .pl files that use timeout_default as the psql_timeout, and that
the default is enough in our case. So the default is used in V5 attached.

Thanks for fixing this.

There was one typo in the commit message: subscribtion should be
subscription. The rest of the changes look good to me:
Subject: [PATCH v5] Add subscribtion to the standby test in
035_standby_logical_decoding.pl

Adding one test, to verify that subscribtion to the standby is possible.

Regards,
Vignesh

#18Yu Shi (Fujitsu)
shiy.fnst@fujitsu.com
In reply to: Drouvot, Bertrand (#10)
RE: Add two missing tests in 035_standby_logical_decoding.pl

On Mon, Apr 24, 2023 8:07 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote:

On 4/24/23 11:45 AM, Amit Kapila wrote:

On Mon, Apr 24, 2023 at 11:54 AM Amit Kapila <amit.kapila16@gmail.com>

wrote:

On Mon, Apr 24, 2023 at 11:24 AM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

Few comments:
============

+# We can not test if the WAL file still exists immediately.
+# We need to let some time to the standby to actually "remove" it.
+my $i = 0;
+while (1)
+{
+ last if !-f $standby_walfile;
+ if ($i++ == 10 * $default_timeout)
+ {
+ die
+   "could not determine if WAL file has been retained or not, can't continue";
+ }
+ usleep(100_000);
+}

Is this adhoc wait required because we can't guarantee that the
checkpoint is complete on standby even after using wait_for_catchup?

Yes, the restart point on the standby is not necessary completed even after
wait_for_catchup is done.

I think that's because when replaying a checkpoint record, the startup process
of standby only saves the information of the checkpoint, and we need to wait for
the checkpointer to perform a restartpoint (see RecoveryRestartPoint), right? If
so, could we force a checkpoint on standby? After this, the standby should have
completed the restartpoint and we don't need to wait.
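
As a rough sketch of what that could look like in the test (not verified here,
and reusing the $node_standby and $standby_walfile names from the patch):

# Force a restartpoint on the standby so that old segments can be recycled,
# then check that the file linked to the invalidated slot is gone.
$node_standby->safe_psql('postgres', 'CHECKPOINT;');
ok(!-f $standby_walfile,
	"invalidated logical slots do not lead to retaining WAL");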

Besides, would it be better to wait for the cascading standby? If the wal log
file needed for cascading standby is removed on the standby, the subsequent test
will fail. Do we need to consider this scenario? I saw the following error
message after setting recovery_min_apply_delay to 5s on the cascading standby,
and the test failed due to a timeout while waiting for cascading standby.

Log of cascading standby node:
FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000000000003 has already been removed
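
A possible way to guard against that, reusing the helper already called
elsewhere in the test (again only a sketch):

# Make sure the cascading standby has replayed everything before the
# standby is allowed to recycle the segment.
$node_standby->wait_for_replay_catchup($node_cascading_standby, $node_primary);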

Regards,
Shi Yu

#19Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Drouvot, Bertrand (#16)
Re: Add two missing tests in 035_standby_logical_decoding.pl
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 6f7f4e5de4..819667d42a 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2644,7 +2644,16 @@ sub wait_for_catchup
}
if (!defined($target_lsn))
{
-		$target_lsn = $self->lsn('write');
+		my $isrecovery = $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
+		chomp($isrecovery);
+		if ($isrecovery eq 't')
+		{
+			$target_lsn = $self->lsn('replay');
+		}
+		else
+		{
+			$target_lsn = $self->lsn('write');
+		}

Please modify the function's documentation to account for this code change.

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"Porque Kim no hacía nada, pero, eso sí,
con extraordinario éxito" ("Kim", Kipling)

#20Drouvot, Bertrand
bertranddrouvot.pg@gmail.com
In reply to: Alvaro Herrera (#19)
1 attachment(s)
Re: Add two missing tests in 035_standby_logical_decoding.pl

Hi,

On 4/26/23 12:27 PM, Alvaro Herrera wrote:

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 6f7f4e5de4..819667d42a 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2644,7 +2644,16 @@ sub wait_for_catchup
}
if (!defined($target_lsn))
{
-		$target_lsn = $self->lsn('write');
+		my $isrecovery = $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
+		chomp($isrecovery);
+		if ($isrecovery eq 't')
+		{
+			$target_lsn = $self->lsn('replay');
+		}
+		else
+		{
+			$target_lsn = $self->lsn('write');
+		}

Please modify the function's documentation to account for this code change.

Good point, thanks! Done in V6 attached.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v6-0001-Add-subscription-to-the-standby-test-in-035_stand.patch (text/plain)
From 2a8b8bdc32671b33b2b7c0fa1a79f8d7624ae4a6 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Mon, 24 Apr 2023 05:13:23 +0000
Subject: [PATCH v6] Add subscription to the standby test in
 035_standby_logical_decoding.pl

Adding one test, to verify that subscription to the standby is possible.
---
 src/test/perl/PostgreSQL/Test/Cluster.pm      | 21 ++++-
 .../t/035_standby_logical_decoding.pl         | 91 ++++++++++++++++++-
 2 files changed, 107 insertions(+), 5 deletions(-)
  18.8% src/test/perl/PostgreSQL/Test/
  81.1% src/test/recovery/t/

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 6f7f4e5de4..9117554c07 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2611,8 +2611,14 @@ When doing physical replication, the standby is usually identified by
 passing its PostgreSQL::Test::Cluster instance.  When doing logical
 replication, standby_name identifies a subscription.
 
-The default value of target_lsn is $node->lsn('write'), which ensures
-that the standby has caught up to what has been committed on the primary.
+When not in recovery, the default value of target_lsn is $node->lsn('write'),
+which ensures that the standby has caught up to what has been committed on
+the primary.
+
+When in recovery, the default value of target_lsn is $node->lsn('replay')
+instead. This is needed when the publisher passed to wait_for_subscription_sync()
+is a standby node.
+
 If you pass an explicit value of target_lsn, it should almost always be
 the primary's write LSN; so this parameter is seldom needed except when
 querying some intermediate replication node rather than the primary.
@@ -2644,7 +2650,16 @@ sub wait_for_catchup
 	}
 	if (!defined($target_lsn))
 	{
-		$target_lsn = $self->lsn('write');
+		my $isrecovery = $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
+		chomp($isrecovery);
+		if ($isrecovery eq 't')
+		{
+			$target_lsn = $self->lsn('replay');
+		}
+		else
+		{
+			$target_lsn = $self->lsn('write');
+		}
 	}
 	print "Waiting for replication conn "
 	  . $standby_name . "'s "
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index b8f5311fe9..f6d6447412 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -10,12 +10,17 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 
-my ($stdin, $stdout, $stderr, $cascading_stdout, $cascading_stderr, $ret, $handle, $slot);
+my ($stdin,             $stdout,            $stderr,
+	$cascading_stdout,  $cascading_stderr,  $subscriber_stdin,
+	$subscriber_stdout, $subscriber_stderr, $ret,
+	$handle,            $slot);
 
 my $node_primary = PostgreSQL::Test::Cluster->new('primary');
 my $node_standby = PostgreSQL::Test::Cluster->new('standby');
 my $node_cascading_standby = PostgreSQL::Test::Cluster->new('cascading_standby');
+my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
 my $default_timeout = $PostgreSQL::Test::Utils::timeout_default;
+my $psql_timeout    = IPC::Run::timer($default_timeout);
 my $res;
 
 # Name for the physical slot on primary
@@ -267,7 +272,8 @@ $node_standby->init_from_backup(
 	has_streaming => 1,
 	has_restoring => 1);
 $node_standby->append_conf('postgresql.conf',
-	qq[primary_slot_name = '$primary_slotname']);
+	qq[primary_slot_name = '$primary_slotname'
+       max_replication_slots = 5]);
 $node_standby->start;
 $node_primary->wait_for_replay_catchup($node_standby);
 $node_standby->safe_psql('testdb', qq[SELECT * FROM pg_create_physical_replication_slot('$standby_physical_slotname');]);
@@ -285,6 +291,26 @@ $node_cascading_standby->append_conf('postgresql.conf',
 $node_cascading_standby->start;
 $node_standby->wait_for_replay_catchup($node_cascading_standby, $node_primary);
 
+#######################
+# Initialize subscriber node
+#######################
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+my %psql_subscriber = (
+	'subscriber_stdin'  => '',
+	'subscriber_stdout' => '',
+	'subscriber_stderr' => '');
+$psql_subscriber{run} = IPC::Run::start(
+	[ 'psql', '-XA', '-f', '-', '-d', $node_subscriber->connstr('postgres') ],
+	'<',
+	\$psql_subscriber{subscriber_stdin},
+	'>',
+	\$psql_subscriber{subscriber_stdout},
+	'2>',
+	\$psql_subscriber{subscriber_stderr},
+	$psql_timeout);
+
 ##################################################
 # Test that logical decoding on the standby
 # behaves correctly.
@@ -365,6 +391,67 @@ is( $node_primary->psql(
     3,
     'replaying logical slot from another database fails');
 
+##################################################
+# Test that we can subscribe on the standby with the publication
+# created on the primary.
+##################################################
+
+# Create a table on the primary
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE tab_rep (a int primary key)");
+
+# Create a table (same structure) on the subscriber node
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_rep (a int primary key)");
+
+# Create a publication on the primary
+$node_primary->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub for table tab_rep");
+
+$node_primary->wait_for_replay_catchup($node_standby);
+
+# Subscribe on the standby
+my $standby_connstr = $node_standby->connstr . ' dbname=postgres';
+
+# Not using safe_psql() here as it would wait for activity on the primary
+# and we wouldn't be able to launch pg_log_standby_snapshot() on the primary
+# while waiting.
+# psql_subscriber() allows to not wait synchronously.
+$psql_subscriber{subscriber_stdin} .=
+  qq[CREATE SUBSCRIPTION tap_sub
+     CONNECTION '$standby_connstr'
+     PUBLICATION tap_pub
+     WITH (copy_data = off);];
+$psql_subscriber{subscriber_stdin} .= "\n";
+
+$psql_subscriber{run}->pump_nb();
+
+# Speed up the subscription creation
+$node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()");
+
+# Explicitly shut down psql instance gracefully - to avoid hangs
+# or worse on windows
+$psql_subscriber{subscriber_stdin} .= "\\q\n";
+$psql_subscriber{run}->finish;
+
+$node_subscriber->wait_for_subscription_sync($node_standby, 'tap_sub');
+
+# Insert some rows on the primary
+$node_primary->safe_psql('postgres',
+	qq[INSERT INTO tab_rep select generate_series(1,10);]);
+
+$node_primary->wait_for_replay_catchup($node_standby);
+$node_standby->wait_for_catchup('tap_sub');
+
+# Check that the subscriber can see the rows inserted in the primary
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_rep");
+is($result, qq(10), 'check replicated inserts after subscription on standby');
+
+# We do not need the subscription and the subscriber anymore
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+$node_subscriber->stop;
+
 ##################################################
 # Recovery conflict: Invalidate conflicting slots, including in-use slots
 # Scenario 1: hot_standby_feedback off and vacuum FULL
-- 
2.34.1

#21Drouvot, Bertrand
bertranddrouvot.pg@gmail.com
In reply to: vignesh C (#17)
Re: Add two missing tests in 035_standby_logical_decoding.pl

Hi,

On 4/26/23 11:12 AM, vignesh C wrote:

On Wed, 26 Apr 2023 at 13:45, Drouvot, Bertrand

There was one typo in the commit message, subscribtion should be
subscription, the rest of the changes looks good to me:
Subject: [PATCH v5] Add subscribtion to the standby test in
035_standby_logical_decoding.pl

Adding one test, to verify that subscribtion to the standby is possible.

Oops, at least I repeated it twice ;-)
Fixed in V6 that I just shared up-thread.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#22Drouvot, Bertrand
bertranddrouvot.pg@gmail.com
In reply to: Yu Shi (Fujitsu) (#18)
1 attachment(s)
Re: Add two missing tests in 035_standby_logical_decoding.pl

Hi,

On 4/26/23 11:58 AM, Yu Shi (Fujitsu) wrote:

On Mon, Apr 24, 2023 8:07 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote:

I think that's because when replaying a checkpoint record, the startup process
of standby only saves the information of the checkpoint, and we need to wait for
the checkpointer to perform a restartpoint (see RecoveryRestartPoint), right? If
so, could we force a checkpoint on standby? After this, the standby should have
completed the restartpoint and we don't need to wait.

Thanks for looking at it!

Oh right, that looks like a good way to ensure the WAL file is removed on the standby
so that we don't need to wait.

Implemented that way in V6 attached and that works fine.

Besides, would it be better to wait for the cascading standby? If the wal log
file needed for cascading standby is removed on the standby, the subsequent test
will fail.

Good catch! I agree that we have to wait on the cascading standby before removing
the WAL files. It's done in V6 (and the test is not failing anymore if we set a
recovery_min_apply_delay to 5s on the cascading standby).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v6-0001-Add-retained-WAL-test-in-035_standby_logical_deco.patch (text/plain)
From 79f554eaf8185a2d34dc48ba31f1a3b3cd09c185 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Tue, 25 Apr 2023 06:02:17 +0000
Subject: [PATCH v6] Add retained WAL test in 035_standby_logical_decoding.pl

Adding one test, to verify that invalidated logical slots do not lead to
retaining WAL.
---
 .../t/035_standby_logical_decoding.pl         | 69 ++++++++++++++++++-
 1 file changed, 67 insertions(+), 2 deletions(-)
 100.0% src/test/recovery/t/

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index f6d6447412..ea9ca46995 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -495,9 +495,74 @@ $node_standby->restart;
 check_slots_conflicting_status(1);
 
 ##################################################
-# Verify that invalidated logical slots do not lead to retaining WAL
+# Verify that invalidated logical slots do not lead to retaining WAL.
 ##################################################
-# XXXXX TODO
+
+# Before removing WAL files, ensure the cascading standby catch up
+$node_standby->wait_for_replay_catchup($node_cascading_standby, $node_primary);
+
+# Get the restart_lsn from an invalidated slot
+my $restart_lsn = $node_standby->safe_psql('postgres',
+	"SELECT restart_lsn from pg_replication_slots WHERE slot_name = 'vacuum_full_activeslot' and conflicting is true;"
+);
+
+chomp($restart_lsn);
+
+# Get the WAL file name associated to this lsn on the primary
+my $walfile_name = $node_primary->safe_psql('postgres',
+	"SELECT pg_walfile_name('$restart_lsn')");
+
+chomp($walfile_name);
+
+# Check the WAL file is still on the primary
+ok(-f $node_primary->data_dir . '/pg_wal/' . $walfile_name,
+	"WAL file still on the primary");
+
+# Get the number of WAL files on the standby
+my $nb_standby_files = $node_standby->safe_psql('postgres',
+	"SELECT COUNT(*) FROM pg_ls_dir('pg_wal')");
+
+chomp($nb_standby_files);
+
+# Switch WAL files on the primary
+my @c = (1 .. $nb_standby_files);
+
+$node_primary->safe_psql('postgres', "create table retain_test(a int)");
+
+for (@c)
+{
+	$node_primary->safe_psql(
+		'postgres', "SELECT pg_switch_wal();
+                                          insert into retain_test values("
+		  . $_ . ");");
+}
+
+# Ask for a checkpoint
+$node_primary->safe_psql('postgres', 'checkpoint;');
+
+# Check that the WAL file has not been retained on the primary
+ok(!-f $node_primary->data_dir . '/pg_wal/' . $walfile_name,
+	"WAL file not on the primary anymore");
+
+# Wait for the standby to catch up
+$node_primary->wait_for_catchup($node_standby);
+
+# Generate another WAL switch, more activity and a checkpoint
+$node_primary->safe_psql(
+	'postgres', "SELECT pg_switch_wal();
+                                         insert into retain_test values(1);");
+$node_primary->safe_psql('postgres', 'checkpoint;');
+
+# Wait for the standby to catch up
+$node_primary->wait_for_catchup($node_standby);
+
+# Request a checkpoint on the standby to trigger the WAL file(s) removal
+$node_standby->safe_psql('postgres', 'checkpoint;');
+
+# Verify that the wal file has not been retained on the standby
+my $standby_walfile = $node_standby->data_dir . '/pg_wal/' . $walfile_name;
+ok( !-f "$standby_walfile",
+    "invalidated logical slots do not lead to retaining WAL");
 
 ##################################################
 # Recovery conflict: Invalidate conflicting slots, including in-use slots
-- 
2.34.1

#23Amit Kapila
amit.kapila16@gmail.com
In reply to: Drouvot, Bertrand (#20)
Re: Add two missing tests in 035_standby_logical_decoding.pl

On Wed, Apr 26, 2023 at 4:41 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

On 4/26/23 12:27 PM, Alvaro Herrera wrote:

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 6f7f4e5de4..819667d42a 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2644,7 +2644,16 @@ sub wait_for_catchup
}
if (!defined($target_lsn))
{
-            $target_lsn = $self->lsn('write');
+            my $isrecovery = $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
+            chomp($isrecovery);
+            if ($isrecovery eq 't')
+            {
+                    $target_lsn = $self->lsn('replay');
+            }
+            else
+            {
+                    $target_lsn = $self->lsn('write');
+            }

Please modify the function's documentation to account for this code change.

Good point, thanks! Done in V6 attached.

+When in recovery, the default value of target_lsn is $node->lsn('replay')
+instead. This is needed when the publisher passed to
wait_for_subscription_sync()
+is a standby node.

I think this will be useful whenever wait_for_catchup has been called
for a standby node (where self is a standby node). I have tried even
by commenting wait_for_subscription_sync in the new test then it fails
for $node_standby->wait_for_catchup('tap_sub');. So instead, how about
a comment like: "When in recovery, the default value of target_lsn is
$node->lsn('replay') instead which ensures that the cascaded standby
has caught up to what has been replayed on the standby."?

--
With Regards,
Amit Kapila.

#24Drouvot, Bertrand
bertranddrouvot.pg@gmail.com
In reply to: Amit Kapila (#23)
1 attachment(s)
Re: Add two missing tests in 035_standby_logical_decoding.pl

Hi,

On 4/27/23 5:37 AM, Amit Kapila wrote:

On Wed, Apr 26, 2023 at 4:41 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

+When in recovery, the default value of target_lsn is $node->lsn('replay')
+instead. This is needed when the publisher passed to
wait_for_subscription_sync()
+is a standby node.

I think this will be useful whenever wait_for_catchup has been called
for a standby node (where self is a standby node). I have tried even
by commenting wait_for_subscription_sync in the new test then it fails
for $node_standby->wait_for_catchup('tap_sub');. So instead, how about
a comment like: "When in recovery, the default value of target_lsn is
$node->lsn('replay') instead which ensures that the cascaded standby
has caught up to what has been replayed on the standby."?

I did it that way because wait_for_subscription_sync() was the first case I had
to work on, but I do agree that your wording better describes the intent of the
new code.

Changed in V7 attached.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v7-0001-Add-subscription-to-the-standby-test-in-035_stand.patch (text/plain)
From af8ab8460a70fe9b70836be021a1a98b2ba6ddd3 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Mon, 24 Apr 2023 05:13:23 +0000
Subject: [PATCH v7] Add subscription to the standby test in
 035_standby_logical_decoding.pl

Adding one test, to verify that subscription to the standby is possible.
---
 src/test/perl/PostgreSQL/Test/Cluster.pm      | 21 ++++-
 .../t/035_standby_logical_decoding.pl         | 91 ++++++++++++++++++-
 2 files changed, 107 insertions(+), 5 deletions(-)
  18.9% src/test/perl/PostgreSQL/Test/
  81.0% src/test/recovery/t/

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 6f7f4e5de4..bc9b5dc644 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2611,8 +2611,14 @@ When doing physical replication, the standby is usually identified by
 passing its PostgreSQL::Test::Cluster instance.  When doing logical
 replication, standby_name identifies a subscription.
 
-The default value of target_lsn is $node->lsn('write'), which ensures
-that the standby has caught up to what has been committed on the primary.
+When not in recovery, the default value of target_lsn is $node->lsn('write'),
+which ensures that the standby has caught up to what has been committed on
+the primary.
+
+When in recovery, the default value of target_lsn is $node->lsn('replay')
+instead which ensures that the cascaded standby has caught up to what has been
+replayed on the standby.
+
 If you pass an explicit value of target_lsn, it should almost always be
 the primary's write LSN; so this parameter is seldom needed except when
 querying some intermediate replication node rather than the primary.
@@ -2644,7 +2650,16 @@ sub wait_for_catchup
 	}
 	if (!defined($target_lsn))
 	{
-		$target_lsn = $self->lsn('write');
+		my $isrecovery = $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
+		chomp($isrecovery);
+		if ($isrecovery eq 't')
+		{
+			$target_lsn = $self->lsn('replay');
+		}
+		else
+		{
+			$target_lsn = $self->lsn('write');
+		}
 	}
 	print "Waiting for replication conn "
 	  . $standby_name . "'s "
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index b8f5311fe9..f6d6447412 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -10,12 +10,17 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 
-my ($stdin, $stdout, $stderr, $cascading_stdout, $cascading_stderr, $ret, $handle, $slot);
+my ($stdin,             $stdout,            $stderr,
+	$cascading_stdout,  $cascading_stderr,  $subscriber_stdin,
+	$subscriber_stdout, $subscriber_stderr, $ret,
+	$handle,            $slot);
 
 my $node_primary = PostgreSQL::Test::Cluster->new('primary');
 my $node_standby = PostgreSQL::Test::Cluster->new('standby');
 my $node_cascading_standby = PostgreSQL::Test::Cluster->new('cascading_standby');
+my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
 my $default_timeout = $PostgreSQL::Test::Utils::timeout_default;
+my $psql_timeout    = IPC::Run::timer($default_timeout);
 my $res;
 
 # Name for the physical slot on primary
@@ -267,7 +272,8 @@ $node_standby->init_from_backup(
 	has_streaming => 1,
 	has_restoring => 1);
 $node_standby->append_conf('postgresql.conf',
-	qq[primary_slot_name = '$primary_slotname']);
+	qq[primary_slot_name = '$primary_slotname'
+       max_replication_slots = 5]);
 $node_standby->start;
 $node_primary->wait_for_replay_catchup($node_standby);
 $node_standby->safe_psql('testdb', qq[SELECT * FROM pg_create_physical_replication_slot('$standby_physical_slotname');]);
@@ -285,6 +291,26 @@ $node_cascading_standby->append_conf('postgresql.conf',
 $node_cascading_standby->start;
 $node_standby->wait_for_replay_catchup($node_cascading_standby, $node_primary);
 
+#######################
+# Initialize subscriber node
+#######################
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+my %psql_subscriber = (
+	'subscriber_stdin'  => '',
+	'subscriber_stdout' => '',
+	'subscriber_stderr' => '');
+$psql_subscriber{run} = IPC::Run::start(
+	[ 'psql', '-XA', '-f', '-', '-d', $node_subscriber->connstr('postgres') ],
+	'<',
+	\$psql_subscriber{subscriber_stdin},
+	'>',
+	\$psql_subscriber{subscriber_stdout},
+	'2>',
+	\$psql_subscriber{subscriber_stderr},
+	$psql_timeout);
+
 ##################################################
 # Test that logical decoding on the standby
 # behaves correctly.
@@ -365,6 +391,67 @@ is( $node_primary->psql(
     3,
     'replaying logical slot from another database fails');
 
+##################################################
+# Test that we can subscribe on the standby with the publication
+# created on the primary.
+##################################################
+
+# Create a table on the primary
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE tab_rep (a int primary key)");
+
+# Create a table (same structure) on the subscriber node
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_rep (a int primary key)");
+
+# Create a publication on the primary
+$node_primary->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub for table tab_rep");
+
+$node_primary->wait_for_replay_catchup($node_standby);
+
+# Subscribe on the standby
+my $standby_connstr = $node_standby->connstr . ' dbname=postgres';
+
+# Not using safe_psql() here as it would wait for activity on the primary
+# and we wouldn't be able to launch pg_log_standby_snapshot() on the primary
+# while waiting.
+# psql_subscriber() allows to not wait synchronously.
+$psql_subscriber{subscriber_stdin} .=
+  qq[CREATE SUBSCRIPTION tap_sub
+     CONNECTION '$standby_connstr'
+     PUBLICATION tap_pub
+     WITH (copy_data = off);];
+$psql_subscriber{subscriber_stdin} .= "\n";
+
+$psql_subscriber{run}->pump_nb();
+
+# Speed up the subscription creation
+$node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()");
+
+# Explicitly shut down psql instance gracefully - to avoid hangs
+# or worse on windows
+$psql_subscriber{subscriber_stdin} .= "\\q\n";
+$psql_subscriber{run}->finish;
+
+$node_subscriber->wait_for_subscription_sync($node_standby, 'tap_sub');
+
+# Insert some rows on the primary
+$node_primary->safe_psql('postgres',
+	qq[INSERT INTO tab_rep select generate_series(1,10);]);
+
+$node_primary->wait_for_replay_catchup($node_standby);
+$node_standby->wait_for_catchup('tap_sub');
+
+# Check that the subscriber can see the rows inserted in the primary
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_rep");
+is($result, qq(10), 'check replicated inserts after subscription on standby');
+
+# We do not need the subscription and the subscriber anymore
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+$node_subscriber->stop;
+
 ##################################################
 # Recovery conflict: Invalidate conflicting slots, including in-use slots
 # Scenario 1: hot_standby_feedback off and vacuum FULL
-- 
2.34.1

#25Amit Kapila
amit.kapila16@gmail.com
In reply to: Drouvot, Bertrand (#24)
Re: Add two missing tests in 035_standby_logical_decoding.pl

On Thu, Apr 27, 2023 at 1:05 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

On 4/27/23 5:37 AM, Amit Kapila wrote:

On Wed, Apr 26, 2023 at 4:41 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

+When in recovery, the default value of target_lsn is $node->lsn('replay')
+instead. This is needed when the publisher passed to
wait_for_subscription_sync()
+is a standby node.

I think this will be useful whenever wait_for_catchup has been called
for a standby node (where self is a standby node). I have tried even
by commenting wait_for_subscription_sync in the new test then it fails
for $node_standby->wait_for_catchup('tap_sub');. So instead, how about
a comment like: "When in recovery, the default value of target_lsn is
$node->lsn('replay') instead which ensures that the cascaded standby
has caught up to what has been replayed on the standby."?

I did it that way because wait_for_subscription_sync() was the first case I had
to work on, but I do agree that your wording better describes the intent of the
new code.

Changed in V7 attached.

Pushed.

--
With Regards,
Amit Kapila.

#26Drouvot, Bertrand
bertranddrouvot.pg@gmail.com
In reply to: Amit Kapila (#25)
Re: Add two missing tests in 035_standby_logical_decoding.pl

On 4/27/23 11:54 AM, Amit Kapila wrote:

On Thu, Apr 27, 2023 at 1:05 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

On 4/27/23 5:37 AM, Amit Kapila wrote:

On Wed, Apr 26, 2023 at 4:41 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

+When in recovery, the default value of target_lsn is $node->lsn('replay')
+instead. This is needed when the publisher passed to
wait_for_subscription_sync()
+is a standby node.

I think this will be useful whenever wait_for_catchup has been called
for a standby node (where self is a standby node). I have tried even
by commenting wait_for_subscription_sync in the new test then it fails
for $node_standby->wait_for_catchup('tap_sub');. So instead, how about
a comment like: "When in recovery, the default value of target_lsn is
$node->lsn('replay') instead which ensures that the cascaded standby
has caught up to what has been replayed on the standby."?

I did it that way because wait_for_subscription_sync() was the first case I had
to work on, but I do agree that your wording better describes the intent of the
new code.

Changed in V7 attached.

Pushed.

Thanks!

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#27Amit Kapila
amit.kapila16@gmail.com
In reply to: Drouvot, Bertrand (#22)
Re: Add two missing tests in 035_standby_logical_decoding.pl

On Wed, Apr 26, 2023 at 7:53 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

Besides, would it be better to wait for the cascading standby? If the wal log
file needed for cascading standby is removed on the standby, the subsequent test
will fail.

Good catch! I agree that we have to wait on the cascading standby before removing
the WAL files. It's done in V6 (and the test is not failing anymore if we set a
recovery_min_apply_delay to 5s on the cascading standby).

+# Get the restart_lsn from an invalidated slot
+my $restart_lsn = $node_standby->safe_psql('postgres',
+ "SELECT restart_lsn from pg_replication_slots WHERE slot_name =
'vacuum_full_activeslot' and conflicting is true;"
+);
+
+chomp($restart_lsn);
+
+# Get the WAL file name associated to this lsn on the primary
+my $walfile_name = $node_primary->safe_psql('postgres',
+ "SELECT pg_walfile_name('$restart_lsn')");
+
+chomp($walfile_name);
+
+# Check the WAL file is still on the primary
+ok(-f $node_primary->data_dir . '/pg_wal/' . $walfile_name,
+ "WAL file still on the primary");

How is it guaranteed that the WAL file corresponding to the
invalidated slot on standby will still be present on primary? Can you
please explain the logic behind this test a bit more like how the WAL
file switch helps you to achieve the purpose?

--
With Regards,
Amit Kapila.

#28Drouvot, Bertrand
bertranddrouvot.pg@gmail.com
In reply to: Amit Kapila (#27)
1 attachment(s)
Re: Add two missing tests in 035_standby_logical_decoding.pl

Hi,

On 4/28/23 5:55 AM, Amit Kapila wrote:

On Wed, Apr 26, 2023 at 7:53 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

+# Get the restart_lsn from an invalidated slot
+my $restart_lsn = $node_standby->safe_psql('postgres',
+ "SELECT restart_lsn from pg_replication_slots WHERE slot_name =
'vacuum_full_activeslot' and conflicting is true;"
+);
+
+chomp($restart_lsn);
+
+# Get the WAL file name associated to this lsn on the primary
+my $walfile_name = $node_primary->safe_psql('postgres',
+ "SELECT pg_walfile_name('$restart_lsn')");
+
+chomp($walfile_name);
+
+# Check the WAL file is still on the primary
+ok(-f $node_primary->data_dir . '/pg_wal/' . $walfile_name,
+ "WAL file still on the primary");

How is it guaranteed that the WAL file corresponding to the
invalidated slot on standby will still be present on primary?

The slot(s) have been invalidated by the "vacuum full" test just above
this one. So I think the WAL file we are looking for is the last one being used
by the primary. As no activity happened on it since the vacuum full, it looks to
me that it should still be present.

But I may have missed something, and maybe there's no guarantee that this WAL file is still there in all cases.
In that case I think it's better to remove this test (it does not provide added value here).

Test removed in V7 attached.

Can you
please explain the logic behind this test a bit more like how the WAL
file switch helps you to achieve the purpose?

The idea was to generate enough WAL switches on the primary to ensure
the WAL file has been removed.

I gave it some more thought, and I think we can skip the check that the WAL file is
not on the primary anymore. That way, one WAL switch seems to be enough
to see it removed on the standby.

It's done in V7.

V7 does not do more tests than necessary, and I think it's probably better like this.

I can see V7 failing on "Cirrus CI / macOS - Ventura - Meson" only (other machines are not complaining).

It does fail on "invalidated logical slots do not lead to retaining WAL", see https://cirrus-ci.com/task/4518083541336064

I'm not sure why it is failing, any idea?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v7-0001-Add-a-test-to-verify-that-invalidated-logical-slo.patch (text/plain)
From 2ab08214415023505244c954a6a5ebf42ec9aebb Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Fri, 28 Apr 2023 07:27:20 +0000
Subject: [PATCH v7] Add a test to verify that invalidated logical slots do not
 lead to retaining WAL.

---
 .../t/035_standby_logical_decoding.pl         | 39 ++++++++++++++++++-
 1 file changed, 37 insertions(+), 2 deletions(-)
 100.0% src/test/recovery/t/

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 66d264f230..b32c1002b0 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -500,9 +500,44 @@ $node_standby->restart;
 check_slots_conflicting_status(1);
 
 ##################################################
-# Verify that invalidated logical slots do not lead to retaining WAL
+# Verify that invalidated logical slots do not lead to retaining WAL.
 ##################################################
-# XXXXX TODO
+
+# Before removing WAL file(s), ensure the cascading standby catch up
+$node_standby->wait_for_replay_catchup($node_cascading_standby,
+	$node_primary);
+
+# Get the restart_lsn from an invalidated slot
+my $restart_lsn = $node_standby->safe_psql('postgres',
+	"SELECT restart_lsn from pg_replication_slots WHERE slot_name = 'vacuum_full_activeslot' and conflicting is true;"
+);
+
+chomp($restart_lsn);
+
+# As pg_walfile_name() can not be executed on the standby,
+# get the WAL file name associated to this lsn from the primary
+my $walfile_name = $node_primary->safe_psql('postgres',
+	"SELECT pg_walfile_name('$restart_lsn')");
+
+chomp($walfile_name);
+
+# Generate some activity and switch WAL file on the primary
+$node_primary->safe_psql(
+	'postgres', "create table retain_test(a int);
+									 insert into retain_test values(1);
+									 select pg_switch_wal();
+									 checkpoint;");
+
+# Wait for the standby to catch up
+$node_primary->wait_for_catchup($node_standby);
+
+# Request a checkpoint on the standby to trigger the WAL file(s) removal
+$node_standby->safe_psql('postgres', 'checkpoint;');
+
+# Verify that the wal file has not been retained on the standby
+my $standby_walfile = $node_standby->data_dir . '/pg_wal/' . $walfile_name;
+ok(!-f "$standby_walfile",
+	"invalidated logical slots do not lead to retaining WAL");
 
 ##################################################
 # Recovery conflict: Invalidate conflicting slots, including in-use slots
-- 
2.34.1

#29Amit Kapila
amit.kapila16@gmail.com
In reply to: Drouvot, Bertrand (#28)
Re: Add two missing tests in 035_standby_logical_decoding.pl

On Fri, Apr 28, 2023 at 2:24 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

Can you
please explain the logic behind this test a bit more like how the WAL
file switch helps you to achieve the purpose?

The idea was to generate enough WAL switches on the primary to ensure
the WAL file has been removed.

I gave it some more thought, and I think we can skip the check that the WAL file is
not on the primary anymore. That way, one WAL switch seems to be enough
to see it removed on the standby.

It's done in V7.

V7 does not do more tests than necessary, and I think it's probably better like this.

I can see V7 failing on "Cirrus CI / macOS - Ventura - Meson" only (other machines are not complaining).

It does fail on "invalidated logical slots do not lead to retaining WAL", see https://cirrus-ci.com/task/4518083541336064

I'm not sure why it is failing, any idea?

I think the reason for the failure is that on standby, the test is not
able to remove the file corresponding to the invalid slot. You are
using pg_switch_wal() to generate a switch record and I think you need
one more WAL-generating statement after that to achieve your purpose,
which is that, during checkpoint, the test removes the WAL file
corresponding to an invalid slot. Just doing a checkpoint on the primary may
not serve the need, as that doesn't lead to any new insertion of WAL on
the standby. Is your v6 failing in the same environment? If not, it is
probably because the test does the insert after pg_switch_wal() in that
version. Why did you change the order of the insert in v7?
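
To make the suggested ordering concrete, something along these lines (a sketch
only, reusing the retain_test table from the patch) should leave new WAL after
the switch record for the standby's checkpoint to act on:

$node_primary->safe_psql(
	'postgres', "SELECT pg_switch_wal();
	             INSERT INTO retain_test VALUES (1);");
$node_primary->safe_psql('postgres', 'CHECKPOINT;');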

BTW, you can confirm the failure by changing the DEBUG2 message in
RemoveOldXlogFiles() to LOG. In the case, where the test fails, it may
not remove the WAL file corresponding to an invalid slot whereas it
will remove the WAL file when the test succeeds.
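
For reference, a minimal sketch of that instrumentation (this assumes the
message is the existing elog(DEBUG2, ...) call in RemoveOldXlogFiles() in
xlog.c; the exact variable names and placement may differ):

    /* Temporarily raised from DEBUG2 to LOG so the removal horizon shows
     * up in the regular server log during the investigation. */
    elog(LOG, "attempting to remove WAL segments older than log file %s",
         lastoff);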

--
With Regards,
Amit Kapila.

#30Drouvot, Bertrand
bertranddrouvot.pg@gmail.com
In reply to: Amit Kapila (#29)
1 attachment(s)
Re: Add two missing tests in 035_standby_logical_decoding.pl

Hi,

On 5/2/23 8:28 AM, Amit Kapila wrote:

On Fri, Apr 28, 2023 at 2:24 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

I can see V7 failing on "Cirrus CI / macOS - Ventura - Meson" only (other machines are not complaining).

It does fail on "invalidated logical slots do not lead to retaining WAL", see https://cirrus-ci.com/task/4518083541336064

I'm not sure why it is failing, any idea?

I think the reason for the failure is that on standby, the test is not
able to remove the file corresponding to the invalid slot. You are
using pg_switch_wal() to generate a switch record and I think you need
one more WAL-generating statement after that to achieve your purpose
which is that during checkpoint, the test removes the WAL file
corresponding to an invalid slot. Just doing checkpoint on primary may
not serve the need as that doesn't lead to any new insertion of WAL on
standby. Is your v6 failing in the same environment?

Thanks for the feedback!

No V6 was working fine.

If not, then it
is probably due to the reason that the test is doing insert after
pg_switch_wal() in that version. Why did you change the order of
insert in v7?

I thought doing the insert before the switch was OK and, as my local test
was running fine, I did not reconsider the ordering.

BTW, you can confirm the failure by changing the DEBUG2 message in
RemoveOldXlogFiles() to LOG. In the case, where the test fails, it may
not remove the WAL file corresponding to an invalid slot whereas it
will remove the WAL file when the test succeeds.

Yeah, I added more debug information and what I can see is that the WAL file
we want to see removed is "000000010000000000000003" while the standby emits:

"
2023-05-02 10:03:28.351 UTC [16971][checkpointer] LOG: attempting to remove WAL segments older than log file 000000000000000000000002
2023-05-02 10:03:28.351 UTC [16971][checkpointer] LOG: recycled write-ahead log file "000000010000000000000002"
"

As per your suggestion, changing the insert ordering (like in V8 attached) makes it now work on the failing environment too.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v8-0001-Add-a-test-to-verify-that-invalidated-logical-slo.patchtext/plain; charset=UTF-8; name=v8-0001-Add-a-test-to-verify-that-invalidated-logical-slo.patchDownload
From defaeb000b09eab9ba7b5d08c032f81b5bd72a53 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Fri, 28 Apr 2023 07:27:20 +0000
Subject: [PATCH v8] Add a test to verify that invalidated logical slots do not
 lead to retaining WAL.

---
 .../t/035_standby_logical_decoding.pl         | 39 ++++++++++++++++++-
 1 file changed, 37 insertions(+), 2 deletions(-)
 100.0% src/test/recovery/t/

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 66d264f230..a6b640d7fd 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -500,9 +500,44 @@ $node_standby->restart;
 check_slots_conflicting_status(1);
 
 ##################################################
-# Verify that invalidated logical slots do not lead to retaining WAL
+# Verify that invalidated logical slots do not lead to retaining WAL.
 ##################################################
-# XXXXX TODO
+
+# Before removing WAL file(s), ensure the cascading standby catch up
+$node_standby->wait_for_replay_catchup($node_cascading_standby,
+	$node_primary);
+
+# Get the restart_lsn from an invalidated slot
+my $restart_lsn = $node_standby->safe_psql('postgres',
+	"SELECT restart_lsn from pg_replication_slots WHERE slot_name = 'vacuum_full_activeslot' and conflicting is true;"
+);
+
+chomp($restart_lsn);
+
+# As pg_walfile_name() can not be executed on the standby,
+# get the WAL file name associated to this lsn from the primary
+my $walfile_name = $node_primary->safe_psql('postgres',
+	"SELECT pg_walfile_name('$restart_lsn')");
+
+chomp($walfile_name);
+
+# Generate some activity and switch WAL file on the primary
+$node_primary->safe_psql(
+	'postgres', "create table retain_test(a int);
+									 select pg_switch_wal();
+									 insert into retain_test values(1);
+									 checkpoint;");
+
+# Wait for the standby to catch up
+$node_primary->wait_for_catchup($node_standby);
+
+# Request a checkpoint on the standby to trigger the WAL file(s) removal
+$node_standby->safe_psql('postgres', 'checkpoint;');
+
+# Verify that the wal file has not been retained on the standby
+my $standby_walfile = $node_standby->data_dir . '/pg_wal/' . $walfile_name;
+ok(!-f "$standby_walfile",
+	"invalidated logical slots do not lead to retaining WAL");
 
 ##################################################
 # Recovery conflict: Invalidate conflicting slots, including in-use slots
-- 
2.34.1

#31Amit Kapila
amit.kapila16@gmail.com
In reply to: Drouvot, Bertrand (#30)
1 attachment(s)
Re: Add two missing tests in 035_standby_logical_decoding.pl

On Tue, May 2, 2023 at 4:52 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

As per your suggestion, changing the insert ordering (like in V8 attached) makes it now work on the failing environment too.

I think it is better to use wait_for_replay_catchup() to wait for
standby to catch up. I have changed that and a comment in the
attached. I'll push this tomorrow unless there are further comments.

--
With Regards,
Amit Kapila.

Attachments:

v9-0001-Test-that-invalidated-logical-slots-doesn-t-retai.patchapplication/octet-stream; name=v9-0001-Test-that-invalidated-logical-slots-doesn-t-retai.patchDownload
From 78e14e490208e4777bb6e4f8ea69a121e9bb432e Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Wed, 3 May 2023 15:14:53 +0530
Subject: [PATCH v9] Test that invalidated logical slots doesn't retain WAL.

Author: Bertrand Drouvot
Reviewed-by: Shi yu, Amit Kapila
Discussion: https://postgr.es/m/2fefa454-5a70-2174-ddbf-4a0e41537139@gmail.com
---
 .../t/035_standby_logical_decoding.pl         | 38 ++++++++++++++++++-
 1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index b8f5311fe9..df92279c8f 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -408,9 +408,43 @@ $node_standby->restart;
 check_slots_conflicting_status(1);
 
 ##################################################
-# Verify that invalidated logical slots do not lead to retaining WAL
+# Verify that invalidated logical slots do not lead to retaining WAL.
 ##################################################
-# XXXXX TODO
+
+# Wait for the cascading standby to catchup before removing the WAL file(s)
+$node_standby->wait_for_replay_catchup($node_cascading_standby, $node_primary);
+
+# Get the restart_lsn from an invalidated slot
+my $restart_lsn = $node_standby->safe_psql('postgres',
+	"SELECT restart_lsn from pg_replication_slots WHERE slot_name = 'vacuum_full_activeslot' and conflicting is true;"
+);
+
+chomp($restart_lsn);
+
+# As pg_walfile_name() can not be executed on the standby,
+# get the WAL file name associated to this lsn from the primary
+my $walfile_name = $node_primary->safe_psql('postgres',
+	"SELECT pg_walfile_name('$restart_lsn')");
+
+chomp($walfile_name);
+
+# Generate some activity and switch WAL file on the primary
+$node_primary->safe_psql(
+	'postgres', "create table retain_test(a int);
+									 select pg_switch_wal();
+									 insert into retain_test values(1);
+									 checkpoint;");
+
+# Wait for the standby to catch up
+$node_primary->wait_for_replay_catchup($node_standby);
+
+# Request a checkpoint on the standby to trigger the WAL file(s) removal
+$node_standby->safe_psql('postgres', 'checkpoint;');
+
+# Verify that the wal file has not been retained on the standby
+my $standby_walfile = $node_standby->data_dir . '/pg_wal/' . $walfile_name;
+ok(!-f "$standby_walfile",
+	"invalidated logical slots do not lead to retaining WAL");
 
 ##################################################
 # Recovery conflict: Invalidate conflicting slots, including in-use slots
-- 
2.28.0.windows.1

#32Drouvot, Bertrand
bertranddrouvot.pg@gmail.com
In reply to: Amit Kapila (#31)
Re: Add two missing tests in 035_standby_logical_decoding.pl

Hi,

On 5/3/23 12:29 PM, Amit Kapila wrote:

On Tue, May 2, 2023 at 4:52 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

As per your suggestion, changing the insert ordering (like in V8 attached) makes it now work on the failing environment too.

I think it is better to use wait_for_replay_catchup() to wait for
standby to catch up.

Oh right, that's a discussion we already had in [1]/messages/by-id/acbac69e-9ae8-c546-3216-8ecb38e7a93d@gmail.com, I should have thought about it.

I have changed that and a comment in the
attached. I'll push this tomorrow unless there are further comments.

LGTM, thanks!

[1]: /messages/by-id/acbac69e-9ae8-c546-3216-8ecb38e7a93d@gmail.com

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#33vignesh C
vignesh21@gmail.com
In reply to: Amit Kapila (#31)
Re: Add two missing tests in 035_standby_logical_decoding.pl

On Wed, 3 May 2023 at 15:59, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, May 2, 2023 at 4:52 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

As per your suggestion, changing the insert ordering (like in V8 attached) makes it now work on the failing environment too.

I think it is better to use wait_for_replay_catchup() to wait for
standby to catch up. I have changed that and a comment in the
attached. I'll push this tomorrow unless there are further comments.

Thanks for posting the updated patch, I had run this test in a loop of
100 times to verify that there was no failure because of race
conditions. The 100 times execution passed successfully.

One suggestion:
"wal file" should be changed to "WAL file":
+# Request a checkpoint on the standby to trigger the WAL file(s) removal
+$node_standby->safe_psql('postgres', 'checkpoint;');
+
+# Verify that the wal file has not been retained on the standby
+my $standby_walfile = $node_standby->data_dir . '/pg_wal/' . $walfile_name;

Regards,
Vignesh

#34Amit Kapila
amit.kapila16@gmail.com
In reply to: vignesh C (#33)
Re: Add two missing tests in 035_standby_logical_decoding.pl

On Thu, May 4, 2023 at 8:37 AM vignesh C <vignesh21@gmail.com> wrote:

Thanks for posting the updated patch, I had run this test in a loop of
100 times to verify that there was no failure because of race
conditions. The 100 times execution passed successfully.

One suggestion:
"wal file" should be changed to "WAL file":
+# Request a checkpoint on the standby to trigger the WAL file(s) removal
+$node_standby->safe_psql('postgres', 'checkpoint;');
+
+# Verify that the wal file has not been retained on the standby
+my $standby_walfile = $node_standby->data_dir . '/pg_wal/' . $walfile_name;

Thanks for the verification. I have pushed the patch.

--
With Regards,
Amit Kapila.

#35Drouvot, Bertrand
bertranddrouvot.pg@gmail.com
In reply to: Amit Kapila (#34)
Re: Add two missing tests in 035_standby_logical_decoding.pl

Hi,

On 5/4/23 6:43 AM, Amit Kapila wrote:

On Thu, May 4, 2023 at 8:37 AM vignesh C <vignesh21@gmail.com> wrote:

Thanks for posting the updated patch, I had run this test in a loop of
100 times to verify that there was no failure because of race
conditions. The 100 times execution passed successfully.

One suggestion:
"wal file" should be changed to "WAL file":
+# Request a checkpoint on the standby to trigger the WAL file(s) removal
+$node_standby->safe_psql('postgres', 'checkpoint;');
+
+# Verify that the wal file has not been retained on the standby
+my $standby_walfile = $node_standby->data_dir . '/pg_wal/' . $walfile_name;

Thanks for the verification. I have pushed the patch.

Thanks!

I've marked the CF entry as Committed and moved the associated PostgreSQL 16 Open Item
to "resolved before 16beta1".

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#36Drouvot, Bertrand
bertranddrouvot.pg@gmail.com
In reply to: Amit Kapila (#34)
Re: Add two missing tests in 035_standby_logical_decoding.pl

Hi,

On 5/4/23 6:43 AM, Amit Kapila wrote:

On Thu, May 4, 2023 at 8:37 AM vignesh C <vignesh21@gmail.com> wrote:

Thanks for posting the updated patch, I had run this test in a loop of
100 times to verify that there was no failure because of race
conditions. The 100 times execution passed successfully.

One suggestion:
"wal file" should be changed to "WAL file":
+# Request a checkpoint on the standby to trigger the WAL file(s) removal
+$node_standby->safe_psql('postgres', 'checkpoint;');
+
+# Verify that the wal file has not been retained on the standby
+my $standby_walfile = $node_standby->data_dir . '/pg_wal/' . $walfile_name;

Thanks for the verification. I have pushed the patch.

It looks like there is still something wrong with this test as there
are a bunch of cfbot errors on this new test (mainly on macOS - Ventura - Meson).

I'll try to reproduce with more debug infos.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#37Amit Kapila
amit.kapila16@gmail.com
In reply to: Drouvot, Bertrand (#36)
Re: Add two missing tests in 035_standby_logical_decoding.pl

On Fri, May 5, 2023 at 11:08 AM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

On 5/4/23 6:43 AM, Amit Kapila wrote:

On Thu, May 4, 2023 at 8:37 AM vignesh C <vignesh21@gmail.com> wrote:

Thanks for posting the updated patch, I had run this test in a loop of
100 times to verify that there was no failure because of race
conditions. The 100 times execution passed successfully.

One suggestion:
"wal file" should be changed to "WAL file":
+# Request a checkpoint on the standby to trigger the WAL file(s) removal
+$node_standby->safe_psql('postgres', 'checkpoint;');
+
+# Verify that the wal file has not been retained on the standby
+my $standby_walfile = $node_standby->data_dir . '/pg_wal/' . $walfile_name;

Thanks for the verification. I have pushed the patch.

It looks like there is still something wrong with this test as there
are a bunch of cfbot errors on this new test (mainly on macOS - Ventura - Meson).

Is it possible for you to point me to those failures?

I'll try to reproduce with more debug infos.

Okay, thanks!

--
With Regards,
Amit Kapila.

#38vignesh C
vignesh21@gmail.com
In reply to: Amit Kapila (#37)
Re: Add two missing tests in 035_standby_logical_decoding.pl

On Fri, 5 May 2023 at 12:34, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, May 5, 2023 at 11:08 AM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

On 5/4/23 6:43 AM, Amit Kapila wrote:

On Thu, May 4, 2023 at 8:37 AM vignesh C <vignesh21@gmail.com> wrote:

Thanks for posting the updated patch, I had run this test in a loop of
100 times to verify that there was no failure because of race
conditions. The 100 times execution passed successfully.

One suggestion:
"wal file" should be changed to "WAL file":
+# Request a checkpoint on the standby to trigger the WAL file(s) removal
+$node_standby->safe_psql('postgres', 'checkpoint;');
+
+# Verify that the wal file has not been retained on the standby
+my $standby_walfile = $node_standby->data_dir . '/pg_wal/' . $walfile_name;

Thanks for the verification. I have pushed the patch.

It looks like there is still something wrong with this test as there
are a bunch of cfbot errors on this new test (mainly on macOS - Ventura - Meson).

Is it possible for you to point me to those failures?

I think these failures are occurring in CFBOT, one such instance is at:
https://cirrus-ci.com/task/6642271152504832?logs=test_world#L39
https://api.cirrus-ci.com/v1/artifact/task/6642271152504832/testrun/build/testrun/recovery/035_standby_logical_decoding/log/regress_log_035_standby_logical_decoding

Regards,
Vignesh

#39Drouvot, Bertrand
bertranddrouvot.pg@gmail.com
In reply to: vignesh C (#38)
Re: Add two missing tests in 035_standby_logical_decoding.pl

Hi,

On 5/5/23 9:11 AM, vignesh C wrote:

On Fri, 5 May 2023 at 12:34, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, May 5, 2023 at 11:08 AM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

On 5/4/23 6:43 AM, Amit Kapila wrote:

On Thu, May 4, 2023 at 8:37 AM vignesh C <vignesh21@gmail.com> wrote:

Thanks for posting the updated patch, I had run this test in a loop of
100 times to verify that there was no failure because of race
conditions. The 100 times execution passed successfully.

One suggestion:
"wal file" should be changed to "WAL file":
+# Request a checkpoint on the standby to trigger the WAL file(s) removal
+$node_standby->safe_psql('postgres', 'checkpoint;');
+
+# Verify that the wal file has not been retained on the standby
+my $standby_walfile = $node_standby->data_dir . '/pg_wal/' . $walfile_name;

Thanks for the verification. I have pushed the patch.

It looks like there is still something wrong with this test as there
are a bunch of cfbot errors on this new test (mainly on macOS - Ventura - Meson).

Is it possible for you to point me to those failures?

I think these failures are occurring in CFBOT, one such instance is at:
https://cirrus-ci.com/task/6642271152504832?logs=test_world#L39
https://api.cirrus-ci.com/v1/artifact/task/6642271152504832/testrun/build/testrun/recovery/035_standby_logical_decoding/log/regress_log_035_standby_logical_decoding

Yeah, thanks, that's one of them.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#40Drouvot, Bertrand
bertranddrouvot.pg@gmail.com
In reply to: Amit Kapila (#37)
Re: Add two missing tests in 035_standby_logical_decoding.pl

On 5/5/23 9:04 AM, Amit Kapila wrote:

On Fri, May 5, 2023 at 11:08 AM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

On 5/4/23 6:43 AM, Amit Kapila wrote:

On Thu, May 4, 2023 at 8:37 AM vignesh C <vignesh21@gmail.com> wrote:

Thanks for posting the updated patch, I had run this test in a loop of
100 times to verify that there was no failure because of race
conditions. The 100 times execution passed successfully.

One suggestion:
"wal file" should be changed to "WAL file":
+# Request a checkpoint on the standby to trigger the WAL file(s) removal
+$node_standby->safe_psql('postgres', 'checkpoint;');
+
+# Verify that the wal file has not been retained on the standby
+my $standby_walfile = $node_standby->data_dir . '/pg_wal/' . $walfile_name;

Thanks for the verification. I have pushed the patch.

It looks like there is still something wrong with this test as there
are a bunch of cfbot errors on this new test (mainly on macOS - Ventura - Meson).

Is it possible for you to point me to those failures?

I'll try to reproduce with more debug infos.

Okay, thanks!

After multiple attempts, I got one failing one.

Issue is that we expect this file to be removed:

[07:24:27.261](0.899s) #WAL file is /Users/admin/pgsql/build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_standby_data/pgdata/pg_wal/000000010000000000000003

But the standby emits:

2023-05-05 07:24:27.216 UTC [17909][client backend] [035_standby_logical_decoding.pl][3/6:0] LOG: statement: checkpoint;
2023-05-05 07:24:27.216 UTC [17745][checkpointer] LOG: restartpoint starting: immediate wait
2023-05-05 07:24:27.259 UTC [17745][checkpointer] LOG: attempting to remove WAL segments older than log file 000000000000000000000002

So it seems the test is not right (missing activity??), not sure why yet.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#41Amit Kapila
amit.kapila16@gmail.com
In reply to: Drouvot, Bertrand (#40)
Re: Add two missing tests in 035_standby_logical_decoding.pl

On Fri, May 5, 2023 at 1:16 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

After multiple attempts, I got one failing one.

Issue is that we expect this file to be removed:

[07:24:27.261](0.899s) #WAL file is /Users/admin/pgsql/build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_standby_data/pgdata/pg_wal/000000010000000000000003

But the standby emits:

2023-05-05 07:24:27.216 UTC [17909][client backend] [035_standby_logical_decoding.pl][3/6:0] LOG: statement: checkpoint;
2023-05-05 07:24:27.216 UTC [17745][checkpointer] LOG: restartpoint starting: immediate wait
2023-05-05 07:24:27.259 UTC [17745][checkpointer] LOG: attempting to remove WAL segments older than log file 000000000000000000000002

So it seems the test is not right (missing activity??), not sure why yet.

Can you try to print the value returned by
XLogGetReplicationSlotMinimumLSN() in KeepLogSeg() on standby? Also,
please try to print "attempting to remove WAL segments ..." on the
primary. That way we can see if, by any chance, some slot is preventing us
from removing the required WAL file.
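
As a rough sketch, such tracing could go right after the
XLogGetReplicationSlotMinimumLSN() call in KeepLogSeg() in xlog.c (the
variable names below follow that function and may differ):

    keep = XLogGetReplicationSlotMinimumLSN();
    /* Log the slot horizon that may hold back the WAL removal boundary. */
    elog(LOG, "KeepLogSeg: slot minimum LSN is %X/%X, recptr is %X/%X",
         LSN_FORMAT_ARGS(keep), LSN_FORMAT_ARGS(recptr));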

--
With Regards,
Amit Kapila.

#42Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#41)
Re: Add two missing tests in 035_standby_logical_decoding.pl

On Fri, May 5, 2023 at 2:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, May 5, 2023 at 1:16 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

After multiple attempts, I got one failing one.

Issue is that we expect this file to be removed:

[07:24:27.261](0.899s) #WAL file is /Users/admin/pgsql/build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_standby_data/pgdata/pg_wal/000000010000000000000003

But the standby emits:

2023-05-05 07:24:27.216 UTC [17909][client backend] [035_standby_logical_decoding.pl][3/6:0] LOG: statement: checkpoint;
2023-05-05 07:24:27.216 UTC [17745][checkpointer] LOG: restartpoint starting: immediate wait
2023-05-05 07:24:27.259 UTC [17745][checkpointer] LOG: attempting to remove WAL segments older than log file 000000000000000000000002

So it seems the test is not right (missing activity??), not sure why yet.

Can you try to print the value returned by
XLogGetReplicationSlotMinimumLSN() in KeepLogSeg() on standby? Also,
please try to print "attempting to remove WAL segments ..." on the
primary. We can see, if by any chance some slot is holding us to
remove the required WAL file.

We can also probably check the values of 'endptr', 'receivePtr', and
replayPtr on standby in the below code:

CreateRestartPoint()
{
...
/*
* Retreat _logSegNo using the current end of xlog replayed or received,
* whichever is later.
*/
receivePtr = GetWalRcvFlushRecPtr(NULL, NULL);
replayPtr = GetXLogReplayRecPtr(&replayTLI);
endptr = (receivePtr < replayPtr) ? replayPtr : receivePtr;
KeepLogSeg(endptr, &_logSegNo);
...
}
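
As an illustration, those values could be traced with something like the
following, just before the KeepLogSeg() call shown above (a sketch only;
the exact placement is approximate):

    /* Show which LSNs feed the WAL removal decision on the standby. */
    elog(LOG, "CreateRestartPoint: receivePtr %X/%X, replayPtr %X/%X, endptr %X/%X",
         LSN_FORMAT_ARGS(receivePtr), LSN_FORMAT_ARGS(replayPtr),
         LSN_FORMAT_ARGS(endptr));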

--
With Regards,
Amit Kapila.

#43Drouvot, Bertrand
bertranddrouvot.pg@gmail.com
In reply to: Amit Kapila (#41)
Re: Add two missing tests in 035_standby_logical_decoding.pl

On 5/5/23 11:29 AM, Amit Kapila wrote:

On Fri, May 5, 2023 at 1:16 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

After multiple attempts, I got one failing one.

Issue is that we expect this file to be removed:

[07:24:27.261](0.899s) #WAL file is /Users/admin/pgsql/build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_standby_data/pgdata/pg_wal/000000010000000000000003

But the standby emits:

2023-05-05 07:24:27.216 UTC [17909][client backend] [035_standby_logical_decoding.pl][3/6:0] LOG: statement: checkpoint;
2023-05-05 07:24:27.216 UTC [17745][checkpointer] LOG: restartpoint starting: immediate wait
2023-05-05 07:24:27.259 UTC [17745][checkpointer] LOG: attempting to remove WAL segments older than log file 000000000000000000000002

So it seems the test is not right (missing activity??), not sure why yet.

Can you try to print the value returned by
XLogGetReplicationSlotMinimumLSN() in KeepLogSeg() on standby? Also,
please try to print "attempting to remove WAL segments ..." on the
primary. We can see, if by any chance some slot is holding us to
remove the required WAL file.

I turned DEBUG2 on. We can also see on the primary:

2023-05-05 08:23:30.843 UTC [16833][checkpointer] LOCATION: CheckPointReplicationSlots, slot.c:1576
2023-05-05 08:23:30.844 UTC [16833][checkpointer] DEBUG: 00000: snapshot of 0+0 running transaction ids (lsn 0/40000D0 oldest xid 746 latest complete 745 next xid 746)
2023-05-05 08:23:30.844 UTC [16833][checkpointer] LOCATION: LogCurrentRunningXacts, standby.c:1377
2023-05-05 08:23:30.845 UTC [16833][checkpointer] LOG: 00000: BDT1 about to call RemoveOldXlogFiles in CreateCheckPoint
2023-05-05 08:23:30.845 UTC [16833][checkpointer] LOCATION: CreateCheckPoint, xlog.c:6835
2023-05-05 08:23:30.845 UTC [16833][checkpointer] LOG: 00000: attempting to remove WAL segments older than log file 000000000000000000000002
2023-05-05 08:23:30.845 UTC [16833][checkpointer] LOCATION: RemoveOldXlogFiles, xlog.c:3560
2023-05-05 08:23:30.845 UTC [16833][checkpointer] DEBUG: 00000: recycled write-ahead log file "000000010000000000000001"
2023-05-05 08:23:30.845 UTC [16833][checkpointer] LOCATION: RemoveXlogFile, xlog.c:3708
2023-05-05 08:23:30.845 UTC [16833][checkpointer] DEBUG: 00000: recycled write-ahead log file "000000010000000000000002"
2023-05-05 08:23:30.845 UTC [16833][checkpointer] LOCATION: RemoveXlogFile, xlog.c:3708
2023-05-05 08:23:30.845 UTC [16833][checkpointer] DEBUG: 00000: SlruScanDirectory invoking callback on pg_subtrans/0000

So, 000000010000000000000003 is not removed on the primary.

It has been recycled on:

2023-05-05 08:23:38.605 UTC [16833][checkpointer] DEBUG: 00000: recycled write-ahead log file "000000010000000000000003"

Which is later than the test:

[08:23:31.931](0.000s) not ok 19 - invalidated logical slots do not lead to retaining WAL

FWIW, the failing test with DEBUG2 can be found there: https://cirrus-ci.com/task/5615316688961536

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#44Amit Kapila
amit.kapila16@gmail.com
In reply to: Drouvot, Bertrand (#43)
Re: Add two missing tests in 035_standby_logical_decoding.pl

On Fri, May 5, 2023 at 4:02 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

On 5/5/23 11:29 AM, Amit Kapila wrote:

On Fri, May 5, 2023 at 1:16 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

After multiple attempts, I got one failing one.

Issue is that we expect this file to be removed:

[07:24:27.261](0.899s) #WAL file is /Users/admin/pgsql/build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_standby_data/pgdata/pg_wal/000000010000000000000003

But the standby emits:

2023-05-05 07:24:27.216 UTC [17909][client backend] [035_standby_logical_decoding.pl][3/6:0] LOG: statement: checkpoint;
2023-05-05 07:24:27.216 UTC [17745][checkpointer] LOG: restartpoint starting: immediate wait
2023-05-05 07:24:27.259 UTC [17745][checkpointer] LOG: attempting to remove WAL segments older than log file 000000000000000000000002

So it seems the test is not right (missing activity??), not sure why yet.

Can you try to print the value returned by
XLogGetReplicationSlotMinimumLSN() in KeepLogSeg() on standby? Also,
please try to print "attempting to remove WAL segments ..." on the
primary. We can see, if by any chance some slot is holding us to
remove the required WAL file.

I turned DEBUG2 on. We can also see on the primary:

2023-05-05 08:23:30.843 UTC [16833][checkpointer] LOCATION: CheckPointReplicationSlots, slot.c:1576
2023-05-05 08:23:30.844 UTC [16833][checkpointer] DEBUG: 00000: snapshot of 0+0 running transaction ids (lsn 0/40000D0 oldest xid 746 latest complete 745 next xid 746)
2023-05-05 08:23:30.844 UTC [16833][checkpointer] LOCATION: LogCurrentRunningXacts, standby.c:1377
2023-05-05 08:23:30.845 UTC [16833][checkpointer] LOG: 00000: BDT1 about to call RemoveOldXlogFiles in CreateCheckPoint
2023-05-05 08:23:30.845 UTC [16833][checkpointer] LOCATION: CreateCheckPoint, xlog.c:6835
2023-05-05 08:23:30.845 UTC [16833][checkpointer] LOG: 00000: attempting to remove WAL segments older than log file 000000000000000000000002
2023-05-05 08:23:30.845 UTC [16833][checkpointer] LOCATION: RemoveOldXlogFiles, xlog.c:3560
2023-05-05 08:23:30.845 UTC [16833][checkpointer] DEBUG: 00000: recycled write-ahead log file "000000010000000000000001"
2023-05-05 08:23:30.845 UTC [16833][checkpointer] LOCATION: RemoveXlogFile, xlog.c:3708
2023-05-05 08:23:30.845 UTC [16833][checkpointer] DEBUG: 00000: recycled write-ahead log file "000000010000000000000002"
2023-05-05 08:23:30.845 UTC [16833][checkpointer] LOCATION: RemoveXlogFile, xlog.c:3708
2023-05-05 08:23:30.845 UTC [16833][checkpointer] DEBUG: 00000: SlruScanDirectory invoking callback on pg_subtrans/0000

So, 000000010000000000000003 is not removed on the primary.

How did you conclude that 000000010000000000000003 is the file the
test is expecting to be removed?

--
With Regards,
Amit Kapila.

#45Drouvot, Bertrand
bertranddrouvot.pg@gmail.com
In reply to: Amit Kapila (#44)
Re: Add two missing tests in 035_standby_logical_decoding.pl

On 5/5/23 12:58 PM, Amit Kapila wrote:

On Fri, May 5, 2023 at 4:02 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

How did you conclude that 000000010000000000000003 is the file the
test is expecting to be removed?

because I added a note in the test that way:

"
@@ -535,6 +539,7 @@ $node_standby->safe_psql('postgres', 'checkpoint;');

# Verify that the WAL file has not been retained on the standby
my $standby_walfile = $node_standby->data_dir . '/pg_wal/' . $walfile_name;
+note "BDT WAL file is $standby_walfile";
ok(!-f "$standby_walfile",
"invalidated logical slots do not lead to retaining WAL");
"

so that I can check in the test log file:

grep "WAL file is" ./build/testrun/recovery/035_standby_logical_decoding/log/regress_log_035_standby_logical_decoding
[08:23:31.931](2.217s) # BDT WAL file is /Users/admin/pgsql/build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_standby_data/pgdata/pg_wal/000000010000000000000003

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#46Amit Kapila
amit.kapila16@gmail.com
In reply to: Drouvot, Bertrand (#45)
Re: Add two missing tests in 035_standby_logical_decoding.pl

On Fri, May 5, 2023 at 5:36 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

On 5/5/23 12:58 PM, Amit Kapila wrote:

On Fri, May 5, 2023 at 4:02 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

How did you concluded that 000000010000000000000003 is the file the
test is expecting to be removed?

because I added a note in the test that way:

"
@@ -535,6 +539,7 @@ $node_standby->safe_psql('postgres', 'checkpoint;');

# Verify that the WAL file has not been retained on the standby
my $standby_walfile = $node_standby->data_dir . '/pg_wal/' . $walfile_name;
+note "BDT WAL file is $standby_walfile";
ok(!-f "$standby_walfile",
"invalidated logical slots do not lead to retaining WAL");
"

so that I can check in the test log file:

grep "WAL file is" ./build/testrun/recovery/035_standby_logical_decoding/log/regress_log_035_standby_logical_decoding
[08:23:31.931](2.217s) # BDT WAL file is /Users/admin/pgsql/build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_standby_data/pgdata/pg_wal/000000010000000000000003

It seems that for some reason the current WAL file is not switched. I
think we need to add more DEBUG info to find that
out. Can you please try to print 'RedoRecPtr', '_logSegNo', and
recptr?

/*
* Delete old log files, those no longer needed for last checkpoint to
* prevent the disk holding the xlog from growing full.
*/
XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
KeepLogSeg(recptr, &_logSegNo);
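
A sketch of what that could look like around the excerpt above (placement
and wording are approximate; the logs quoted later in this thread use a
similar message):

    XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
    /* Trace the inputs to the removal boundary computation on the primary. */
    elog(LOG, "CreateCheckPoint: RedoRecPtr %X/%X, recptr %X/%X, _logSegNo " UINT64_FORMAT,
         LSN_FORMAT_ARGS(RedoRecPtr), LSN_FORMAT_ARGS(recptr), _logSegNo);
    KeepLogSeg(recptr, &_logSegNo);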

--
With Regards,
Amit Kapila.

#47Drouvot, Bertrand
bertranddrouvot.pg@gmail.com
In reply to: Amit Kapila (#46)
Re: Add two missing tests in 035_standby_logical_decoding.pl

On 5/5/23 2:28 PM, Amit Kapila wrote:

On Fri, May 5, 2023 at 5:36 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

It seems that for some reason the current WAL file is not switched.

Oh wait, here is a NON failing one: https://cirrus-ci.com/task/5086849685782528 (I modified the
.cirrus.yml so that we can download the "testrun.zip" file even if the test is not failing).

So, in this testrun.zip we can see that the test is ok:

$ grep -i retain ./build/testrun/recovery/035_standby_logical_decoding/log/regress_log_035_standby_logical_decoding
[10:06:08.789](0.000s) ok 19 - invalidated logical slots do not lead to retaining WAL

and that the WAL file we expect to be removed is:

$ grep "WAL file is" ./build/testrun/recovery/035_standby_logical_decoding/log/regress_log_035_standby_logical_decoding
[10:06:08.789](0.925s) # BDT WAL file is /Users/admin/pgsql/build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_standby_data/pgdata/pg_wal/000000010000000000000003

This WAL file has been removed by the standby:

$ grep -i 000000010000000000000003 ./build/testrun/recovery/035_standby_logical_decoding/log/035_standby_logical_decoding_standby.log | grep -i recy
2023-05-05 10:06:08.787 UTC [17521][checkpointer] DEBUG: 00000: recycled write-ahead log file "000000010000000000000003"

But on the primary, it has been recycled way after that time:

$ grep -i 000000010000000000000003 ./build/testrun/recovery/035_standby_logical_decoding/log/035_standby_logical_decoding_primary.log | grep -i recy
2023-05-05 10:06:13.370 UTC [16785][checkpointer] DEBUG: 00000: recycled write-ahead log file "000000010000000000000003"

Indeed, the checkpoint on the primary after the WAL file switch only recycled 001 and 002:

$ grep -i recycled ./build/testrun/recovery/035_standby_logical_decoding/log/035_standby_logical_decoding_primary.log
2023-05-05 10:05:57.196 UTC [16785][checkpointer] LOG: 00000: checkpoint complete: wrote 4 buffers (3.1%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.027 s; sync files=0, longest=0.000 s, average=0.000 s; distance=11219 kB, estimate=11219 kB; lsn=0/2000060, redo lsn=0/2000028
2023-05-05 10:06:08.138 UTC [16785][checkpointer] DEBUG: 00000: recycled write-ahead log file "000000010000000000000001"
2023-05-05 10:06:08.138 UTC [16785][checkpointer] DEBUG: 00000: recycled write-ahead log file "000000010000000000000002"
2023-05-05 10:06:08.138 UTC [16785][checkpointer] LOG: 00000: checkpoint complete: wrote 20 buffers (15.6%); 0 WAL file(s) added, 0 removed, 2 recycled; write=0.001 s, sync=0.001 s, total=0.003 s; sync files=0, longest=0.000 s, average=0.000 s; distance=32768 kB, estimate=32768 kB; lsn=0/40000D0, redo lsn=0/4000098

So, even on a successful test, we can see that the WAL file we expect to be removed on the standby has not been recycled on the primary before the test.

I think we need to add more DEBUG info to find that
out. Can you please try to print 'RedoRecPtr', '_logSegNo', and
recptr?

Yes, will do.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#48Amit Kapila
amit.kapila16@gmail.com
In reply to: Drouvot, Bertrand (#47)
Re: Add two missing tests in 035_standby_logical_decoding.pl

On Fri, May 5, 2023 at 7:53 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

On 5/5/23 2:28 PM, Amit Kapila wrote:

On Fri, May 5, 2023 at 5:36 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

It seems that for some reason the current WAL file is not switched.

Oh wait, here is a NON failing one: https://cirrus-ci.com/task/5086849685782528 (I modified the
.cirrus.yml so that we can download the "testrun.zip" file even if the test is not failing).

So, in this testrun.zip we can see that the test is ok:

$ grep -i retain ./build/testrun/recovery/035_standby_logical_decoding/log/regress_log_035_standby_logical_decoding
[10:06:08.789](0.000s) ok 19 - invalidated logical slots do not lead to retaining WAL

and that the WAL file we expect to be removed is:

$ grep "WAL file is" ./build/testrun/recovery/035_standby_logical_decoding/log/regress_log_035_standby_logical_decoding
[10:06:08.789](0.925s) # BDT WAL file is /Users/admin/pgsql/build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_standby_data/pgdata/pg_wal/000000010000000000000003

This WAL file has been removed by the standby:

$ grep -i 000000010000000000000003 ./build/testrun/recovery/035_standby_logical_decoding/log/035_standby_logical_decoding_standby.log | grep -i recy
2023-05-05 10:06:08.787 UTC [17521][checkpointer] DEBUG: 00000: recycled write-ahead log file "000000010000000000000003"

But on the primary, it has been recycled way after that time:

$ grep -i 000000010000000000000003 ./build/testrun/recovery/035_standby_logical_decoding/log/035_standby_logical_decoding_primary.log | grep -i recy
2023-05-05 10:06:13.370 UTC [16785][checkpointer] DEBUG: 00000: recycled write-ahead log file "000000010000000000000003"

Indeed, the checkpoint on the primary after the WAL file switch only recycled 001 and 002:

$ grep -i recycled ./build/testrun/recovery/035_standby_logical_decoding/log/035_standby_logical_decoding_primary.log
2023-05-05 10:05:57.196 UTC [16785][checkpointer] LOG: 00000: checkpoint complete: wrote 4 buffers (3.1%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.027 s; sync files=0, longest=0.000 s, average=0.000 s; distance=11219 kB, estimate=11219 kB; lsn=0/2000060, redo lsn=0/2000028
2023-05-05 10:06:08.138 UTC [16785][checkpointer] DEBUG: 00000: recycled write-ahead log file "000000010000000000000001"
2023-05-05 10:06:08.138 UTC [16785][checkpointer] DEBUG: 00000: recycled write-ahead log file "000000010000000000000002"
2023-05-05 10:06:08.138 UTC [16785][checkpointer] LOG: 00000: checkpoint complete: wrote 20 buffers (15.6%); 0 WAL file(s) added, 0 removed, 2 recycled; write=0.001 s, sync=0.001 s, total=0.003 s; sync files=0, longest=0.000 s, average=0.000 s; distance=32768 kB, estimate=32768 kB; lsn=0/40000D0, redo lsn=0/4000098

So, even on a successful test, we can see that the WAL file we expect to be removed on the standby has not been recycled on the primary before the test.

Okay, one possibility for the file not being removed on the primary is
that at the time of the checkpoint (when we compute RedoRecPtr), the WAL
switch and insert have not yet been performed, because in that case it
will compute the RedoRecPtr as a location before those operations, which
would be in the 0000*3 file. However, it is not clear how that is
possible, except for a background checkpoint happening at that point; and
from the LOGs, it appears that the checkpoint triggered by the test has
recycled the WAL files.

I think we need to add more DEBUG info to find that
out. Can you please try to print 'RedoRecPtr', '_logSegNo', and
recptr?

Yes, will do.

Okay, thanks, please try to print similar locations on standby in
CreateRestartPoint().

--
With Regards,
Amit Kapila.

#49Drouvot, Bertrand
bertranddrouvot.pg@gmail.com
In reply to: Amit Kapila (#48)
Re: Add two missing tests in 035_standby_logical_decoding.pl

Hi,

On 5/6/23 4:10 AM, Amit Kapila wrote:

On Fri, May 5, 2023 at 7:53 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

On 5/5/23 2:28 PM, Amit Kapila wrote:

On Fri, May 5, 2023 at 5:36 PM Drouvot, Bertrand

So, even on a successful test, we can see that the WAL file we expect to be removed on the standby has not been recycled on the primary before the test.

Okay, one possibility of not removing on primary is that at the time
of checkpoint (when we compute RedoRecPtr), the wal_swtich and insert
is not yet performed because in that case it will compute the
RedoRecPtr as a location before those operations which would be 0000*3
file. However, it is not clear how is that possible except from a
background checkpoint happening at that point but from LOGs, it
appears that the checkpoint triggered by test has recycled the wal
files.

I think we need to add more DEBUG info to find that
out. Can you please try to print 'RedoRecPtr', '_logSegNo', and
recptr?

Yes, will do.

Okay, thanks, please try to print similar locations on standby in
CreateRestartPoint().

The extra information is displayed that way:

https://github.com/bdrouvot/postgres/commit/a3d6d58d105b379c04a17a1129bfb709302588ca#diff-c1cb3ab2a19606390c1a7ed00ffe5a45531702ca5faf999d401c548f8951c65bR6822-R6830
https://github.com/bdrouvot/postgres/commit/a3d6d58d105b379c04a17a1129bfb709302588ca#diff-c1cb3ab2a19606390c1a7ed00ffe5a45531702ca5faf999d401c548f8951c65bR7269-R7271
https://github.com/bdrouvot/postgres/commit/a3d6d58d105b379c04a17a1129bfb709302588ca#diff-c1cb3ab2a19606390c1a7ed00ffe5a45531702ca5faf999d401c548f8951c65bR7281-R7284

There is 2 runs with this extra info in place:

A successful one: https://cirrus-ci.com/task/6528745436086272
A failed one: https://cirrus-ci.com/task/4558139312308224

For both the testrun.zip is available in the Artifacts section.

Sharing this now in case you want to have a look (I'll have a look at them early next week on my side).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#50Amit Kapila
amit.kapila16@gmail.com
In reply to: Drouvot, Bertrand (#49)
Re: Add two missing tests in 035_standby_logical_decoding.pl

On Sat, May 6, 2023 at 1:52 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

There is 2 runs with this extra info in place:

A successful one: https://cirrus-ci.com/task/6528745436086272
A failed one: https://cirrus-ci.com/task/4558139312308224

Thanks, I think I got some clue as to why this test is failing
randomly. Following is the comparison of successful and failed run
logs for standby:

Success case
============
2023-05-06 07:23:05.496 UTC [17617][walsender]
[cascading_standby][3/0:0] DEBUG: 00000: write 0/4000148 flush
0/4000000 apply 0/4000000 reply_time 2023-05-06 07:23:05.496365+00
2023-05-06 07:23:05.496 UTC [17617][walsender]
[cascading_standby][3/0:0] LOCATION: ProcessStandbyReplyMessage,
walsender.c:2101
2023-05-06 07:23:05.496 UTC [17617][walsender]
[cascading_standby][3/0:0] DEBUG: 00000: write 0/4000148 flush
0/4000148 apply 0/4000000 reply_time 2023-05-06 07:23:05.4964+00
2023-05-06 07:23:05.496 UTC [17617][walsender]
[cascading_standby][3/0:0] LOCATION: ProcessStandbyReplyMessage,
walsender.c:2101
2023-05-06 07:23:05.496 UTC [17617][walsender]
[cascading_standby][3/0:0] DEBUG: 00000: write 0/4000148 flush
0/4000148 apply 0/4000148 reply_time 2023-05-06 07:23:05.496531+00
2023-05-06 07:23:05.496 UTC [17617][walsender]
[cascading_standby][3/0:0] LOCATION: ProcessStandbyReplyMessage,
walsender.c:2101
2023-05-06 07:23:05.500 UTC [17706][client backend]
[035_standby_logical_decoding.pl][2/12:0] LOG: 00000: statement:
checkpoint;
2023-05-06 07:23:05.500 UTC [17706][client backend]
[035_standby_logical_decoding.pl][2/12:0] LOCATION:
exec_simple_query, postgres.c:1074
2023-05-06 07:23:05.500 UTC [17550][checkpointer] LOG: 00000:
restartpoint starting: immediate wait
...
...
2023-05-06 07:23:05.500 UTC [17550][checkpointer] LOCATION:
CheckPointReplicationSlots, slot.c:1576
2023-05-06 07:23:05.501 UTC [17550][checkpointer] DEBUG: 00000:
updated min recovery point to 0/4000148 on timeline 1
2023-05-06 07:23:05.501 UTC [17550][checkpointer] LOCATION:
UpdateMinRecoveryPoint, xlog.c:2500
2023-05-06 07:23:05.515 UTC [17550][checkpointer] LOG: 00000:
CreateRestartPoint: After XLByteToSeg RedoRecPtr is 0/4000098,
_logSegNo is 4
2023-05-06 07:23:05.515 UTC [17550][checkpointer] LOCATION:
CreateRestartPoint, xlog.c:7271
2023-05-06 07:23:05.515 UTC [17550][checkpointer] LOG: 00000:
CreateRestartPoint: After KeepLogSeg RedoRecPtr is 0/4000098, endptr
is 0/4000148, _logSegNo is 4

Failed case:
==========
2023-05-06 07:53:19.657 UTC [17914][walsender]
[cascading_standby][3/0:0] DEBUG: 00000: write 0/3D1A000 flush
0/3CFA000 apply 0/4000000 reply_time 2023-05-06 07:53:19.65207+00
2023-05-06 07:53:19.657 UTC [17914][walsender]
[cascading_standby][3/0:0] LOCATION: ProcessStandbyReplyMessage,
walsender.c:2101
2023-05-06 07:53:19.657 UTC [17914][walsender]
[cascading_standby][3/0:0] DEBUG: 00000: write 0/3D1A000 flush
0/3D1A000 apply 0/4000000 reply_time 2023-05-06 07:53:19.656471+00
2023-05-06 07:53:19.657 UTC [17914][walsender]
[cascading_standby][3/0:0] LOCATION: ProcessStandbyReplyMessage,
walsender.c:2101
...
...
2023-05-06 07:53:19.686 UTC [17881][checkpointer] DEBUG: 00000:
updated min recovery point to 0/4000148 on timeline 1
2023-05-06 07:53:19.686 UTC [17881][checkpointer] LOCATION:
UpdateMinRecoveryPoint, xlog.c:2500
2023-05-06 07:53:19.707 UTC [17881][checkpointer] LOG: 00000:
CreateRestartPoint: After XLByteToSeg RedoRecPtr is 0/4000098,
_logSegNo is 4
2023-05-06 07:53:19.707 UTC [17881][checkpointer] LOCATION:
CreateRestartPoint, xlog.c:7271
2023-05-06 07:53:19.707 UTC [17881][checkpointer] LOG: 00000:
CreateRestartPoint: After KeepLogSeg RedoRecPtr is 0/4000098, endptr
is 0/4000148, _logSegNo is 3

Observations:
============
1. In the failed run, KeepLogSeg() reduced the _logSegNo to 3
which is the reason for the failure because now the standby won't be
able to remove/recycle the WAL file corresponding to segment number 3
which the test was expecting.
2. We didn't expect the KeepLogSeg() to reduce the _logSegNo because
all logical slots were invalidated. However, I think we forgot that
both standby and primary have physical slots which might also
influence the XLogGetReplicationSlotMinimumLSN() calculation in
KeepLogSeg().
3. Now, the reason for its success in some of the runs is that
restart_lsn of physical slots also moved ahead by the time checkpoint
happens. You can see the difference of LSNs for
ProcessStandbyReplyMessage in failed and successful cases.

Next steps:
=========
1. The first thing is we should verify this theory by adding some LOG
in KeepLogSeg() to see if the _logSegNo is reduced due to the value
returned by XLogGetReplicationSlotMinimumLSN().
2. The reason for the required file not being removed in the primary
is also that it has a physical slot which prevents the file removal.
3. If the above theory is correct then I see a few possibilities to
fix this test (a) somehow ensure that restart_lsn of the physical slot
on standby is advanced up to the point where we can safely remove the
required files; (b) just create a separate test case by initializing a
fresh node for primary and standby where we only have logical slots on
standby. This will be a bit costly but probably less risky. (c) any
better ideas?

--
With Regards,
Amit Kapila.

#51Drouvot, Bertrand
bertranddrouvot.pg@gmail.com
In reply to: Amit Kapila (#50)
Re: Add two missing tests in 035_standby_logical_decoding.pl

Hi,

On 5/6/23 3:28 PM, Amit Kapila wrote:

On Sat, May 6, 2023 at 1:52 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

There is 2 runs with this extra info in place:

A successful one: https://cirrus-ci.com/task/6528745436086272
A failed one: https://cirrus-ci.com/task/4558139312308224

Thanks, I think I got some clue as to why this test is failing
randomly.

Great, thanks!

Observations:
============
1. In the failed run, KeepLogSeg() reduced the _logSegNo to 3
which is the reason for the failure because now the standby won't be
able to remove/recycle the WAL file corresponding to segment number 3
which the test was expecting.

Agree.

2. We didn't expect the KeepLogSeg() to reduce the _logSegNo because
all logical slots were invalidated. However, I think we forgot that
both standby and primary have physical slots which might also
influence the XLogGetReplicationSlotMinimumLSN() calculation in
KeepLogSeg().

Oh right...

Next steps:
=========
1. The first thing is we should verify this theory by adding some LOG
in KeepLogSeg() to see if the _logSegNo is reduced due to the value
returned by XLogGetReplicationSlotMinimumLSN().

Yeah, will do that early next week.

2. The reason for the required file not being removed in the primary
is also that it has a physical slot which prevents the file removal.

Yeah, agree. But this one is not an issue as we are not
checking for the WAL file removal on the primary, do you agree?

3. If the above theory is correct then I see a few possibilities to
fix this test (a) somehow ensure that restart_lsn of the physical slot
on standby is advanced up to the point where we can safely remove the
required files; (b) just create a separate test case by initializing a
fresh node for primary and standby where we only have logical slots on
standby. This will be a bit costly but probably less risky. (c) any
better ideas?

(c): Since, I think, the physical slot on the primary is not a concern for
the reason mentioned above, instead of (b):

What about postponing the physical slot creation on the standby and the
cascading standby node initialization after this test?

That way, this test would be done without a physical slot on the standby and
we could also get rid of the "Wait for the cascading standby to catchup before
removing the WAL file(s)" part.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#52Amit Kapila
amit.kapila16@gmail.com
In reply to: Drouvot, Bertrand (#51)
Re: Add two missing tests in 035_standby_logical_decoding.pl

On Sat, May 6, 2023 at 9:33 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

On 5/6/23 3:28 PM, Amit Kapila wrote:

On Sat, May 6, 2023 at 1:52 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

Next steps:
=========
1. The first thing is we should verify this theory by adding some LOG
in KeepLogSeg() to see if the _logSegNo is reduced due to the value
returned by XLogGetReplicationSlotMinimumLSN().

Yeah, will do that early next week.

2. The reason for the required file not being removed in the primary
is also that it has a physical slot which prevents the file removal.

Yeah, agree. But this one is not an issue as we are not
checking for the WAL file removal on the primary, do you agree?

Agreed.

3. If the above theory is correct then I see a few possibilities to
fix this test (a) somehow ensure that restart_lsn of the physical slot
on standby is advanced up to the point where we can safely remove the
required files; (b) just create a separate test case by initializing a
fresh node for primary and standby where we only have logical slots on
standby. This will be a bit costly but probably less risky. (c) any
better ideas?

(c): Since, I think, the physical slot on the primary is not a concern for
the reason mentioned above, then instead of (b):

What about postponing the physical slot creation on the standby and the
cascading standby node initialization after this test?

Yeah, that is also possible. But, I have a few questions regarding
that: (a) There doesn't seem to be a physical slot on cascading
standby, if I am missing something, can you please point me to the
relevant part of the test? (b) Which test is currently dependent on
the physical slot on standby?

--
With Regards,
Amit Kapila.

#53Drouvot, Bertrand
bertranddrouvot.pg@gmail.com
In reply to: Amit Kapila (#52)
1 attachment(s)
Re: Add two missing tests in 035_standby_logical_decoding.pl

Hi,

On 5/8/23 4:42 AM, Amit Kapila wrote:

On Sat, May 6, 2023 at 9:33 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

On 5/6/23 3:28 PM, Amit Kapila wrote:

On Sat, May 6, 2023 at 1:52 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

Next steps:
=========
1. The first thing is we should verify this theory by adding some LOG
in KeepLogSeg() to see if the _logSegNo is reduced due to the value
returned by XLogGetReplicationSlotMinimumLSN().

Yeah, will do that early next week.

It's done with the following changes:

https://github.com/bdrouvot/postgres/commit/79e1bd9ab429a22f876b9364eb8a0da2dacaaef7#diff-c1cb3ab2a19606390c1a7ed00ffe5a45531702ca5faf999d401c548f8951c65bL7454-R7514

With that in place, there is one failing test here: https://cirrus-ci.com/task/5173216310722560

Where we can see:

2023-05-08 07:42:56.301 UTC [18038][checkpointer] LOCATION: UpdateMinRecoveryPoint, xlog.c:2500
2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOG: 00000: CreateRestartPoint: After XLByteToSeg RedoRecPtr is 0/4000098, _logSegNo is 4
2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOCATION: CreateRestartPoint, xlog.c:7271
2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOG: 00000: KeepLogSeg: segno changed to 4 due to XLByteToSeg
2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOCATION: KeepLogSeg, xlog.c:7473
2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOG: 00000: KeepLogSeg: segno changed to 3 due to XLogGetReplicationSlotMinimumLSN()
2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOCATION: KeepLogSeg, xlog.c:7483
2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOG: 00000: CreateRestartPoint: After KeepLogSeg RedoRecPtr is 0/4000098, endptr is 0/4000148, _logSegNo is 3
2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOCATION: CreateRestartPoint, xlog.c:7284
2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOG: 00000: BDT1 about to call RemoveOldXlogFiles in CreateRestartPoint
2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOCATION: CreateRestartPoint, xlog.c:7313
2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOG: 00000: attempting to remove WAL segments older than log file 000000000000000000000002

So the suspicion about XLogGetReplicationSlotMinimumLSN() was correct (_logSegNo moved from
4 to 3 due to XLogGetReplicationSlotMinimumLSN()).

What about postponing the physical slot creation on the standby and the
cascading standby node initialization after this test?

Yeah, that is also possible. But, I have a few questions regarding
that: (a) There doesn't seem to be a physical slot on cascading
standby, if I am missing something, can you please point me to the
relevant part of the test?

That's right. There is a physical slot only on the primary and on the standby.

What I meant up-thread is to also postpone the cascading standby node initialization
until after this test (once the physical slot on the standby has been created).

Please find attached a proposal doing so.

(b) Which test is currently dependent on
the physical slot on standby?

Not a test but the cascading standby initialization with the "primary_slot_name" parameter.

Also, I now think it's better to have the physical slot on the standby plus
hot_standby_feedback (hsf) set to on on the cascading standby (coming from the standby backup).

The idea is to avoid any risk of logical slot invalidation on the cascading standby in the
standby promotion test.

That was not the case before the attached proposal though (hsf was off on the cascading standby).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v1-0001-fix-retaining-WAL-test.patchtext/plain; charset=UTF-8; name=v1-0001-fix-retaining-WAL-test.patchDownload
From 38c67b0fd8f2d83e97a93108484fe23c881dd91c Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Mon, 8 May 2023 07:02:50 +0000
Subject: [PATCH v1] fix retaining WAL test

---
 .../t/035_standby_logical_decoding.pl         | 38 ++++++++++---------
 1 file changed, 21 insertions(+), 17 deletions(-)
 100.0% src/test/recovery/t/

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index ad1def2474..4bda706eae 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -276,20 +276,6 @@ $node_standby->append_conf('postgresql.conf',
        max_replication_slots = 5]);
 $node_standby->start;
 $node_primary->wait_for_replay_catchup($node_standby);
-$node_standby->safe_psql('testdb', qq[SELECT * FROM pg_create_physical_replication_slot('$standby_physical_slotname');]);
-
-#######################
-# Initialize cascading standby node
-#######################
-$node_standby->backup($backup_name);
-$node_cascading_standby->init_from_backup(
-	$node_standby, $backup_name,
-	has_streaming => 1,
-	has_restoring => 1);
-$node_cascading_standby->append_conf('postgresql.conf',
-	qq[primary_slot_name = '$standby_physical_slotname']);
-$node_cascading_standby->start;
-$node_standby->wait_for_replay_catchup($node_cascading_standby, $node_primary);
 
 #######################
 # Initialize subscriber node
@@ -503,9 +489,6 @@ check_slots_conflicting_status(1);
 # Verify that invalidated logical slots do not lead to retaining WAL.
 ##################################################
 
-# Wait for the cascading standby to catchup before removing the WAL file(s)
-$node_standby->wait_for_replay_catchup($node_cascading_standby, $node_primary);
-
 # Get the restart_lsn from an invalidated slot
 my $restart_lsn = $node_standby->safe_psql('postgres',
 	"SELECT restart_lsn from pg_replication_slots WHERE slot_name = 'vacuum_full_activeslot' and conflicting is true;"
@@ -538,6 +521,27 @@ my $standby_walfile = $node_standby->data_dir . '/pg_wal/' . $walfile_name;
 ok(!-f "$standby_walfile",
 	"invalidated logical slots do not lead to retaining WAL");
 
+# Now create the physical slot on the standby and initialize the cascading standby node.
+# It's done here because the physical slot on the standby could prevent the WAL
+# file from being removed in the test above (the WAL file removal could still be
+# possible, but we want to keep the above test simple).
+# Also please note that the cascading standby will be created with hot_standby_feedback
+# set to on thanks to the last change_hot_standby_feedback_and_wait_for_xmins() call.
+$node_standby->safe_psql('testdb', qq[SELECT * FROM pg_create_physical_replication_slot('$standby_physical_slotname');]);
+
+#######################
+# Initialize cascading standby node
+#######################
+$node_standby->backup($backup_name);
+$node_cascading_standby->init_from_backup(
+	$node_standby, $backup_name,
+	has_streaming => 1,
+	has_restoring => 1);
+$node_cascading_standby->append_conf('postgresql.conf',
+	qq[primary_slot_name = '$standby_physical_slotname']);
+$node_cascading_standby->start;
+$node_standby->wait_for_replay_catchup($node_cascading_standby, $node_primary);
+
 ##################################################
 # Recovery conflict: Invalidate conflicting slots, including in-use slots
 # Scenario 2: conflict due to row removal with hot_standby_feedback off.
-- 
2.34.1

#54Amit Kapila
amit.kapila16@gmail.com
In reply to: Drouvot, Bertrand (#53)
Re: Add two missing tests in 035_standby_logical_decoding.pl

On Mon, May 8, 2023 at 1:45 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

On 5/8/23 4:42 AM, Amit Kapila wrote:

On Sat, May 6, 2023 at 9:33 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

On 5/6/23 3:28 PM, Amit Kapila wrote:

On Sat, May 6, 2023 at 1:52 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

Next steps:
=========
1. The first thing is we should verify this theory by adding some LOG
in KeepLogSeg() to see if the _logSegNo is reduced due to the value
returned by XLogGetReplicationSlotMinimumLSN().

Yeah, will do that early next week.

It's done with the following changes:

https://github.com/bdrouvot/postgres/commit/79e1bd9ab429a22f876b9364eb8a0da2dacaaef7#diff-c1cb3ab2a19606390c1a7ed00ffe5a45531702ca5faf999d401c548f8951c65bL7454-R7514

With that in place, there is one failing test here: https://cirrus-ci.com/task/5173216310722560

Where we can see:

2023-05-08 07:42:56.301 UTC [18038][checkpointer] LOCATION: UpdateMinRecoveryPoint, xlog.c:2500
2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOG: 00000: CreateRestartPoint: After XLByteToSeg RedoRecPtr is 0/4000098, _logSegNo is 4
2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOCATION: CreateRestartPoint, xlog.c:7271
2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOG: 00000: KeepLogSeg: segno changed to 4 due to XLByteToSeg
2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOCATION: KeepLogSeg, xlog.c:7473
2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOG: 00000: KeepLogSeg: segno changed to 3 due to XLogGetReplicationSlotMinimumLSN()
2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOCATION: KeepLogSeg, xlog.c:7483
2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOG: 00000: CreateRestartPoint: After KeepLogSeg RedoRecPtr is 0/4000098, endptr is 0/4000148, _logSegNo is 3
2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOCATION: CreateRestartPoint, xlog.c:7284
2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOG: 00000: BDT1 about to call RemoveOldXlogFiles in CreateRestartPoint
2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOCATION: CreateRestartPoint, xlog.c:7313
2023-05-08 07:42:56.302 UTC [18038][checkpointer] LOG: 00000: attempting to remove WAL segments older than log file 000000000000000000000002

So the suspicion about XLogGetReplicationSlotMinimumLSN() was correct (_logSegNo moved from
4 to 3 due to XLogGetReplicationSlotMinimumLSN()).

What about postponing the physical slot creation on the standby and the
cascading standby node initialization until after this test?

Yeah, that is also possible. But I have a few questions regarding
that: (a) There doesn't seem to be a physical slot on the cascading
standby; if I am missing something, can you please point me to the
relevant part of the test?

That's right. There is a physical slot only on the primary and on the standby.

What I meant up-thread is to also postpone the cascading standby node initialization
until after this test (once the physical slot on the standby is created).

Please find attached a proposal doing so.

(b) Which test is currently dependent on
the physical slot on standby?

Not a test as such, but the cascading standby initialization, which relies on the "primary_slot_name" parameter.

Also, I now think it's better to have the physical slot on the standby plus hot_standby_feedback
(hsf) set to on on the cascading standby (coming from the standby backup).

The idea is to avoid any risk of logical slot invalidation on the cascading standby in the
standby promotion test.

Why not initialize the cascading standby node just before the standby
promotion test ("Test standby promotion and logical decoding behavior
after the standby gets promoted")? That way we will avoid any unknown
side-effects of the cascading standby, and it will look more logical
anyway to initialize it where the test needs it.

--
With Regards,
Amit Kapila.

#55Drouvot, Bertrand
bertranddrouvot.pg@gmail.com
In reply to: Amit Kapila (#54)
1 attachment(s)
Re: Add two missing tests in 035_standby_logical_decoding.pl

Hi,

On 5/9/23 8:02 AM, Amit Kapila wrote:

On Mon, May 8, 2023 at 1:45 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

Why not initialize the cascading standby node just before the standby
promotion test ("Test standby promotion and logical decoding behavior
after the standby gets promoted")? That way we will avoid any unknown
side-effects of the cascading standby, and it will look more logical
anyway to initialize it where the test needs it.

Yeah, that's even better. Moved the physical slot creation on the standby
and the cascading standby initialization to where they are "strictly" needed
in the attached V2.

Also ensured that hsf is set to on on the cascading standby, to be on the
safe side of things.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v2-0001-fix-retaining-WAL-test.patchtext/plain; charset=UTF-8; name=v2-0001-fix-retaining-WAL-test.patchDownload
From 0292d14a62b72b38197d434d85ee2c6a7f2cec00 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Tue, 9 May 2023 06:43:59 +0000
Subject: [PATCH v2] fix retaining WAL test

---
 .../t/035_standby_logical_decoding.pl         | 36 +++++++++----------
 1 file changed, 18 insertions(+), 18 deletions(-)
 100.0% src/test/recovery/t/

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index ad1def2474..2b4a688330 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -276,20 +276,6 @@ $node_standby->append_conf('postgresql.conf',
        max_replication_slots = 5]);
 $node_standby->start;
 $node_primary->wait_for_replay_catchup($node_standby);
-$node_standby->safe_psql('testdb', qq[SELECT * FROM pg_create_physical_replication_slot('$standby_physical_slotname');]);
-
-#######################
-# Initialize cascading standby node
-#######################
-$node_standby->backup($backup_name);
-$node_cascading_standby->init_from_backup(
-	$node_standby, $backup_name,
-	has_streaming => 1,
-	has_restoring => 1);
-$node_cascading_standby->append_conf('postgresql.conf',
-	qq[primary_slot_name = '$standby_physical_slotname']);
-$node_cascading_standby->start;
-$node_standby->wait_for_replay_catchup($node_cascading_standby, $node_primary);
 
 #######################
 # Initialize subscriber node
@@ -503,9 +489,6 @@ check_slots_conflicting_status(1);
 # Verify that invalidated logical slots do not lead to retaining WAL.
 ##################################################
 
-# Wait for the cascading standby to catchup before removing the WAL file(s)
-$node_standby->wait_for_replay_catchup($node_cascading_standby, $node_primary);
-
 # Get the restart_lsn from an invalidated slot
 my $restart_lsn = $node_standby->safe_psql('postgres',
 	"SELECT restart_lsn from pg_replication_slots WHERE slot_name = 'vacuum_full_activeslot' and conflicting is true;"
@@ -777,9 +760,26 @@ $node_standby->reload;
 $node_primary->psql('postgres', q[CREATE DATABASE testdb]);
 $node_primary->safe_psql('testdb', qq[CREATE TABLE decoding_test(x integer, y text);]);
 
-# Wait for the standby to catchup before creating the slots
+# Wait for the standby to catchup before initializing the cascading standby
 $node_primary->wait_for_replay_catchup($node_standby);
 
+# Create a physical replication slot on the standby.
+# Keep this step after the "Verify that invalidated logical slots do not lead
+# to retaining WAL" test (as the physical slot on the standby could prevent the
+# WAL file removal).
+$node_standby->safe_psql('testdb', qq[SELECT * FROM pg_create_physical_replication_slot('$standby_physical_slotname');]);
+
+# Initialize cascading standby node
+$node_standby->backup($backup_name);
+$node_cascading_standby->init_from_backup(
+	$node_standby, $backup_name,
+	has_streaming => 1,
+	has_restoring => 1);
+$node_cascading_standby->append_conf('postgresql.conf',
+	qq[primary_slot_name = '$standby_physical_slotname'
+	   hot_standby_feedback = on]);
+$node_cascading_standby->start;
+
 # create the logical slots
 create_logical_slots($node_standby, 'promotion_');
 
-- 
2.34.1

#56Amit Kapila
amit.kapila16@gmail.com
In reply to: Drouvot, Bertrand (#55)
Re: Add two missing tests in 035_standby_logical_decoding.pl

On Tue, May 9, 2023 at 12:44 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

On 5/9/23 8:02 AM, Amit Kapila wrote:

On Mon, May 8, 2023 at 1:45 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

Why not initialize the cascading standby node just before the standby
promotion test ("Test standby promotion and logical decoding behavior
after the standby gets promoted")? That way we will avoid any unknown
side-effects of the cascading standby, and it will look more logical
anyway to initialize it where the test needs it.

Yeah, that's even better. Moved the physical slot creation on the standby
and the cascading standby initialization to where they are "strictly" needed
in the attached V2.

Also ensured that hsf is set to on on the cascading standby, to be on the
safe side of things.

Pushed this yesterday.

--
With Regards,
Amit Kapila.

#57Drouvot, Bertrand
bertranddrouvot.pg@gmail.com
In reply to: Amit Kapila (#56)
Re: Add two missing tests in 035_standby_logical_decoding.pl

On 5/10/23 12:41 PM, Amit Kapila wrote:

On Tue, May 9, 2023 at 12:44 PM Drouvot, Bertrand

Pushed this yesterday.

Thanks!

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com