persist logical slots to disk during shutdown checkpoint

amit.kapila16@gmail.com

over 2 years ago

In reply to: Julien Rouhaud (#2)

Re: persist logical slots to disk during shutdown checkpoint

On Sat, Aug 19, 2023 at 12:46 PM Julien Rouhaud <rjuju123@gmail.com> wrote:

On Sat, 19 Aug 2023, 14:16 Amit Kapila, <amit.kapila16@gmail.com> wrote:

The idea discussed in the thread [1] is to always persist logical
slots to disk during the shutdown checkpoint. I have extracted the
patch to achieve the same from that thread and attached it here. This
could lead to some overhead during shutdown (checkpoint) if there are
many slots but it is probably a one-time work.

I couldn't think of better ideas but another possibility is to mark
the slot as dirty when we update the confirm_flush LSN (see
LogicalConfirmReceivedLocation()). However, that would be a bigger
overhead in the running server as it could be a frequent operation and
could lead to more writes.

Yeah I didn't find any better option either at that time. I still think that forcing persistence on shutdown is the best compromise. If we tried to always mark the slot as dirty, we would be sure to add regular overhead but we would probably end up persisting the slot on disk on shutdown anyway most of the time, so I don't think it would be a good compromise.

The other possibility is that we introduce yet another dirty flag for
slots, say dirty_for_shutdown_checkpoint which will be set when we
update confirmed_flush LSN. The flag will be cleared each time we
persist the slot but we won't persist if only this flag is set. We can
then use it during the shutdown checkpoint to decide whether to
persist the slot.

--
With Regards,
Amit Kapila.

ashutosh.bapat.oss@gmail.com

over 2 years ago

In reply to: Amit Kapila (#3)

Re: persist logical slots to disk during shutdown checkpoint

On Sun, Aug 20, 2023 at 8:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sat, Aug 19, 2023 at 12:46 PM Julien Rouhaud <rjuju123@gmail.com> wrote:

On Sat, 19 Aug 2023, 14:16 Amit Kapila, <amit.kapila16@gmail.com> wrote:

The idea discussed in the thread [1] is to always persist logical
slots to disk during the shutdown checkpoint. I have extracted the
patch to achieve the same from that thread and attached it here. This
could lead to some overhead during shutdown (checkpoint) if there are
many slots but it is probably a one-time work.

I couldn't think of better ideas but another possibility is to mark
the slot as dirty when we update the confirm_flush LSN (see
LogicalConfirmReceivedLocation()). However, that would be a bigger
overhead in the running server as it could be a frequent operation and
could lead to more writes.

Yeah I didn't find any better option either at that time. I still think that forcing persistence on shutdown is the best compromise. If we tried to always mark the slot as dirty, we would be sure to add regular overhead but we would probably end up persisting the slot on disk on shutdown anyway most of the time, so I don't think it would be a good compromise.

The other possibility is that we introduce yet another dirty flag for
slots, say dirty_for_shutdown_checkpoint which will be set when we
update confirmed_flush LSN. The flag will be cleared each time we
persist the slot but we won't persist if only this flag is set. We can
then use it during the shutdown checkpoint to decide whether to
persist the slot.

There are already two booleans controlling dirty-ness of slot, dirty
and just_dirty. Adding third will created more confusion.

Another idea is to record the confirm_flush_lsn at the time of
persisting the slot. We can use it in two different ways 1. to mark a
slot dirty and persist if the last confirm_flush_lsn when slot was
persisted was too far from the current confirm_flush_lsn of the slot.
2. at shutdown checkpoint, persist all the slots which have these two
confirm_flush_lsns different.

--
Best Wishes,
Ashutosh Bapat

amit.kapila16@gmail.com

over 2 years ago

In reply to: Ashutosh Bapat (#4)

Re: persist logical slots to disk during shutdown checkpoint

On Mon, Aug 21, 2023 at 6:36 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

On Sun, Aug 20, 2023 at 8:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

The other possibility is that we introduce yet another dirty flag for
slots, say dirty_for_shutdown_checkpoint which will be set when we
update confirmed_flush LSN. The flag will be cleared each time we
persist the slot but we won't persist if only this flag is set. We can
then use it during the shutdown checkpoint to decide whether to
persist the slot.

There are already two booleans controlling dirty-ness of slot, dirty
and just_dirty. Adding third will created more confusion.

Another idea is to record the confirm_flush_lsn at the time of
persisting the slot. We can use it in two different ways 1. to mark a
slot dirty and persist if the last confirm_flush_lsn when slot was
persisted was too far from the current confirm_flush_lsn of the slot.
2. at shutdown checkpoint, persist all the slots which have these two
confirm_flush_lsns different.

I think using it in the second (2) way sounds advantageous as compared
to storing another dirty flag because this requires us to update
last_persisted_confirm_flush_lsn only while writing the slot info.
OTOH, having a flag dirty_for_shutdown_checkpoint will require us to
update it each time we update confirm_flush_lsn under spinlock at
multiple places. But, I don't see the need of doing what you proposed
in (1) as the use case for it is very minor, basically this may
sometimes help us to avoid decoding after crash recovery.

--
With Regards,
Amit Kapila.

ashutosh.bapat.oss@gmail.com

over 2 years ago

In reply to: Amit Kapila (#5)

Re: persist logical slots to disk during shutdown checkpoint

On Tue, Aug 22, 2023 at 9:48 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Another idea is to record the confirm_flush_lsn at the time of
persisting the slot. We can use it in two different ways 1. to mark a
slot dirty and persist if the last confirm_flush_lsn when slot was
persisted was too far from the current confirm_flush_lsn of the slot.
2. at shutdown checkpoint, persist all the slots which have these two
confirm_flush_lsns different.

I think using it in the second (2) way sounds advantageous as compared
to storing another dirty flag because this requires us to update
last_persisted_confirm_flush_lsn only while writing the slot info.
OTOH, having a flag dirty_for_shutdown_checkpoint will require us to
update it each time we update confirm_flush_lsn under spinlock at
multiple places. But, I don't see the need of doing what you proposed
in (1) as the use case for it is very minor, basically this may
sometimes help us to avoid decoding after crash recovery.

Once we have last_persisted_confirm_flush_lsn, (1) is just an
optimization on top of that. With that we take the opportunity to
persist confirmed_flush_lsn which is much farther than the current
persisted value and thus improving chances of updating restart_lsn and
catalog_xmin faster after a WAL sender restart. We need to keep that
in mind when implementing (2). The problem is if we don't implement
(1) right now, we might just forget to do that small incremental
change in future. My preference is 1. Do both (1) and (2) together 2.
Do (2) first and then (1) as a separate commit. 3. Just implement (2)
if we don't have time at all for first two options.

--
Best Wishes,
Ashutosh Bapat

amit.kapila16@gmail.com

over 2 years ago

In reply to: Ashutosh Bapat (#6)

Re: persist logical slots to disk during shutdown checkpoint

On Tue, Aug 22, 2023 at 2:56 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

On Tue, Aug 22, 2023 at 9:48 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Another idea is to record the confirm_flush_lsn at the time of
persisting the slot. We can use it in two different ways 1. to mark a
slot dirty and persist if the last confirm_flush_lsn when slot was
persisted was too far from the current confirm_flush_lsn of the slot.
2. at shutdown checkpoint, persist all the slots which have these two
confirm_flush_lsns different.

I think using it in the second (2) way sounds advantageous as compared
to storing another dirty flag because this requires us to update
last_persisted_confirm_flush_lsn only while writing the slot info.
OTOH, having a flag dirty_for_shutdown_checkpoint will require us to
update it each time we update confirm_flush_lsn under spinlock at
multiple places. But, I don't see the need of doing what you proposed
in (1) as the use case for it is very minor, basically this may
sometimes help us to avoid decoding after crash recovery.

Once we have last_persisted_confirm_flush_lsn, (1) is just an
optimization on top of that. With that we take the opportunity to
persist confirmed_flush_lsn which is much farther than the current
persisted value and thus improving chances of updating restart_lsn and
catalog_xmin faster after a WAL sender restart. We need to keep that
in mind when implementing (2). The problem is if we don't implement
(1) right now, we might just forget to do that small incremental
change in future. My preference is 1. Do both (1) and (2) together 2.
Do (2) first and then (1) as a separate commit. 3. Just implement (2)
if we don't have time at all for first two options.

I prefer one of (2) or (3). Anyway, it is better to do that
optimization (persist confirm_flush_lsn at a regular interval) as a
separate patch as we need to test and prove its value separately.

--
With Regards,
Amit Kapila.

[1]: /messages/by-id/CAHut+Ptb=ZYTM_awoLy3sJ5m9Oxe=JYn6Gve5rSW9cUdThpsVA@mail.gmail.com

ashutosh.bapat.oss@gmail.com

over 2 years ago

In reply to: Amit Kapila (#7)

Re: persist logical slots to disk during shutdown checkpoint

On Tue, Aug 22, 2023 at 3:42 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Once we have last_persisted_confirm_flush_lsn, (1) is just an
optimization on top of that. With that we take the opportunity to
persist confirmed_flush_lsn which is much farther than the current
persisted value and thus improving chances of updating restart_lsn and
catalog_xmin faster after a WAL sender restart. We need to keep that
in mind when implementing (2). The problem is if we don't implement
(1) right now, we might just forget to do that small incremental
change in future. My preference is 1. Do both (1) and (2) together 2.
Do (2) first and then (1) as a separate commit. 3. Just implement (2)
if we don't have time at all for first two options.

I prefer one of (2) or (3). Anyway, it is better to do that
optimization (persist confirm_flush_lsn at a regular interval) as a
separate patch as we need to test and prove its value separately.

Fine with me.

--
Best Wishes,
Ashutosh Bapat

Hayato Kuroda (Fujitsu)

kuroda.hayato@fujitsu.com

over 2 years ago

In reply to: Amit Kapila (#1)

1 attachment(s)

RE: persist logical slots to disk during shutdown checkpoint

Dear hackers,

Thanks for forking the thread! I think we would choose another design, but I wanted
to post the updated version once with the current approach. All comments came
from the parent thread [1]/messages/by-id/CAHut+Ptb=ZYTM_awoLy3sJ5m9Oxe=JYn6Gve5rSW9cUdThpsVA@mail.gmail.com.

1. GENERAL -- git apply

The patch fails to apply cleanly. There are whitespace warnings.

[postgres(at)CentOS7-x64 oss_postgres_misc]$ git apply
../patches_misc/v23-0001-Always-persist-to-disk-logical-slots-during-a-sh.patch
../patches_misc/v23-0001-Always-persist-to-disk-logical-slots-during-a-sh.patch:102:
trailing whitespace.
# SHUTDOWN_CHECKPOINT record.
warning: 1 line adds whitespace errors.

There was an extra blank, removed.

2. GENERAL -- which patch is the real one and which is the copy?

IMO this patch has become muddled.

Amit recently created a new thread [1] "persist logical slots to disk
during shutdown checkpoint", which I thought was dedicated to the
discussion/implementation of this 0001 patch. Therefore, I expected any
0001 patch changes to would be made only in that new thread from now on,
(and maybe you would mirror them here in this thread).

But now I see there are v23-0001 patch changes here again. So, now the same
patch is in 2 places and they are different. It is no longer clear to me
which 0001 ("Always persist...") patch is the definitive one, and which one
is the copy.

Attached one in another thread is just copy to make cfbot happy, it could be
ignored.

contrib/test_decoding/t/002_always_persist.pl

3.
+
+# Copyright (c) 2023, PostgreSQL Global Development Group
+
+# Test logical replication slots are always persist to disk during a
shutdown
+# checkpoint.
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;

/always persist/always persisted/

Fixed.

4.
+
+# Test set-up
my $node = PostgreSQL::Test::Cluster->new('test');
$node->init(allows_streaming => 'logical');
$node->append_conf('postgresql.conf', q{
autovacuum = off
checkpoint_timeout = 1h
});

$node->start;

# Create table
$node->safe_psql('postgres', "CREATE TABLE test (id int)");

Maybe it is better to call the table something different instead of the
same name as the cluster. e.g. 'test_tbl' would be better.

Changed to 'test_tbl'.

5.
+# Shutdown the node once to do shutdown checkpoint
$node->stop();

SUGGESTION
# Stop the node to cause a shutdown checkpoint

Fixed.

6.
+# Fetch checkPoint from the control file itself
my ($stdout, $stderr) = run_command([ 'pg_controldata', $node->data_dir ]);
my @control_data = split("\n", $stdout);
my $latest_checkpoint = undef;
foreach (@control_data)
{
if ($_ =~ /^Latest checkpoint location:\s*(.*)$/mg)
{
$latest_checkpoint = $1;
last;
}
}
die "No checkPoint in control file found\n"
unless defined($latest_checkpoint);

6a.
/checkPoint/checkpoint/ (2x)

6b.
+die "No checkPoint in control file found\n"

SUGGESTION
"No checkpoint found in control file\n"

Hmm, these notations were followed the test recovery/t/016_min_consistency.pl,
it uses the word "minRecoveryPoint". So I preferred current one.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

v2-0001-Always-persist-to-disk-logical-slots-during-a-shu.patchapplication/octet-stream; name=v2-0001-Always-persist-to-disk-logical-slots-during-a-shu.patchDownload

From d0cb0d9fecb692b4c0c3940144e3e351cfeadcfc Mon Sep 17 00:00:00 2001
From: Julien Rouhaud <julien.rouhaud@free.fr>
Date: Fri, 14 Apr 2023 13:49:09 +0800
Subject: [PATCH v2] Always persist to disk logical slots during a shutdown
 checkpoint.

It's entirely possible for a logical slot to have a confirmed_flush_lsn higher
than the last value saved on disk while not being marked as dirty.  It's
currently not a problem to lose that value during a clean shutdown / restart
cycle, but a later patch adding support for pg_upgrade of publications and
logical slots will rely on that value being properly persisted to disk.

Author: Julien Rouhaud
Reviewed-by: Wang Wei, Peter Smith, Masahiko Sawada
---
 contrib/test_decoding/meson.build             |  1 +
 contrib/test_decoding/t/002_always_persist.pl | 74 +++++++++++++++++++
 src/backend/access/transam/xlog.c             |  2 +-
 src/backend/replication/slot.c                | 25 ++++---
 src/include/replication/slot.h                |  2 +-
 5 files changed, 92 insertions(+), 12 deletions(-)
 create mode 100644 contrib/test_decoding/t/002_always_persist.pl

diff --git a/contrib/test_decoding/meson.build b/contrib/test_decoding/meson.build
index 7b05cc25a3..12afb9ea8c 100644
--- a/contrib/test_decoding/meson.build
+++ b/contrib/test_decoding/meson.build
@@ -72,6 +72,7 @@ tests += {
   'tap': {
     'tests': [
       't/001_repl_stats.pl',
+      't/002_always_persist.pl',
     ],
   },
 }
diff --git a/contrib/test_decoding/t/002_always_persist.pl b/contrib/test_decoding/t/002_always_persist.pl
new file mode 100644
index 0000000000..47bf64bb09
--- /dev/null
+++ b/contrib/test_decoding/t/002_always_persist.pl
@@ -0,0 +1,74 @@
+
+# Copyright (c) 2023, PostgreSQL Global Development Group
+
+# Test logical replication slots are always persisted to disk during a shutdown
+# checkpoint.
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Test set-up
+my $node = PostgreSQL::Test::Cluster->new('test');
+$node->init(allows_streaming => 'logical');
+$node->append_conf('postgresql.conf', q{
+autovacuum = off
+checkpoint_timeout = 1h
+});
+
+$node->start;
+
+# Create table
+$node->safe_psql('postgres', "CREATE TABLE test_tbl (id int)");
+
+# Create replication slot
+$node->safe_psql('postgres',
+	"SELECT pg_create_logical_replication_slot('regression_slot1', 'test_decoding');"
+);
+
+# Insert some data
+$node->safe_psql('postgres',
+	"INSERT INTO test_tbl VALUES (generate_series(1, 5));");
+
+# Consume WAL records
+$node->safe_psql('postgres',
+    "SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot1', NULL, NULL);"
+);
+
+# Stop the node to cause a shutdown checkpoint
+$node->stop();
+
+# Fetch checkPoint from the control file itself
+my ($stdout, $stderr) = run_command([ 'pg_controldata', $node->data_dir ]);
+my @control_data = split("\n", $stdout);
+my $latest_checkpoint = undef;
+foreach (@control_data)
+{
+	if ($_ =~ /^Latest checkpoint location:\s*(.*)$/mg)
+	{
+		$latest_checkpoint = $1;
+		last;
+	}
+}
+die "No checkPoint in control file found\n"
+  unless defined($latest_checkpoint);
+
+# Boot the node again and check confirmed_flush_lsn. If the slot has persisted,
+# the LSN becomes the same as the latest checkpoint location, which means the
+# SHUTDOWN_CHECKPOINT record.
+$node->start();
+my $confirmed_flush = $node->safe_psql('postgres',
+	"SELECT confirmed_flush_lsn FROM pg_replication_slots;"
+);
+
+# Compare confirmed_flush_lsn and checkPoint
+ok($confirmed_flush eq $latest_checkpoint,
+	"Check confirmed_flush is same as latest checkpoint location");
+
+# Shutdown
+$node->stop;
+
+done_testing();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 60c0b7ec3a..6dced61cf4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7026,7 +7026,7 @@ static void
 CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 {
 	CheckPointRelationMap();
-	CheckPointReplicationSlots();
+	CheckPointReplicationSlots(flags & CHECKPOINT_IS_SHUTDOWN);
 	CheckPointSnapBuild();
 	CheckPointLogicalRewriteHeap();
 	CheckPointReplicationOrigin();
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 1dc27264f6..4d1e2d193e 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -109,7 +109,8 @@ static void ReplicationSlotDropPtr(ReplicationSlot *slot);
 /* internal persistency functions */
 static void RestoreSlotFromDisk(const char *name);
 static void CreateSlotOnDisk(ReplicationSlot *slot);
-static void SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel);
+static void SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel,
+						   bool is_shutdown);
 
 /*
  * Report shared-memory space needed by ReplicationSlotsShmemInit.
@@ -783,7 +784,7 @@ ReplicationSlotSave(void)
 	Assert(MyReplicationSlot != NULL);
 
 	sprintf(path, "pg_replslot/%s", NameStr(MyReplicationSlot->data.name));
-	SaveSlotToPath(MyReplicationSlot, path, ERROR);
+	SaveSlotToPath(MyReplicationSlot, path, ERROR, false);
 }
 
 /*
@@ -1565,11 +1566,10 @@ restart:
 /*
  * Flush all replication slots to disk.
  *
- * This needn't actually be part of a checkpoint, but it's a convenient
- * location.
+ * is_shutdown is true in case of a shutdown checkpoint.
  */
 void
-CheckPointReplicationSlots(void)
+CheckPointReplicationSlots(bool is_shutdown)
 {
 	int			i;
 
@@ -1594,7 +1594,7 @@ CheckPointReplicationSlots(void)
 
 		/* save the slot to disk, locking is handled in SaveSlotToPath() */
 		sprintf(path, "pg_replslot/%s", NameStr(s->data.name));
-		SaveSlotToPath(s, path, LOG);
+		SaveSlotToPath(s, path, LOG, is_shutdown);
 	}
 	LWLockRelease(ReplicationSlotAllocationLock);
 }
@@ -1700,7 +1700,7 @@ CreateSlotOnDisk(ReplicationSlot *slot)
 
 	/* Write the actual state file. */
 	slot->dirty = true;			/* signal that we really need to write */
-	SaveSlotToPath(slot, tmppath, ERROR);
+	SaveSlotToPath(slot, tmppath, ERROR, false);
 
 	/* Rename the directory into place. */
 	if (rename(tmppath, path) != 0)
@@ -1726,7 +1726,8 @@ CreateSlotOnDisk(ReplicationSlot *slot)
  * Shared functionality between saving and creating a replication slot.
  */
 static void
-SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
+SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel,
+			   bool is_shutdown)
 {
 	char		tmppath[MAXPGPATH];
 	char		path[MAXPGPATH];
@@ -1740,8 +1741,12 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 	slot->just_dirtied = false;
 	SpinLockRelease(&slot->mutex);
 
-	/* and don't do anything if there's nothing to write */
-	if (!was_dirty)
+	/*
+	 * Don't do anything if there's nothing to write, unless this is called for
+	 * a logical slot during a shutdown checkpoint, as we want to persist the
+	 * confirmed_flush LSN in that case, even if that's the only modification.
+	 */
+	if (!was_dirty && !(SlotIsLogical(slot) && is_shutdown))
 		return;
 
 	LWLockAcquire(&slot->io_in_progress_lock, LW_EXCLUSIVE);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index a8a89dc784..7ca37c9f70 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -241,7 +241,7 @@ extern void ReplicationSlotNameForTablesync(Oid suboid, Oid relid, char *syncslo
 extern void ReplicationSlotDropAtPubNode(WalReceiverConn *wrconn, char *slotname, bool missing_ok);
 
 extern void StartupReplicationSlots(void);
-extern void CheckPointReplicationSlots(void);
+extern void CheckPointReplicationSlots(bool is_shutdown);
 
 extern void CheckSlotRequirements(void);
 extern void CheckSlotPermissions(void);
-- 
2.27.0

#10

vignesh21@gmail.com

over 2 years ago

In reply to: Amit Kapila (#7)

1 attachment(s)

Re: persist logical slots to disk during shutdown checkpoint

On Tue, 22 Aug 2023 at 15:42, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Aug 22, 2023 at 2:56 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

On Tue, Aug 22, 2023 at 9:48 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Another idea is to record the confirm_flush_lsn at the time of
persisting the slot. We can use it in two different ways 1. to mark a
slot dirty and persist if the last confirm_flush_lsn when slot was
persisted was too far from the current confirm_flush_lsn of the slot.
2. at shutdown checkpoint, persist all the slots which have these two
confirm_flush_lsns different.

I think using it in the second (2) way sounds advantageous as compared
to storing another dirty flag because this requires us to update
last_persisted_confirm_flush_lsn only while writing the slot info.
OTOH, having a flag dirty_for_shutdown_checkpoint will require us to
update it each time we update confirm_flush_lsn under spinlock at
multiple places. But, I don't see the need of doing what you proposed
in (1) as the use case for it is very minor, basically this may
sometimes help us to avoid decoding after crash recovery.

Once we have last_persisted_confirm_flush_lsn, (1) is just an
optimization on top of that. With that we take the opportunity to
persist confirmed_flush_lsn which is much farther than the current
persisted value and thus improving chances of updating restart_lsn and
catalog_xmin faster after a WAL sender restart. We need to keep that
in mind when implementing (2). The problem is if we don't implement
(1) right now, we might just forget to do that small incremental
change in future. My preference is 1. Do both (1) and (2) together 2.
Do (2) first and then (1) as a separate commit. 3. Just implement (2)
if we don't have time at all for first two options.

I prefer one of (2) or (3). Anyway, it is better to do that
optimization (persist confirm_flush_lsn at a regular interval) as a
separate patch as we need to test and prove its value separately.

Here is a patch to persist to disk logical slots during a shutdown
checkpoint if the updated confirmed_flush_lsn has not yet been
persisted.

Regards,
Vignesh

Attachments:

v3-0001-Persist-to-disk-logical-slots-during-a-shutdown-c.patchtext/x-patch; charset=US-ASCII; name=v3-0001-Persist-to-disk-logical-slots-during-a-shutdown-c.patchDownload

From f99c7da6f1d65a1324dce4e280040dcb29a912db Mon Sep 17 00:00:00 2001
From: Julien Rouhaud <julien.rouhaud@free.fr>
Date: Fri, 14 Apr 2023 13:49:09 +0800
Subject: [PATCH v3] Persist to disk logical slots during a shutdown checkpoint
 if the updated confirmed_flush_lsn has not yet been persisted.

It's entirely possible for a logical slot to have a confirmed_flush_lsn higher
than the last value saved on disk while not being marked as dirty.  It's
currently not a problem to lose that value during a clean shutdown / restart
cycle, but a later patch adding support for pg_upgrade of publications and
logical slots will rely on that value being properly persisted to disk.

Author: Julien Rouhaud
Reviewed-by: Wang Wei, Peter Smith, Masahiko Sawada
---
 contrib/test_decoding/meson.build             |  1 +
 contrib/test_decoding/t/002_always_persist.pl | 74 +++++++++++++++++++
 src/backend/access/transam/xlog.c             |  2 +-
 src/backend/replication/slot.c                | 33 ++++++---
 src/include/replication/slot.h                |  5 +-
 5 files changed, 103 insertions(+), 12 deletions(-)
 create mode 100644 contrib/test_decoding/t/002_always_persist.pl

diff --git a/contrib/test_decoding/meson.build b/contrib/test_decoding/meson.build
index 7b05cc25a3..12afb9ea8c 100644
--- a/contrib/test_decoding/meson.build
+++ b/contrib/test_decoding/meson.build
@@ -72,6 +72,7 @@ tests += {
   'tap': {
     'tests': [
       't/001_repl_stats.pl',
+      't/002_always_persist.pl',
     ],
   },
 }
diff --git a/contrib/test_decoding/t/002_always_persist.pl b/contrib/test_decoding/t/002_always_persist.pl
new file mode 100644
index 0000000000..47bf64bb09
--- /dev/null
+++ b/contrib/test_decoding/t/002_always_persist.pl
@@ -0,0 +1,74 @@
+
+# Copyright (c) 2023, PostgreSQL Global Development Group
+
+# Test logical replication slots are always persisted to disk during a shutdown
+# checkpoint.
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Test set-up
+my $node = PostgreSQL::Test::Cluster->new('test');
+$node->init(allows_streaming => 'logical');
+$node->append_conf('postgresql.conf', q{
+autovacuum = off
+checkpoint_timeout = 1h
+});
+
+$node->start;
+
+# Create table
+$node->safe_psql('postgres', "CREATE TABLE test_tbl (id int)");
+
+# Create replication slot
+$node->safe_psql('postgres',
+	"SELECT pg_create_logical_replication_slot('regression_slot1', 'test_decoding');"
+);
+
+# Insert some data
+$node->safe_psql('postgres',
+	"INSERT INTO test_tbl VALUES (generate_series(1, 5));");
+
+# Consume WAL records
+$node->safe_psql('postgres',
+    "SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot1', NULL, NULL);"
+);
+
+# Stop the node to cause a shutdown checkpoint
+$node->stop();
+
+# Fetch checkPoint from the control file itself
+my ($stdout, $stderr) = run_command([ 'pg_controldata', $node->data_dir ]);
+my @control_data = split("\n", $stdout);
+my $latest_checkpoint = undef;
+foreach (@control_data)
+{
+	if ($_ =~ /^Latest checkpoint location:\s*(.*)$/mg)
+	{
+		$latest_checkpoint = $1;
+		last;
+	}
+}
+die "No checkPoint in control file found\n"
+  unless defined($latest_checkpoint);
+
+# Boot the node again and check confirmed_flush_lsn. If the slot has persisted,
+# the LSN becomes the same as the latest checkpoint location, which means the
+# SHUTDOWN_CHECKPOINT record.
+$node->start();
+my $confirmed_flush = $node->safe_psql('postgres',
+	"SELECT confirmed_flush_lsn FROM pg_replication_slots;"
+);
+
+# Compare confirmed_flush_lsn and checkPoint
+ok($confirmed_flush eq $latest_checkpoint,
+	"Check confirmed_flush is same as latest checkpoint location");
+
+# Shutdown
+$node->stop;
+
+done_testing();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 60c0b7ec3a..6dced61cf4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7026,7 +7026,7 @@ static void
 CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 {
 	CheckPointRelationMap();
-	CheckPointReplicationSlots();
+	CheckPointReplicationSlots(flags & CHECKPOINT_IS_SHUTDOWN);
 	CheckPointSnapBuild();
 	CheckPointLogicalRewriteHeap();
 	CheckPointReplicationOrigin();
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 1dc27264f6..ccd46d5442 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -109,7 +109,8 @@ static void ReplicationSlotDropPtr(ReplicationSlot *slot);
 /* internal persistency functions */
 static void RestoreSlotFromDisk(const char *name);
 static void CreateSlotOnDisk(ReplicationSlot *slot);
-static void SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel);
+static void SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel,
+						   bool is_shutdown);
 
 /*
  * Report shared-memory space needed by ReplicationSlotsShmemInit.
@@ -321,6 +322,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->candidate_xmin_lsn = InvalidXLogRecPtr;
 	slot->candidate_restart_valid = InvalidXLogRecPtr;
 	slot->candidate_restart_lsn = InvalidXLogRecPtr;
+	slot->last_persisted_confirmed_flush = InvalidXLogRecPtr;
 
 	/*
 	 * Create the slot on disk.  We haven't actually marked the slot allocated
@@ -783,7 +785,7 @@ ReplicationSlotSave(void)
 	Assert(MyReplicationSlot != NULL);
 
 	sprintf(path, "pg_replslot/%s", NameStr(MyReplicationSlot->data.name));
-	SaveSlotToPath(MyReplicationSlot, path, ERROR);
+	SaveSlotToPath(MyReplicationSlot, path, ERROR, false);
 }
 
 /*
@@ -1565,11 +1567,10 @@ restart:
 /*
  * Flush all replication slots to disk.
  *
- * This needn't actually be part of a checkpoint, but it's a convenient
- * location.
+ * is_shutdown is true in case of a shutdown checkpoint.
  */
 void
-CheckPointReplicationSlots(void)
+CheckPointReplicationSlots(bool is_shutdown)
 {
 	int			i;
 
@@ -1594,7 +1595,7 @@ CheckPointReplicationSlots(void)
 
 		/* save the slot to disk, locking is handled in SaveSlotToPath() */
 		sprintf(path, "pg_replslot/%s", NameStr(s->data.name));
-		SaveSlotToPath(s, path, LOG);
+		SaveSlotToPath(s, path, LOG, is_shutdown);
 	}
 	LWLockRelease(ReplicationSlotAllocationLock);
 }
@@ -1700,7 +1701,7 @@ CreateSlotOnDisk(ReplicationSlot *slot)
 
 	/* Write the actual state file. */
 	slot->dirty = true;			/* signal that we really need to write */
-	SaveSlotToPath(slot, tmppath, ERROR);
+	SaveSlotToPath(slot, tmppath, ERROR, false);
 
 	/* Rename the directory into place. */
 	if (rename(tmppath, path) != 0)
@@ -1726,7 +1727,8 @@ CreateSlotOnDisk(ReplicationSlot *slot)
  * Shared functionality between saving and creating a replication slot.
  */
 static void
-SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
+SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel,
+			   bool is_shutdown)
 {
 	char		tmppath[MAXPGPATH];
 	char		path[MAXPGPATH];
@@ -1740,8 +1742,16 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 	slot->just_dirtied = false;
 	SpinLockRelease(&slot->mutex);
 
-	/* and don't do anything if there's nothing to write */
-	if (!was_dirty)
+	/*
+	 * Don't do anything if there's nothing to write, unless this is called for
+	 * a logical slot during a shutdown checkpoint and if the updated
+	 * confirmed_flush LSN has not yet been persisted, as we want to persist
+	 * the updated confirmed_flush LSN in that case, even if that's the only
+	 * modification.
+	 */
+	if (!was_dirty &&
+		!(SlotIsLogical(slot) && is_shutdown &&
+		  (slot->data.confirmed_flush != slot->last_persisted_confirmed_flush)))
 		return;
 
 	LWLockAcquire(&slot->io_in_progress_lock, LW_EXCLUSIVE);
@@ -1871,6 +1881,8 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 	SpinLockAcquire(&slot->mutex);
 	if (!slot->just_dirtied)
 		slot->dirty = false;
+
+	slot->last_persisted_confirmed_flush = slot->data.confirmed_flush;
 	SpinLockRelease(&slot->mutex);
 
 	LWLockRelease(&slot->io_in_progress_lock);
@@ -2072,6 +2084,7 @@ RestoreSlotFromDisk(const char *name)
 		slot->candidate_xmin_lsn = InvalidXLogRecPtr;
 		slot->candidate_restart_lsn = InvalidXLogRecPtr;
 		slot->candidate_restart_valid = InvalidXLogRecPtr;
+		slot->last_persisted_confirmed_flush = InvalidXLogRecPtr;
 
 		slot->in_use = true;
 		slot->active_pid = 0;
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index a8a89dc784..b519f7af5f 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -178,6 +178,9 @@ typedef struct ReplicationSlot
 	XLogRecPtr	candidate_xmin_lsn;
 	XLogRecPtr	candidate_restart_valid;
 	XLogRecPtr	candidate_restart_lsn;
+
+	/* The last persisted confirmed flush lsn */
+	XLogRecPtr	last_persisted_confirmed_flush;
 } ReplicationSlot;
 
 #define SlotIsPhysical(slot) ((slot)->data.database == InvalidOid)
@@ -241,7 +244,7 @@ extern void ReplicationSlotNameForTablesync(Oid suboid, Oid relid, char *syncslo
 extern void ReplicationSlotDropAtPubNode(WalReceiverConn *wrconn, char *slotname, bool missing_ok);
 
 extern void StartupReplicationSlots(void);
-extern void CheckPointReplicationSlots(void);
+extern void CheckPointReplicationSlots(bool is_shutdown);
 
 extern void CheckSlotRequirements(void);
 extern void CheckSlotPermissions(void);
-- 
2.34.1

#11

Hayato Kuroda (Fujitsu)

kuroda.hayato@fujitsu.com

over 2 years ago

In reply to: vignesh C (#10)

1 attachment(s)

RE: persist logical slots to disk during shutdown checkpoint

Dear Vignesh,

Here is a patch to persist to disk logical slots during a shutdown
checkpoint if the updated confirmed_flush_lsn has not yet been
persisted.

Thanks for making the patch with different approach! Here are comments.

01. RestoreSlotFromDisk

```
slot->candidate_xmin_lsn = InvalidXLogRecPtr;
slot->candidate_restart_lsn = InvalidXLogRecPtr;
slot->candidate_restart_valid = InvalidXLogRecPtr;
+ slot->last_persisted_confirmed_flush = InvalidXLogRecPtr;
```

last_persisted_confirmed_flush was set to InvalidXLogRecPtr, but isn't it better
to use cp.slotdata. confirmed_flush? Assuming that the server is shut down immediately,
your patch forces to save.

02. t/002_always_persist.pl

The original author of the patch is me, but I found that the test could pass
without your patch. This is because pg_logical_slot_get_changes()->
pg_logical_slot_get_changes_guts(confirm = true) always mark the slot as dirty.
IIUC we must use the logical replication system to verify the persistence.
Attached test can pass only when patch is applied.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

another_test.patchapplication/octet-stream; name=another_test.patchDownload

diff --git a/src/test/subscription/meson.build b/src/test/subscription/meson.build
index bd673a9d68..cdd2f8ba47 100644
--- a/src/test/subscription/meson.build
+++ b/src/test/subscription/meson.build
@@ -40,6 +40,7 @@ tests += {
       't/031_column_list.pl',
       't/032_subscribe_use_index.pl',
       't/033_run_as_table_owner.pl',
+      't/034_always_persist.pl',
       't/100_bugs.pl',
     ],
   },
diff --git a/src/test/subscription/t/034_always_persist.pl b/src/test/subscription/t/034_always_persist.pl
new file mode 100644
index 0000000000..91481cdaaf
--- /dev/null
+++ b/src/test/subscription/t/034_always_persist.pl
@@ -0,0 +1,90 @@
+
+# Copyright (c) 2023, PostgreSQL Global Development Group
+
+# Test logical replication slots are always persisted to disk during a shutdown
+# checkpoint.
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+sub compare_confirmed_flush
+{
+	my ($node) = @_;
+
+	# Fetch checkPoint from the control file itself
+	my ($stdout, $stderr) = run_command([ 'pg_controldata', $node->data_dir ]);
+	my @control_data = split("\n", $stdout);
+	my $latest_checkpoint = undef;
+	foreach (@control_data)
+	{
+		if ($_ =~ /^Latest checkpoint location:\s*(.*)$/mg)
+		{
+			$latest_checkpoint = $1;
+			last;
+		}
+	}
+	die "No checkPoint in control file found\n"
+	  unless defined($latest_checkpoint);
+
+	# Fetch confirmed_flush_lsn from the pg_replication_slots
+	my $confirmed_flush = $node->safe_psql('postgres',
+		"SELECT confirmed_flush_lsn from pg_catalog.pg_replication_slots;"
+	);
+
+	# ...And compare them
+	ok($confirmed_flush eq $latest_checkpoint,
+		"Check confirmed_flush is same as latest checkpoint location");
+
+	return;
+}
+
+# Initialize publisher node
+my $node_publisher = PostgreSQL::Test::Cluster->new('pub');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', q{
+autovacuum = off
+checkpoint_timeout = 1h
+});
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = PostgreSQL::Test::Cluster->new('sub');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create table
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tbl (id int)");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tbl (id int)");
+
+# Insert some data
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tbl VALUES (generate_series(1, 5));"
+);
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION pub FOR ALL TABLES");
+$node_subscriber->safe_psql('postgres',
+	"CREATE SUBSCRIPTION sub CONNECTION '$publisher_connstr' PUBLICATION pub"
+);
+
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'sub');
+
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM test_tbl"
+);
+
+is($result, qq(5), "check initial copy was done");
+
+# Restart publisher once. If the slot has persisted, the confirmed_flush_lsn
+# becomes the same as the latest checkpoint location, which means the
+# SHUTDOWN_CHECKPOINT record.
+$node_publisher->restart();
+
+compare_confirmed_flush($node_publisher);
+
+done_testing();

#12

vignesh21@gmail.com

over 2 years ago

In reply to: Hayato Kuroda (Fujitsu) (#11)

Re: persist logical slots to disk during shutdown checkpoint

On Wed, 23 Aug 2023 at 14:21, Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Dear Vignesh,

Here is a patch to persist to disk logical slots during a shutdown
checkpoint if the updated confirmed_flush_lsn has not yet been
persisted.

Thanks for making the patch with different approach! Here are comments.

01. RestoreSlotFromDisk

```
slot->candidate_xmin_lsn = InvalidXLogRecPtr;
slot->candidate_restart_lsn = InvalidXLogRecPtr;
slot->candidate_restart_valid = InvalidXLogRecPtr;
+ slot->last_persisted_confirmed_flush = InvalidXLogRecPtr;
```

last_persisted_confirmed_flush was set to InvalidXLogRecPtr, but isn't it better
to use cp.slotdata. confirmed_flush? Assuming that the server is shut down immediately,
your patch forces to save.

02. t/002_always_persist.pl

The original author of the patch is me, but I found that the test could pass
without your patch. This is because pg_logical_slot_get_changes()->
pg_logical_slot_get_changes_guts(confirm = true) always mark the slot as dirty.
IIUC we must use the logical replication system to verify the persistence.
Attached test can pass only when patch is applied.

Here are few other comments that I noticed:

1) I too noticed that the test passes both with and without patch:
diff --git a/contrib/test_decoding/meson.build
b/contrib/test_decoding/meson.build
index 7b05cc25a3..12afb9ea8c 100644
--- a/contrib/test_decoding/meson.build
+++ b/contrib/test_decoding/meson.build
@@ -72,6 +72,7 @@ tests += {
   'tap': {
     'tests': [
       't/001_repl_stats.pl',
+      't/002_always_persist.pl',

2) change checkPoint to checkpoint:
2.a) checkPoint should be checkpoint to maintain consistency across the file:
+# Shutdown the node once to do shutdown checkpoint
+$node->stop();
+
+# Fetch checkPoint from the control file itself
+my ($stdout, $stderr) = run_command([ 'pg_controldata', $node->data_dir ]);
+my @control_data = split("\n", $stdout);
+my $latest_checkpoint = undef;

2.b) similarly here:
+die "No checkPoint in control file found\n"
+  unless defined($latest_checkpoint);

2.c) similarly here too:
+# Compare confirmed_flush_lsn and checkPoint
+ok($confirmed_flush eq $latest_checkpoint,
+       "Check confirmed_flush is same as latest checkpoint location");

3) change checkpoint to "Latest checkpoint location":
3.a) We should change "No checkPoint in control file found\n" to:
"Latest checkpoint location not found in control file\n" as there are
many checkpoint entries in control data

+foreach (@control_data)
+{
+       if ($_ =~ /^Latest checkpoint location:\s*(.*)$/mg)
+       {
+               $latest_checkpoint = $1;
+               last;
+       }
+}
+die "No checkPoint in control file found\n"
+  unless defined($latest_checkpoint);

3.b) We should change "Fetch checkPoint from the control file itself" to:
"Fetch Latest checkpoint location from the control file"

+# Fetch checkPoint from the control file itself
+my ($stdout, $stderr) = run_command([ 'pg_controldata', $node->data_dir ]);
+my @control_data = split("\n", $stdout);
+my $latest_checkpoint = undef;
+foreach (@control_data)
+{

Regards,
Vignesh

#13

vignesh21@gmail.com

over 2 years ago

In reply to: Hayato Kuroda (Fujitsu) (#11)

1 attachment(s)

Re: persist logical slots to disk during shutdown checkpoint

On Wed, 23 Aug 2023 at 14:21, Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Dear Vignesh,

Here is a patch to persist to disk logical slots during a shutdown
checkpoint if the updated confirmed_flush_lsn has not yet been
persisted.

Thanks for making the patch with different approach! Here are comments.

01. RestoreSlotFromDisk

```
slot->candidate_xmin_lsn = InvalidXLogRecPtr;
slot->candidate_restart_lsn = InvalidXLogRecPtr;
slot->candidate_restart_valid = InvalidXLogRecPtr;
+ slot->last_persisted_confirmed_flush = InvalidXLogRecPtr;
```

last_persisted_confirmed_flush was set to InvalidXLogRecPtr, but isn't it better
to use cp.slotdata. confirmed_flush? Assuming that the server is shut down immediately,
your patch forces to save.

Modified

02. t/002_always_persist.pl

The original author of the patch is me, but I found that the test could pass
without your patch. This is because pg_logical_slot_get_changes()->
pg_logical_slot_get_changes_guts(confirm = true) always mark the slot as dirty.
IIUC we must use the logical replication system to verify the persistence.
Attached test can pass only when patch is applied.

Update the test based on your another_test with slight modifications.

Attached v4 version patch has the changes for the same.

Regards,
Vignesh

Attachments:

v4-0001-Persist-to-disk-logical-slots-during-a-shutdown-c.patchtext/x-patch; charset=US-ASCII; name=v4-0001-Persist-to-disk-logical-slots-during-a-shutdown-c.patchDownload

From 67ecd2665e6dfa8f5446140e16e45592644cead4 Mon Sep 17 00:00:00 2001
From: Julien Rouhaud <julien.rouhaud@free.fr>
Date: Fri, 14 Apr 2023 13:49:09 +0800
Subject: [PATCH v4] Persist to disk logical slots during a shutdown checkpoint
 if the updated confirmed_flush_lsn has not yet been persisted.

It's entirely possible for a logical slot to have a confirmed_flush_lsn higher
than the last value saved on disk while not being marked as dirty.  It's
currently not a problem to lose that value during a clean shutdown / restart
cycle, but a later patch adding support for pg_upgrade of publications and
logical slots will rely on that value being properly persisted to disk.

Author: Julien Rouhaud
Reviewed-by: Wang Wei, Peter Smith, Masahiko Sawada
---
 src/backend/access/transam/xlog.c             |   2 +-
 src/backend/replication/slot.c                |  33 ++++--
 src/include/replication/slot.h                |   5 +-
 src/test/subscription/meson.build             |   1 +
 src/test/subscription/t/034_always_persist.pl | 106 ++++++++++++++++++
 5 files changed, 135 insertions(+), 12 deletions(-)
 create mode 100644 src/test/subscription/t/034_always_persist.pl

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 60c0b7ec3a..6dced61cf4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7026,7 +7026,7 @@ static void
 CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 {
 	CheckPointRelationMap();
-	CheckPointReplicationSlots();
+	CheckPointReplicationSlots(flags & CHECKPOINT_IS_SHUTDOWN);
 	CheckPointSnapBuild();
 	CheckPointLogicalRewriteHeap();
 	CheckPointReplicationOrigin();
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 1dc27264f6..7fa683d563 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -109,7 +109,8 @@ static void ReplicationSlotDropPtr(ReplicationSlot *slot);
 /* internal persistency functions */
 static void RestoreSlotFromDisk(const char *name);
 static void CreateSlotOnDisk(ReplicationSlot *slot);
-static void SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel);
+static void SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel,
+						   bool is_shutdown);
 
 /*
  * Report shared-memory space needed by ReplicationSlotsShmemInit.
@@ -321,6 +322,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->candidate_xmin_lsn = InvalidXLogRecPtr;
 	slot->candidate_restart_valid = InvalidXLogRecPtr;
 	slot->candidate_restart_lsn = InvalidXLogRecPtr;
+	slot->last_persisted_confirmed_flush = InvalidXLogRecPtr;
 
 	/*
 	 * Create the slot on disk.  We haven't actually marked the slot allocated
@@ -783,7 +785,7 @@ ReplicationSlotSave(void)
 	Assert(MyReplicationSlot != NULL);
 
 	sprintf(path, "pg_replslot/%s", NameStr(MyReplicationSlot->data.name));
-	SaveSlotToPath(MyReplicationSlot, path, ERROR);
+	SaveSlotToPath(MyReplicationSlot, path, ERROR, false);
 }
 
 /*
@@ -1565,11 +1567,10 @@ restart:
 /*
  * Flush all replication slots to disk.
  *
- * This needn't actually be part of a checkpoint, but it's a convenient
- * location.
+ * is_shutdown is true in case of a shutdown checkpoint.
  */
 void
-CheckPointReplicationSlots(void)
+CheckPointReplicationSlots(bool is_shutdown)
 {
 	int			i;
 
@@ -1594,7 +1595,7 @@ CheckPointReplicationSlots(void)
 
 		/* save the slot to disk, locking is handled in SaveSlotToPath() */
 		sprintf(path, "pg_replslot/%s", NameStr(s->data.name));
-		SaveSlotToPath(s, path, LOG);
+		SaveSlotToPath(s, path, LOG, is_shutdown);
 	}
 	LWLockRelease(ReplicationSlotAllocationLock);
 }
@@ -1700,7 +1701,7 @@ CreateSlotOnDisk(ReplicationSlot *slot)
 
 	/* Write the actual state file. */
 	slot->dirty = true;			/* signal that we really need to write */
-	SaveSlotToPath(slot, tmppath, ERROR);
+	SaveSlotToPath(slot, tmppath, ERROR, false);
 
 	/* Rename the directory into place. */
 	if (rename(tmppath, path) != 0)
@@ -1726,7 +1727,8 @@ CreateSlotOnDisk(ReplicationSlot *slot)
  * Shared functionality between saving and creating a replication slot.
  */
 static void
-SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
+SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel,
+			   bool is_shutdown)
 {
 	char		tmppath[MAXPGPATH];
 	char		path[MAXPGPATH];
@@ -1740,8 +1742,16 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 	slot->just_dirtied = false;
 	SpinLockRelease(&slot->mutex);
 
-	/* and don't do anything if there's nothing to write */
-	if (!was_dirty)
+	/*
+	 * Don't do anything if there's nothing to write, unless this is called for
+	 * a logical slot during a shutdown checkpoint and if the updated
+	 * confirmed_flush LSN has not yet been persisted, as we want to persist
+	 * the updated confirmed_flush LSN in that case, even if that's the only
+	 * modification.
+	 */
+	if (!was_dirty &&
+		!(SlotIsLogical(slot) && is_shutdown &&
+		  (slot->data.confirmed_flush != slot->last_persisted_confirmed_flush)))
 		return;
 
 	LWLockAcquire(&slot->io_in_progress_lock, LW_EXCLUSIVE);
@@ -1871,6 +1881,8 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 	SpinLockAcquire(&slot->mutex);
 	if (!slot->just_dirtied)
 		slot->dirty = false;
+
+	slot->last_persisted_confirmed_flush = slot->data.confirmed_flush;
 	SpinLockRelease(&slot->mutex);
 
 	LWLockRelease(&slot->io_in_progress_lock);
@@ -2067,6 +2079,7 @@ RestoreSlotFromDisk(const char *name)
 		/* initialize in memory state */
 		slot->effective_xmin = cp.slotdata.xmin;
 		slot->effective_catalog_xmin = cp.slotdata.catalog_xmin;
+		slot->last_persisted_confirmed_flush =  cp.slotdata.confirmed_flush;
 
 		slot->candidate_catalog_xmin = InvalidTransactionId;
 		slot->candidate_xmin_lsn = InvalidXLogRecPtr;
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index a8a89dc784..b519f7af5f 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -178,6 +178,9 @@ typedef struct ReplicationSlot
 	XLogRecPtr	candidate_xmin_lsn;
 	XLogRecPtr	candidate_restart_valid;
 	XLogRecPtr	candidate_restart_lsn;
+
+	/* The last persisted confirmed flush lsn */
+	XLogRecPtr	last_persisted_confirmed_flush;
 } ReplicationSlot;
 
 #define SlotIsPhysical(slot) ((slot)->data.database == InvalidOid)
@@ -241,7 +244,7 @@ extern void ReplicationSlotNameForTablesync(Oid suboid, Oid relid, char *syncslo
 extern void ReplicationSlotDropAtPubNode(WalReceiverConn *wrconn, char *slotname, bool missing_ok);
 
 extern void StartupReplicationSlots(void);
-extern void CheckPointReplicationSlots(void);
+extern void CheckPointReplicationSlots(bool is_shutdown);
 
 extern void CheckSlotRequirements(void);
 extern void CheckSlotPermissions(void);
diff --git a/src/test/subscription/meson.build b/src/test/subscription/meson.build
index bd673a9d68..cdd2f8ba47 100644
--- a/src/test/subscription/meson.build
+++ b/src/test/subscription/meson.build
@@ -40,6 +40,7 @@ tests += {
       't/031_column_list.pl',
       't/032_subscribe_use_index.pl',
       't/033_run_as_table_owner.pl',
+      't/034_always_persist.pl',
       't/100_bugs.pl',
     ],
   },
diff --git a/src/test/subscription/t/034_always_persist.pl b/src/test/subscription/t/034_always_persist.pl
new file mode 100644
index 0000000000..9973476fff
--- /dev/null
+++ b/src/test/subscription/t/034_always_persist.pl
@@ -0,0 +1,106 @@
+
+# Copyright (c) 2023, PostgreSQL Global Development Group
+
+# Test logical replication slots are always persisted to disk during a shutdown
+# checkpoint.
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+sub compare_confirmed_flush
+{
+	my ($node, $confirmed_flush_from_log) = @_;
+
+	# Fetch Latest checkpoint location from the control file itself
+	my ($stdout, $stderr) = run_command([ 'pg_controldata', $node->data_dir ]);
+	my @control_data = split("\n", $stdout);
+	my $latest_checkpoint = undef;
+	foreach (@control_data)
+	{
+		if ($_ =~ /^Latest checkpoint location:\s*(.*)$/mg)
+		{
+			$latest_checkpoint = $1;
+			last;
+		}
+	}
+	die "Latest checkpoint location not found in control file found\n"
+	  unless defined($latest_checkpoint);
+
+	# Is it same as the value read from log?
+	ok($latest_checkpoint eq $confirmed_flush_from_log,
+		"Check the decoding starts from the confirmed_flush which is the same as the latest_checkpoint");
+
+	return;
+}
+
+# Initialize publisher node
+my $node_publisher = PostgreSQL::Test::Cluster->new('pub');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', q{
+autovacuum = off
+checkpoint_timeout = 1h
+});
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = PostgreSQL::Test::Cluster->new('sub');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create table
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tbl (id int)");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tbl (id int)");
+
+# Insert some data
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tbl VALUES (generate_series(1, 5));"
+);
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION pub FOR ALL TABLES");
+$node_subscriber->safe_psql('postgres',
+	"CREATE SUBSCRIPTION sub CONNECTION '$publisher_connstr' PUBLICATION pub"
+);
+
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'sub');
+
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM test_tbl"
+);
+
+is($result, qq(5), "check initial copy was done");
+
+# Set wal_receiver_status_interval to zero to suppress keepalive messages
+# between nodes.
+$node_subscriber->append_conf('postgresql.conf', q{
+wal_receiver_status_interval = 0
+});
+$node_subscriber->reload();
+
+my $offset = -s $node_publisher->logfile;
+
+# Restart publisher once. If the slot has persisted, the confirmed_flush_lsn
+# becomes the same as the latest checkpoint location, which means the
+# SHUTDOWN_CHECKPOINT record.
+$node_publisher->restart();
+
+# Wait until the walsender creates decoding context
+$node_publisher->wait_for_log(
+	qr/Streaming transactions committing after ([A-F0-9]+\/[A-F0-9]+), reading WAL from ([A-F0-9]+\/[A-F0-9]+)./,
+	$offset
+);
+
+# Extract confirmed_flush from the logfile
+my $log_contents = slurp_file($node_publisher->logfile, $offset);
+$log_contents =~
+	qr/Streaming transactions committing after ([A-F0-9]+\/[A-F0-9]+), reading WAL from ([A-F0-9]+\/[A-F0-9]+)./
+	or die "could not get confirmed_flush_lsn";
+
+compare_confirmed_flush($node_publisher, $1);
+
+done_testing();
-- 
2.34.1

#14

[1]: /messages/by-id/CALDaNm0VrAt24e2FxbOX6eJQ-G_tZ0gVpsFBjzQM99NxG0hZfg@mail.gmail.com

vignesh21@gmail.com

over 2 years ago

In reply to: Amit Kapila (#1)

2 attachment(s)

Re: persist logical slots to disk during shutdown checkpoint

On Sat, 19 Aug 2023 at 11:53, Amit Kapila <amit.kapila16@gmail.com> wrote:

It's entirely possible for a logical slot to have a confirmed_flush
LSN higher than the last value saved on disk while not being marked as
dirty. It's currently not a problem to lose that value during a clean
shutdown / restart cycle but to support the upgrade of logical slots
[1] (see latest patch at [2]), we seem to rely on that value being
properly persisted to disk. During the upgrade, we need to verify that
all the data prior to shudown_checkpoint for the logical slots has
been consumed, otherwise, the downstream may miss some data. Now, to
ensure the same, we are planning to compare the confirm_flush LSN
location with the latest shudown_checkpoint location which means that
the confirm_flush LSN should be updated after restart.

I think this is inefficient even without an upgrade because, after the
restart, this may lead to decoding some data again. Say, we process
some transactions for which we didn't send anything downstream (the
changes got filtered) but the confirm_flush LSN is updated due to
keepalives. As we don't flush the latest value of confirm_flush LSN,
it may lead to processing the same changes again.

I was able to test and verify that we were not processing the same
changes again.
Note: The 0001-Add-logs-to-skip-transaction-filter-insert-operation.patch
has logs to print if a decode transaction is skipped and also a log to
mention if any operation is filtered.
The test.sh script has the steps for a) setting up logical replication
for a table b) perform insert on table that need to be published (this
will be replicated to the subscriber) c) perform insert on a table
that will not be published (this insert will be filtered, it will not
be replicated) d) sleep for 5 seconds e) stop the server f) start the
server
I used the following steps, do the following in HEAD:
a) Apply 0001-Add-logs-to-skip-transaction-filter-insert-operation.patch
patch in Head and build the binaries b) execute test.sh c) view N1.log
file to see that the insert operations were filtered again by seeing
the following logs:
LOG: Filter insert for table tbl2
...
===restart===
...
LOG: Skipping transaction 0/156AD10 as start decode at is greater 0/156AE40
...
LOG: Filter insert for table tbl2

We can see that the insert operations on tbl2 which was filtered
before server was stopped is again filtered after restart too in HEAD.

Lets see that the same changes were not processed again with patch:
a) Apply v4-0001-Persist-to-disk-logical-slots-during-a-shutdown-c.patch
from [1]/messages/by-id/CALDaNm0VrAt24e2FxbOX6eJQ-G_tZ0gVpsFBjzQM99NxG0hZfg@mail.gmail.com also apply
0001-Add-logs-to-skip-transaction-filter-insert-operation.patch patch
and build the binaries b) execute test.sh c) view N1.log file to see
that the insert operations were skipped after restart of server by
seeing the following logs:
LOG: Filter insert for table tbl2
...
===restart===
...
Skipping transaction 0/156AD10 as start decode at is greater 0/156AFB0
...
Skipping transaction 0/156AE80 as start decode at is greater 0/156AFB0

We can see that the insert operations on tbl2 are not processed again
after restart with the patch.

Regards,
Vignesh

Attachments:

test.shtext/x-sh; charset=US-ASCII; name=test.shDownload

0001-Add-logs-to-skip-transaction-filter-insert-operation.patchapplication/octet-stream; name=0001-Add-logs-to-skip-transaction-filter-insert-operation.patchDownload

From 14bd5cf2a3f8673fdd4930b5d8840559802a9feb Mon Sep 17 00:00:00 2001
From: Vignesh C <vignesh21@gmail.com>
Date: Fri, 25 Aug 2023 15:43:19 +0530
Subject: [PATCH] Add logs to skip transaction/filter insert operation.

Add logs to skip transaction/filter insert operation.
---
 src/backend/replication/logical/snapbuild.c | 2 ++
 src/backend/replication/pgoutput/pgoutput.c | 4 ++++
 2 files changed, 6 insertions(+)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 843ceba840..6aad93bc2a 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -433,6 +433,8 @@ SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr)
 bool
 SnapBuildXactNeedsSkip(SnapBuild *builder, XLogRecPtr ptr)
 {
+	if (ptr < builder->start_decoding_at)
+		elog (LOG, "Skipping transaction %X/%X as start decode at is greater %X/%X", LSN_FORMAT_ARGS(ptr), LSN_FORMAT_ARGS(builder->start_decoding_at));
 	return ptr < builder->start_decoding_at;
 }
 
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index b08ca55041..61ac1e6a20 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -1431,7 +1431,11 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	{
 		case REORDER_BUFFER_CHANGE_INSERT:
 			if (!relentry->pubactions.pubinsert)
+			{
+				elog(LOG, "Filter insert for table %s", RelationGetRelationName(relation));
 				return;
+			}
+				
 			break;
 		case REORDER_BUFFER_CHANGE_UPDATE:
 			if (!relentry->pubactions.pubupdate)
-- 
2.34.1

#15

[1]: /messages/by-id/CALj2ACW0rV5gWK8A3m6_X62qH+Vfaq5hznC=i0R5Wojt5+yhyw@mail.gmail.com
[2]: /messages/by-id/CALDaNm2BboFuFVYxyzP4wkv7=8+_TwsD+ugyGhtibTSF4m4XRg@mail.gmail.com

vignesh21@gmail.com

over 2 years ago

In reply to: vignesh C (#14)

1 attachment(s)

Re: persist logical slots to disk during shutdown checkpoint

On Fri, 25 Aug 2023 at 17:40, vignesh C <vignesh21@gmail.com> wrote:

On Sat, 19 Aug 2023 at 11:53, Amit Kapila <amit.kapila16@gmail.com> wrote:

It's entirely possible for a logical slot to have a confirmed_flush
LSN higher than the last value saved on disk while not being marked as
dirty. It's currently not a problem to lose that value during a clean
shutdown / restart cycle but to support the upgrade of logical slots
[1] (see latest patch at [2]), we seem to rely on that value being
properly persisted to disk. During the upgrade, we need to verify that
all the data prior to shudown_checkpoint for the logical slots has
been consumed, otherwise, the downstream may miss some data. Now, to
ensure the same, we are planning to compare the confirm_flush LSN
location with the latest shudown_checkpoint location which means that
the confirm_flush LSN should be updated after restart.

I think this is inefficient even without an upgrade because, after the
restart, this may lead to decoding some data again. Say, we process
some transactions for which we didn't send anything downstream (the
changes got filtered) but the confirm_flush LSN is updated due to
keepalives. As we don't flush the latest value of confirm_flush LSN,
it may lead to processing the same changes again.

I was able to test and verify that we were not processing the same
changes again.
Note: The 0001-Add-logs-to-skip-transaction-filter-insert-operation.patch
has logs to print if a decode transaction is skipped and also a log to
mention if any operation is filtered.
The test.sh script has the steps for a) setting up logical replication
for a table b) perform insert on table that need to be published (this
will be replicated to the subscriber) c) perform insert on a table
that will not be published (this insert will be filtered, it will not
be replicated) d) sleep for 5 seconds e) stop the server f) start the
server
I used the following steps, do the following in HEAD:
a) Apply 0001-Add-logs-to-skip-transaction-filter-insert-operation.patch
patch in Head and build the binaries b) execute test.sh c) view N1.log
file to see that the insert operations were filtered again by seeing
the following logs:
LOG: Filter insert for table tbl2
...
===restart===
...
LOG: Skipping transaction 0/156AD10 as start decode at is greater 0/156AE40
...
LOG: Filter insert for table tbl2

We can see that the insert operations on tbl2 which was filtered
before server was stopped is again filtered after restart too in HEAD.

Lets see that the same changes were not processed again with patch:
a) Apply v4-0001-Persist-to-disk-logical-slots-during-a-shutdown-c.patch
from [1] also apply
0001-Add-logs-to-skip-transaction-filter-insert-operation.patch patch
and build the binaries b) execute test.sh c) view N1.log file to see
that the insert operations were skipped after restart of server by
seeing the following logs:
LOG: Filter insert for table tbl2
...
===restart===
...
Skipping transaction 0/156AD10 as start decode at is greater 0/156AFB0
...
Skipping transaction 0/156AE80 as start decode at is greater 0/156AFB0

We can see that the insert operations on tbl2 are not processed again
after restart with the patch.

Here is another way to test using pg_replslotdata approach that was
proposed earlier at [1]/messages/by-id/CALj2ACW0rV5gWK8A3m6_X62qH+Vfaq5hznC=i0R5Wojt5+yhyw@mail.gmail.com.
I have rebased this on top of HEAD and the v5 version for the same is attached.

We can use the same test as test.sh shared at [2]/messages/by-id/CALDaNm2BboFuFVYxyzP4wkv7=8+_TwsD+ugyGhtibTSF4m4XRg@mail.gmail.com.
When executed with HEAD, it was noticed that confirmed_flush points to
WAL location before both the transaction:
slot_name slot_type datoid persistency xmin catalog_xmin
restart_lsn confirmed_flush two_phase_at two_phase
plugin
--------- --------- ------ ---------- ----
----------- ----------- ---------------
------------ --------- ------
sub logical 5 persistent 0
735 0/1531E28 0/1531E60 0/0
0 pgoutput

WAL record information generated using pg_walinspect for various
records at and after confirmed_flush WAL 0/1531E60:
row_number | start_lsn | end_lsn | prev_lsn | xid |
resource_manager | record_type | record_length |
main_data_length | fpi_length |
description
|
block_ref
------------+-----------+-----------+-----------+-----+------------------+---------------------+---------------+------------------+------------+-------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------
1 | 0/1531E60 | 0/1531EA0 | 0/1531E28 | 0 | Heap2
| PRUNE | 57 | 9 |
0 | snapshotConflictHorizon: 0, nredirected: 0, ndead: 1, nunused: 0,
redirected: [], dead: [1]/messages/by-id/CALj2ACW0rV5gWK8A3m6_X62qH+Vfaq5hznC=i0R5Wojt5+yhyw@mail.gmail.com, unused: []
|
blkref #0: rel 1663/5/1255 fork main blk 58
2 | 0/1531EA0 | 0/1531EE0 | 0/1531E60 | 735 | Heap
| INSERT+INIT | 59 | 3 |
0 | off: 1, flags: 0x08

|
blkref #0: rel 1663/5/16384 fork main blk 0
3 | 0/1531EE0 | 0/1531F20 | 0/1531EA0 | 735 | Heap
| INSERT | 59 | 3 |
0 | off: 2, flags: 0x08

|
blkref #0: rel 1663/5/16384 fork main blk 0
4 | 0/1531F20 | 0/1531F60 | 0/1531EE0 | 735 | Heap
| INSERT | 59 | 3 |
0 | off: 3, flags: 0x08

|
blkref #0: rel 1663/5/16384 fork main blk 0
5 | 0/1531F60 | 0/1531FA0 | 0/1531F20 | 735 | Heap
| INSERT | 59 | 3 |
0 | off: 4, flags: 0x08

|
blkref #0: rel 1663/5/16384 fork main blk 0
6 | 0/1531FA0 | 0/1531FE0 | 0/1531F60 | 735 | Heap
| INSERT | 59 | 3 |
0 | off: 5, flags: 0x08

|
blkref #0: rel 1663/5/16384 fork main blk 0
7 | 0/1531FE0 | 0/1532028 | 0/1531FA0 | 735 | Transaction
| COMMIT | 46 | 20 |
0 | 2023-08-27 23:22:17.161215+05:30

|
8 | 0/1532028 | 0/1532068 | 0/1531FE0 | 736 | Heap
| INSERT+INIT | 59 | 3 |
0 | off: 1, flags: 0x08

|
blkref #0: rel 1663/5/16387 fork main blk 0
9 | 0/1532068 | 0/15320A8 | 0/1532028 | 736 | Heap
| INSERT | 59 | 3 |
0 | off: 2, flags: 0x08

|
blkref #0: rel 1663/5/16387 fork main blk 0
10 | 0/15320A8 | 0/15320E8 | 0/1532068 | 736 | Heap
| INSERT | 59 | 3 |
0 | off: 3, flags: 0x08

|
blkref #0: rel 1663/5/16387 fork main blk 0
11 | 0/15320E8 | 0/1532128 | 0/15320A8 | 736 | Heap
| INSERT | 59 | 3 |
0 | off: 4, flags: 0x08

|
blkref #0: rel 1663/5/16387 fork main blk 0
12 | 0/1532128 | 0/1532168 | 0/15320E8 | 736 | Heap
| INSERT | 59 | 3 |
0 | off: 5, flags: 0x08

|
blkref #0: rel 1663/5/16387 fork main blk 0
13 | 0/1532168 | 0/1532198 | 0/1532128 | 736 | Transaction
| COMMIT | 46 | 20 |
0 | 2023-08-27 23:22:17.174756+05:30

|
14 | 0/1532198 | 0/1532210 | 0/1532168 | 0 | XLOG
| CHECKPOINT_SHUTDOWN | 114 | 88 |
0 | redo 0/1532198; tli 1; prev tli 1; fpw true; xid 0:737; oid 16399;
multi 1; offset 0; oldest xid 723 in DB 1; oldest multi 1 in DB 1;
oldest/newest commit timestamp xid: 0/0; oldest running xid 0;
shutdown |

Whereas the same test executed with the patch applied shows that
confirmed_flush points to CHECKPOINT_SHUTDOWN record:
slot_name slot_type datoid persistency xmin catalog_xmin
restart_lsn confirmed_flush two_phase_at two_phase
plugin
--------- --------- ------ ----------- ---
----------- ----------- ---------------
----------- --------- ------
sub logical 5 persistent 0 735
0/1531E28 0/1532198 0/0 0
pgoutput

Regards,
Vignesh

Attachments:

v5-0001-pg_replslotdata.patchtext/x-patch; charset=US-ASCII; name=v5-0001-pg_replslotdata.patchDownload

From 934f2d32bf6d3b3bb8ae0fcf334aa371fc95de19 Mon Sep 17 00:00:00 2001
From: Vignesh C <vignesh21@gmail.com>
Date: Sun, 27 Aug 2023 22:03:27 +0530
Subject: [PATCH v5] pg_replslotdata

TODO: Display invalidation information for replication slot.
---
 src/backend/replication/slot.c            |  39 ---
 src/bin/Makefile                          |   1 +
 src/bin/pg_replslotdata/.gitignore        |   2 +
 src/bin/pg_replslotdata/Makefile          |  44 +++
 src/bin/pg_replslotdata/nls.mk            |   6 +
 src/bin/pg_replslotdata/pg_replslotdata.c | 362 ++++++++++++++++++++++
 src/bin/pg_replslotdata/t/001_basic.pl    |  11 +
 src/include/replication/slot.h            |  97 +-----
 src/include/replication/slot_common.h     | 147 +++++++++
 9 files changed, 574 insertions(+), 135 deletions(-)
 create mode 100644 src/bin/pg_replslotdata/.gitignore
 create mode 100644 src/bin/pg_replslotdata/Makefile
 create mode 100644 src/bin/pg_replslotdata/nls.mk
 create mode 100644 src/bin/pg_replslotdata/pg_replslotdata.c
 create mode 100644 src/bin/pg_replslotdata/t/001_basic.pl
 create mode 100644 src/include/replication/slot_common.h

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index bb09c4010f..567d61540a 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -53,45 +53,6 @@
 #include "storage/procarray.h"
 #include "utils/builtins.h"
 
-/*
- * Replication slot on-disk data structure.
- */
-typedef struct ReplicationSlotOnDisk
-{
-	/* first part of this struct needs to be version independent */
-
-	/* data not covered by checksum */
-	uint32		magic;
-	pg_crc32c	checksum;
-
-	/* data covered by checksum */
-	uint32		version;
-	uint32		length;
-
-	/*
-	 * The actual data in the slot that follows can differ based on the above
-	 * 'version'.
-	 */
-
-	ReplicationSlotPersistentData slotdata;
-} ReplicationSlotOnDisk;
-
-/* size of version independent data */
-#define ReplicationSlotOnDiskConstantSize \
-	offsetof(ReplicationSlotOnDisk, slotdata)
-/* size of the part of the slot not covered by the checksum */
-#define ReplicationSlotOnDiskNotChecksummedSize  \
-	offsetof(ReplicationSlotOnDisk, version)
-/* size of the part covered by the checksum */
-#define ReplicationSlotOnDiskChecksummedSize \
-	sizeof(ReplicationSlotOnDisk) - ReplicationSlotOnDiskNotChecksummedSize
-/* size of the slot data that is version dependent */
-#define ReplicationSlotOnDiskV2Size \
-	sizeof(ReplicationSlotOnDisk) - ReplicationSlotOnDiskConstantSize
-
-#define SLOT_MAGIC		0x1051CA1	/* format identifier */
-#define SLOT_VERSION	3		/* version for new files */
-
 /* Control array for replication slot management */
 ReplicationSlotCtlData *ReplicationSlotCtl = NULL;
 
diff --git a/src/bin/Makefile b/src/bin/Makefile
index 373077bf52..1db7dd00d4 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -23,6 +23,7 @@ SUBDIRS = \
 	pg_controldata \
 	pg_ctl \
 	pg_dump \
+	pg_replslotdata \
 	pg_resetwal \
 	pg_rewind \
 	pg_test_fsync \
diff --git a/src/bin/pg_replslotdata/.gitignore b/src/bin/pg_replslotdata/.gitignore
new file mode 100644
index 0000000000..13a4afb8ef
--- /dev/null
+++ b/src/bin/pg_replslotdata/.gitignore
@@ -0,0 +1,2 @@
+/pg_replslotdata
+/tmp_check/
diff --git a/src/bin/pg_replslotdata/Makefile b/src/bin/pg_replslotdata/Makefile
new file mode 100644
index 0000000000..69518ee53b
--- /dev/null
+++ b/src/bin/pg_replslotdata/Makefile
@@ -0,0 +1,44 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_replslotdata
+#
+# Copyright (c) 1998-2021, PostgreSQL Global Development Group
+#
+# src/bin/pg_replslotdata/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_replslotdata - provides information about the replication slots from $PGDATA/pg_replslot/<slot_name> $PGDATA/pg_replslot/<slot_name>"
+PGAPPICON=win32
+
+subdir = src/bin/pg_replslotdata
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	$(WIN32RES) \
+	pg_replslotdata.o
+
+all: pg_replslotdata
+
+pg_replslotdata: $(OBJS) | submake-libpgport
+	$(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+install: all installdirs
+	$(INSTALL_PROGRAM) pg_replslotdata$(X) '$(DESTDIR)$(bindir)/pg_replslotdata$(X)'
+
+installdirs:
+	$(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+	rm -f '$(DESTDIR)$(bindir)/pg_replslotdata$(X)'
+
+clean distclean maintainer-clean:
+	rm -f pg_replslotdata$(X) $(OBJS)
+	rm -rf tmp_check
+
+check:
+	$(prove_check)
+
+installcheck:
+	$(prove_installcheck)
diff --git a/src/bin/pg_replslotdata/nls.mk b/src/bin/pg_replslotdata/nls.mk
new file mode 100644
index 0000000000..74bee593c9
--- /dev/null
+++ b/src/bin/pg_replslotdata/nls.mk
@@ -0,0 +1,6 @@
+# src/bin/pg_replslotdata/nls.mk
+CATALOG_NAME     = pg_replslotdata
+AVAIL_LANGUAGES  = cs de el es fr ja ko pl ru sv tr uk vi zh_CN
+GETTEXT_FILES    = $(FRONTEND_COMMON_GETTEXT_FILES) pg_replslotdata.c
+GETTEXT_TRIGGERS = $(FRONTEND_COMMON_GETTEXT_TRIGGERS)
+GETTEXT_FLAGS    = $(FRONTEND_COMMON_GETTEXT_FLAGS)
diff --git a/src/bin/pg_replslotdata/pg_replslotdata.c b/src/bin/pg_replslotdata/pg_replslotdata.c
new file mode 100644
index 0000000000..aed5d6750c
--- /dev/null
+++ b/src/bin/pg_replslotdata/pg_replslotdata.c
@@ -0,0 +1,362 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_replslotdata.c - provides information about the replication slots
+ * from $PGDATA/pg_replslot/<slot_name>.
+ *
+ * Copyright (c) 2021, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		  src/bin/pg_replslotdata/pg_replslotdata.c
+ *-------------------------------------------------------------------------
+ */
+/*
+ * We have to use postgres.h not postgres_fe.h here, because there's so much
+ * backend-only stuff in the XLOG include files we need.  But we need a
+ * frontend-ish environment otherwise.  Hence this ugly hack.
+ */
+#define FRONTEND 1
+
+#include "postgres.h"
+
+#include <dirent.h>
+#include <sys/stat.h>
+
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "common/logging.h"
+#include "common/string.h"
+#include "getopt_long.h"
+#include "pg_getopt.h"
+#include "replication/slot_common.h"
+
+static bool verbose = false;
+
+static void process_replslots(void);
+static void read_and_display_repl_slot(const char *name);
+
+static void
+usage(const char *progname)
+{
+	printf(_("%s displays information about the replication slots from $PGDATA/pg_replslot/<slot_name>.\n\n"), progname);
+	printf(_("Usage:\n"));
+	printf(_("  %s [OPTION] [DATADIR]\n"), progname);
+	printf(_("\nOptions:\n"));
+	printf(_(" [-D, --pgdata=]DATADIR  data directory\n"));
+	printf(_("  -V, --version          output version information, then exit\n"));
+	printf(_("  -v, --verbose          write a lot of output\n"));
+	printf(_("  -?, --help             show this help, then exit\n"));
+	printf(_("\nIf no data directory (DATADIR) is specified, "
+			 "the environment variable PGDATA\nis used.\n\n"));
+	printf(_("Report bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+	printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
+
+static void
+process_replslots(void)
+{
+	DIR		   *rsdir;
+	struct dirent *rsde;
+	uint32		cnt = 0;
+
+	rsdir = opendir("pg_replslot");
+	if (rsdir == NULL)
+	{
+		pg_log_error("could not open directory \"%s\": %m", "pg_replslot");
+		exit(1);
+	}
+
+	/* XXX: comment here about the format spefiiers */
+	printf("%-64s %9s %10s %11s %10s %12s %21s %21s %21s %21s %10s %20s\n"
+		   "%-64s %9s %10s %11s %10s %12s %21s %21s %21s %21s %10s %20s\n",
+		   "slot_name", "slot_type", "datoid", "persistency", "xmin", "catalog_xmin", "restart_lsn", "invalidated_at", "confirmed_flush", "two_phase_at", "two_phase", "plugin",
+		   "---------", "---------", "------", "-----------", "----", "------------", "-----------", "--------------", "---------------", "------------", "---------", "------");
+
+	while (errno = 0, (rsde = readdir(rsdir)) != NULL)
+	{
+		struct stat statbuf;
+		char		path[MAXPGPATH];
+
+		if (strcmp(rsde->d_name, ".") == 0 ||
+			strcmp(rsde->d_name, "..") == 0)
+			continue;
+
+		snprintf(path, sizeof(path), "pg_replslot/%s", rsde->d_name);
+
+		/* we're only creating directories here, skip if it's not our's */
+		if (lstat(path, &statbuf) == 0 && !S_ISDIR(statbuf.st_mode))
+			continue;
+
+		/* we crashed while a slot was being setup or deleted, clean up */
+		if (pg_str_endswith(rsde->d_name, ".tmp"))
+		{
+			pg_log_warning("server was crashed while the slot \"%s\" was being setup or deleted",
+						   rsde->d_name);
+			continue;
+		}
+
+		/* looks like a slot in a normal state, restore */
+		read_and_display_repl_slot(rsde->d_name);
+		cnt++;
+	}
+
+	if (errno)
+	{
+		pg_log_error("could not read directory \"%s\": %m", "pg_replslot");
+		exit(1);
+	}
+
+	if (cnt == 0)
+	{
+		pg_log_info("no replication slots were found");
+		exit(0);
+	}
+
+	if (closedir(rsdir))
+	{
+		pg_log_error("could not close directory \"%s\": %m", "pg_replslot");
+		exit(1);
+	}
+}
+
+static void
+read_and_display_repl_slot(const char *name)
+{
+	ReplicationSlotOnDisk cp;
+	char		slotdir[MAXPGPATH];
+	char		path[MAXPGPATH];
+	char		restart_lsn[NAMEDATALEN];
+	char		confirmed_flush[NAMEDATALEN];
+	char		two_phase_at[NAMEDATALEN];
+	char		persistency[NAMEDATALEN];
+	int			fd;
+	int			readBytes;
+	pg_crc32c	checksum;
+
+	/* delete temp file if it exists */
+	sprintf(slotdir, "pg_replslot/%s", name);
+	sprintf(path, "%s/state.tmp", slotdir);
+
+	fd = open(path, O_RDONLY | PG_BINARY, 0);
+
+	if (fd > 0)
+	{
+		pg_log_error("found temporary state file \"%s\": %m", path);
+		exit(1);
+	}
+
+	sprintf(path, "%s/state", slotdir);
+
+	if (verbose)
+		pg_log_info("reading replication slot from \"%s\"", path);
+
+	fd = open(path, O_RDONLY | PG_BINARY, 0);
+
+	/*
+	 * We do not need to handle this as we are rename()ing the directory into
+	 * place only after we fsync()ed the state file.
+	 */
+	if (fd < 0)
+	{
+		pg_log_error("could not open file \"%s\": %m", path);
+		exit(1);
+	}
+
+	if (verbose)
+		pg_log_info("reading version independent replication slot state file");
+
+	/* read part of statefile that's guaranteed to be version independent */
+	readBytes = read(fd, &cp, ReplicationSlotOnDiskConstantSize);
+	if (readBytes != ReplicationSlotOnDiskConstantSize)
+	{
+		if (readBytes < 0)
+		{
+			pg_log_error("could not read file \"%s\": %m", path);
+			exit(1);
+		}
+		else
+		{
+			pg_log_error("could not read file \"%s\": read %d of %zu",
+						 path, readBytes,
+						 (Size) ReplicationSlotOnDiskConstantSize);
+			exit(1);
+		}
+	}
+
+	/* verify magic */
+	if (cp.magic != SLOT_MAGIC)
+	{
+		pg_log_error("replication slot file \"%s\" has wrong magic number: %u instead of %u",
+					 path, cp.magic, SLOT_MAGIC);
+		exit(1);
+	}
+
+	/* verify version */
+	if (cp.version != SLOT_VERSION)
+	{
+		pg_log_error("replication slot file \"%s\" has unsupported version %u",
+					 path, cp.version);
+		exit(1);
+	}
+
+	/* boundary check on length */
+	if (cp.length != ReplicationSlotOnDiskV2Size)
+	{
+		pg_log_error("replication slot file \"%s\" has corrupted length %u",
+					 path, cp.length);
+		exit(1);
+	}
+
+	if (verbose)
+		pg_log_info("reading the entire replication slot state file");
+
+	/* now that we know the size, read the entire file */
+	readBytes = read(fd,
+					 (char *) &cp + ReplicationSlotOnDiskConstantSize,
+					 cp.length);
+	if (readBytes != cp.length)
+	{
+		if (readBytes < 0)
+		{
+			pg_log_error("could not read file \"%s\": %m", path);
+			exit(1);
+		}
+		else
+		{
+			pg_log_error("could not read file \"%s\": read %d of %zu",
+						 path, readBytes, (Size) cp.length);
+			exit(1);
+		}
+	}
+
+	if (close(fd) != 0)
+	{
+		pg_log_error("could not close file \"%s\": %m", path);
+		exit(1);
+	}
+
+	/* now verify the CRC */
+	INIT_CRC32C(checksum);
+	COMP_CRC32C(checksum,
+				(char *) &cp + ReplicationSlotOnDiskNotChecksummedSize,
+				ReplicationSlotOnDiskChecksummedSize);
+	FIN_CRC32C(checksum);
+
+	if (!EQ_CRC32C(checksum, cp.checksum))
+	{
+		pg_log_error("checksum mismatch for replication slot file \"%s\": is %u, should be %u",
+					 path, checksum, cp.checksum);
+		exit(1);
+	}
+
+	sprintf(restart_lsn, "%X/%X", LSN_FORMAT_ARGS(cp.slotdata.restart_lsn));
+	sprintf(confirmed_flush, "%X/%X", LSN_FORMAT_ARGS(cp.slotdata.confirmed_flush));
+	sprintf(two_phase_at, "%X/%X", LSN_FORMAT_ARGS(cp.slotdata.two_phase_at));
+
+	if (cp.slotdata.persistency == RS_PERSISTENT)
+		sprintf(persistency, "persistent");
+	else if (cp.slotdata.persistency == RS_EPHEMERAL)
+		sprintf(persistency, "ephemeral");
+	else if (cp.slotdata.persistency == RS_TEMPORARY)
+		sprintf(persistency, "temporary");
+
+	/* display the slot information */
+	printf("%-64s %9s %10u %11s %10u %12u %21s %21s %21s %10d %20s\n",
+		   NameStr(cp.slotdata.name),
+		   cp.slotdata.database == InvalidOid ? "physical" : "logical",
+		   cp.slotdata.database,
+		   persistency,
+		   cp.slotdata.xmin,
+		   cp.slotdata.catalog_xmin,
+		   restart_lsn,
+		   confirmed_flush,
+		   two_phase_at,
+		   cp.slotdata.two_phase,
+		   NameStr(cp.slotdata.plugin));
+}
+
+int
+main(int argc, char *argv[])
+{
+	static struct option long_options[] = {
+		{"pgdata", required_argument, NULL, 'D'},
+		{"verbose", no_argument, NULL, 'v'},
+		{NULL, 0, NULL, 0}
+	};
+
+	char	   *DataDir = NULL;
+	const char *progname;
+	int			c;
+
+	pg_logging_init(argv[0]);
+	set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN("pg_replslotdata"));
+	progname = get_progname(argv[0]);
+
+	if (argc > 1)
+	{
+		if (strcmp(argv[1], "--help") == 0 || strcmp(argv[1], "-?") == 0)
+		{
+			usage(progname);
+			exit(0);
+		}
+		if (strcmp(argv[1], "--version") == 0 || strcmp(argv[1], "-V") == 0)
+		{
+			puts("pg_replslotdata (PostgreSQL) " PG_VERSION);
+			exit(0);
+		}
+	}
+
+	while ((c = getopt_long(argc, argv, "D:v", long_options, NULL)) != -1)
+	{
+		switch (c)
+		{
+			case 'D':
+				DataDir = optarg;
+				break;
+			case 'v':
+				verbose = true;
+				break;
+			default:
+				fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
+				exit(1);
+		}
+	}
+
+	if (DataDir == NULL)
+	{
+		if (optind < argc)
+			DataDir = argv[optind++];
+		else
+			DataDir = getenv("PGDATA");
+	}
+
+	/* complain if any arguments remain */
+	if (optind < argc)
+	{
+		pg_log_error("too many command-line arguments (first is \"%s\")",
+					 argv[optind]);
+		fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
+				progname);
+		exit(1);
+	}
+
+	if (DataDir == NULL)
+	{
+		pg_log_error("no data directory specified");
+		fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
+		exit(1);
+	}
+
+	if (verbose)
+		pg_log_info("data directory: \"%s\"", DataDir);
+
+	if (chdir(DataDir) < 0)
+	{
+		pg_log_error("could not change directory to \"%s\": %m",
+					 DataDir);
+		exit(1);
+	}
+
+	process_replslots();
+
+	return 0;
+}
diff --git a/src/bin/pg_replslotdata/t/001_basic.pl b/src/bin/pg_replslotdata/t/001_basic.pl
new file mode 100644
index 0000000000..d6830dc2ac
--- /dev/null
+++ b/src/bin/pg_replslotdata/t/001_basic.pl
@@ -0,0 +1,11 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Utils;
+use Test::More tests => 8;
+
+program_help_ok('pg_replslotdata');
+program_version_ok('pg_replslotdata');
+program_options_handling_ok('pg_replslotdata');
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index a8a89dc784..a7d16a37a3 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -15,104 +15,9 @@
 #include "storage/lwlock.h"
 #include "storage/shmem.h"
 #include "storage/spin.h"
+#include "replication/slot_common.h"
 #include "replication/walreceiver.h"
 
-/*
- * Behaviour of replication slots, upon release or crash.
- *
- * Slots marked as PERSISTENT are crash-safe and will not be dropped when
- * released. Slots marked as EPHEMERAL will be dropped when released or after
- * restarts.  Slots marked TEMPORARY will be dropped at the end of a session
- * or on error.
- *
- * EPHEMERAL is used as a not-quite-ready state when creating persistent
- * slots.  EPHEMERAL slots can be made PERSISTENT by calling
- * ReplicationSlotPersist().  For a slot that goes away at the end of a
- * session, TEMPORARY is the appropriate choice.
- */
-typedef enum ReplicationSlotPersistency
-{
-	RS_PERSISTENT,
-	RS_EPHEMERAL,
-	RS_TEMPORARY
-} ReplicationSlotPersistency;
-
-/*
- * Slots can be invalidated, e.g. due to max_slot_wal_keep_size. If so, the
- * 'invalidated' field is set to a value other than _NONE.
- */
-typedef enum ReplicationSlotInvalidationCause
-{
-	RS_INVAL_NONE,
-	/* required WAL has been removed */
-	RS_INVAL_WAL_REMOVED,
-	/* required rows have been removed */
-	RS_INVAL_HORIZON,
-	/* wal_level insufficient for slot */
-	RS_INVAL_WAL_LEVEL,
-} ReplicationSlotInvalidationCause;
-
-/*
- * On-Disk data of a replication slot, preserved across restarts.
- */
-typedef struct ReplicationSlotPersistentData
-{
-	/* The slot's identifier */
-	NameData	name;
-
-	/* database the slot is active on */
-	Oid			database;
-
-	/*
-	 * The slot's behaviour when being dropped (or restored after a crash).
-	 */
-	ReplicationSlotPersistency persistency;
-
-	/*
-	 * xmin horizon for data
-	 *
-	 * NB: This may represent a value that hasn't been written to disk yet;
-	 * see notes for effective_xmin, below.
-	 */
-	TransactionId xmin;
-
-	/*
-	 * xmin horizon for catalog tuples
-	 *
-	 * NB: This may represent a value that hasn't been written to disk yet;
-	 * see notes for effective_xmin, below.
-	 */
-	TransactionId catalog_xmin;
-
-	/* oldest LSN that might be required by this replication slot */
-	XLogRecPtr	restart_lsn;
-
-	/* RS_INVAL_NONE if valid, or the reason for having been invalidated */
-	ReplicationSlotInvalidationCause invalidated;
-
-	/*
-	 * Oldest LSN that the client has acked receipt for.  This is used as the
-	 * start_lsn point in case the client doesn't specify one, and also as a
-	 * safety measure to jump forwards in case the client specifies a
-	 * start_lsn that's further in the past than this value.
-	 */
-	XLogRecPtr	confirmed_flush;
-
-	/*
-	 * LSN at which we enabled two_phase commit for this slot or LSN at which
-	 * we found a consistent point at the time of slot creation.
-	 */
-	XLogRecPtr	two_phase_at;
-
-	/*
-	 * Allow decoding of prepared transactions?
-	 */
-	bool		two_phase;
-
-	/* plugin name */
-	NameData	plugin;
-} ReplicationSlotPersistentData;
-
 /*
  * Shared memory state of a single replication slot.
  *
diff --git a/src/include/replication/slot_common.h b/src/include/replication/slot_common.h
new file mode 100644
index 0000000000..ff4556ff22
--- /dev/null
+++ b/src/include/replication/slot_common.h
@@ -0,0 +1,147 @@
+/*-------------------------------------------------------------------------
+ * slot_common.h
+ *	   Replication slot management.
+ *
+ * Copyright (c) 2021, PostgreSQL Global Development Group
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SLOT_COMMON_H
+#define SLOT_COMMON_H
+
+/*
+ * Slots can be invalidated, e.g. due to max_slot_wal_keep_size. If so, the
+ * 'invalidated' field is set to a value other than _NONE.
+ */
+typedef enum ReplicationSlotInvalidationCause
+{
+	RS_INVAL_NONE,
+	/* required WAL has been removed */
+	RS_INVAL_WAL_REMOVED,
+	/* required rows have been removed */
+	RS_INVAL_HORIZON,
+	/* wal_level insufficient for slot */
+	RS_INVAL_WAL_LEVEL,
+} ReplicationSlotInvalidationCause;
+
+/*
+ * Behaviour of replication slots, upon release or crash.
+ *
+ * Slots marked as PERSISTENT are crash-safe and will not be dropped when
+ * released. Slots marked as EPHEMERAL will be dropped when released or after
+ * restarts.  Slots marked TEMPORARY will be dropped at the end of a session
+ * or on error.
+ *
+ * EPHEMERAL is used as a not-quite-ready state when creating persistent
+ * slots.  EPHEMERAL slots can be made PERSISTENT by calling
+ * ReplicationSlotPersist().  For a slot that goes away at the end of a
+ * session, TEMPORARY is the appropriate choice.
+ */
+typedef enum ReplicationSlotPersistency
+{
+	RS_PERSISTENT,
+	RS_EPHEMERAL,
+	RS_TEMPORARY
+} ReplicationSlotPersistency;
+
+/*
+ * On-Disk data of a replication slot, preserved across restarts.
+ */
+typedef struct ReplicationSlotPersistentData
+{
+	/* The slot's identifier */
+	NameData	name;
+
+	/* database the slot is active on */
+	Oid			database;
+
+	/*
+	 * The slot's behaviour when being dropped (or restored after a crash).
+	 */
+	ReplicationSlotPersistency persistency;
+
+	/*
+	 * xmin horizon for data
+	 *
+	 * NB: This may represent a value that hasn't been written to disk yet;
+	 * see notes for effective_xmin, below.
+	 */
+	TransactionId xmin;
+
+	/*
+	 * xmin horizon for catalog tuples
+	 *
+	 * NB: This may represent a value that hasn't been written to disk yet;
+	 * see notes for effective_xmin, below.
+	 */
+	TransactionId catalog_xmin;
+
+	/* oldest LSN that might be required by this replication slot */
+	XLogRecPtr	restart_lsn;
+
+	/* RS_INVAL_NONE if valid, or the reason for having been invalidated */
+	ReplicationSlotInvalidationCause invalidated;
+
+	/*
+	 * Oldest LSN that the client has acked receipt for.  This is used as the
+	 * start_lsn point in case the client doesn't specify one, and also as a
+	 * safety measure to jump forwards in case the client specifies a
+	 * start_lsn that's further in the past than this value.
+	 */
+	XLogRecPtr	confirmed_flush;
+
+	/*
+	 * LSN at which we enabled two_phase commit for this slot or LSN at which
+	 * we found a consistent point at the time of slot creation.
+	 */
+	XLogRecPtr	two_phase_at;
+
+	/*
+	 * Allow decoding of prepared transactions?
+	 */
+	bool		two_phase;
+
+	/* plugin name */
+	NameData	plugin;
+} ReplicationSlotPersistentData;
+
+/*
+ * Replication slot on-disk data structure.
+ */
+typedef struct ReplicationSlotOnDisk
+{
+	/* first part of this struct needs to be version independent */
+
+	/* data not covered by checksum */
+	uint32		magic;
+	pg_crc32c	checksum;
+
+	/* data covered by checksum */
+	uint32		version;
+	uint32		length;
+
+	/*
+	 * The actual data in the slot that follows can differ based on the above
+	 * 'version'.
+	 */
+
+	ReplicationSlotPersistentData slotdata;
+} ReplicationSlotOnDisk;
+
+/* size of version independent data */
+#define ReplicationSlotOnDiskConstantSize \
+	offsetof(ReplicationSlotOnDisk, slotdata)
+/* size of the part of the slot not covered by the checksum */
+#define ReplicationSlotOnDiskNotChecksummedSize \
+	offsetof(ReplicationSlotOnDisk, version)
+/* size of the part covered by the checksum */
+#define ReplicationSlotOnDiskChecksummedSize \
+	sizeof(ReplicationSlotOnDisk) - ReplicationSlotOnDiskNotChecksummedSize
+/* size of the slot data that is version dependent */
+#define ReplicationSlotOnDiskV2Size \
+	sizeof(ReplicationSlotOnDisk) - ReplicationSlotOnDiskConstantSize
+
+#define SLOT_MAGIC		0x1051CA1	/* format identifier */
+#define SLOT_VERSION	3		/* version for new files */
+
+#endif							/* SLOT_COMMON_H */
-- 
2.34.1

#16

Hayato Kuroda (Fujitsu)

kuroda.hayato@fujitsu.com

over 2 years ago

In reply to: vignesh C (#13)

1 attachment(s)

RE: persist logical slots to disk during shutdown checkpoint

Dear hackers,

I also tested for logical slots on the physical standby. PSA the script.
confirmed_flush_lsn for such slots were successfully persisted.

# Topology

In this test nodes are connected each other.

node1 --(physical replication)-->node2--(logical replication)-->node3

# Test method

An attached script did following steps

1. constructed above configurations
2. Inserted data on node1
3. read confirmed_flush_lsn on node2 (a)
4. restarted node2
5. read confirmed_flush_lsn again on node2 (b)
6. compare (a) and (b)

# result

Before patching, (a) and (b) were different value, which meant that logical
slots on physical standby were not saved at shutdown.

```
slot_name | confirmed_flush_lsn
-----------+---------------------
sub | 0/30003E8
(1 row)

waiting for server to shut down.... done
server stopped
waiting for server to start.... done
server started
slot_name | confirmed_flush_lsn
-----------+---------------------
sub | 0/30000D8
(1 row)
```

After patching, (a) and (b) became the same value. The v4 patch worked well even
if the node is physical standby.

```
slot_name | confirmed_flush_lsn
-----------+---------------------
sub | 0/30003E8
(1 row)

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

#17

amit.kapila16@gmail.com

over 2 years ago

In reply to: vignesh C (#13)

1 attachment(s)

Re: persist logical slots to disk during shutdown checkpoint

On Thu, Aug 24, 2023 at 11:44 AM vignesh C <vignesh21@gmail.com> wrote:

The patch looks mostly good to me. I have made minor changes which are
as follows: (a) removed the autovacuum =off and
wal_receiver_status_interval = 0 setting as those doesn't seem to be
required for the test; (b) changed a few comments and variable names
in the code and test;

Shall we change the test file name from always_persist to
save_logical_slots_shutdown and move to recovery/t/ as this test is
about verification after the restart of the server?

--
With Regards,
Amit Kapila.

Attachments:

v5-0001-Persist-to-disk-logical-slots-during-a-shutdown-c.patchapplication/octet-stream; name=v5-0001-Persist-to-disk-logical-slots-during-a-shutdown-c.patchDownload

From b4d7674ca163da5bcf1d2ddb793955a43d8c8c6c Mon Sep 17 00:00:00 2001
From: Julien Rouhaud <julien.rouhaud@free.fr>
Date: Fri, 14 Apr 2023 13:49:09 +0800
Subject: [PATCH v6] Persist to disk logical slots during a shutdown checkpoint
 if the updated confirmed_flush_lsn has not yet been persisted.

It's entirely possible for a logical slot to have a confirmed_flush_lsn higher
than the last value saved on disk while not being marked as dirty.  It's
currently not a problem to lose that value during a clean shutdown / restart
cycle, but a later patch adding support for pg_upgrade of publications and
logical slots will rely on that value being properly persisted to disk.

Author: Julien Rouhaud
Reviewed-by: Wang Wei, Peter Smith, Masahiko Sawada
---
 src/backend/access/transam/xlog.c             |   2 +-
 src/backend/replication/slot.c                |  29 +++--
 src/include/replication/slot.h                |  13 ++-
 src/test/subscription/meson.build             |   1 +
 src/test/subscription/t/034_always_persist.pl | 100 ++++++++++++++++++
 5 files changed, 132 insertions(+), 13 deletions(-)
 create mode 100644 src/test/subscription/t/034_always_persist.pl

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 60c0b7ec3a..6dced61cf4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7026,7 +7026,7 @@ static void
 CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 {
 	CheckPointRelationMap();
-	CheckPointReplicationSlots();
+	CheckPointReplicationSlots(flags & CHECKPOINT_IS_SHUTDOWN);
 	CheckPointSnapBuild();
 	CheckPointLogicalRewriteHeap();
 	CheckPointReplicationOrigin();
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 567d61540a..f732d85158 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -70,7 +70,8 @@ static void ReplicationSlotDropPtr(ReplicationSlot *slot);
 /* internal persistency functions */
 static void RestoreSlotFromDisk(const char *name);
 static void CreateSlotOnDisk(ReplicationSlot *slot);
-static void SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel);
+static void SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel,
+						   bool is_shutdown);
 
 /*
  * Report shared-memory space needed by ReplicationSlotsShmemInit.
@@ -282,6 +283,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->candidate_xmin_lsn = InvalidXLogRecPtr;
 	slot->candidate_restart_valid = InvalidXLogRecPtr;
 	slot->candidate_restart_lsn = InvalidXLogRecPtr;
+	slot->last_saved_confirmed_flush = InvalidXLogRecPtr;
 
 	/*
 	 * Create the slot on disk.  We haven't actually marked the slot allocated
@@ -744,7 +746,7 @@ ReplicationSlotSave(void)
 	Assert(MyReplicationSlot != NULL);
 
 	sprintf(path, "pg_replslot/%s", NameStr(MyReplicationSlot->data.name));
-	SaveSlotToPath(MyReplicationSlot, path, ERROR);
+	SaveSlotToPath(MyReplicationSlot, path, ERROR, false);
 }
 
 /*
@@ -1533,11 +1535,10 @@ restart:
 /*
  * Flush all replication slots to disk.
  *
- * This needn't actually be part of a checkpoint, but it's a convenient
- * location.
+ * is_shutdown is true in case of a shutdown checkpoint.
  */
 void
-CheckPointReplicationSlots(void)
+CheckPointReplicationSlots(bool is_shutdown)
 {
 	int			i;
 
@@ -1562,7 +1563,7 @@ CheckPointReplicationSlots(void)
 
 		/* save the slot to disk, locking is handled in SaveSlotToPath() */
 		sprintf(path, "pg_replslot/%s", NameStr(s->data.name));
-		SaveSlotToPath(s, path, LOG);
+		SaveSlotToPath(s, path, LOG, is_shutdown);
 	}
 	LWLockRelease(ReplicationSlotAllocationLock);
 }
@@ -1668,7 +1669,7 @@ CreateSlotOnDisk(ReplicationSlot *slot)
 
 	/* Write the actual state file. */
 	slot->dirty = true;			/* signal that we really need to write */
-	SaveSlotToPath(slot, tmppath, ERROR);
+	SaveSlotToPath(slot, tmppath, ERROR, false);
 
 	/* Rename the directory into place. */
 	if (rename(tmppath, path) != 0)
@@ -1694,22 +1695,26 @@ CreateSlotOnDisk(ReplicationSlot *slot)
  * Shared functionality between saving and creating a replication slot.
  */
 static void
-SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
+SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel,
+			   bool is_shutdown)
 {
 	char		tmppath[MAXPGPATH];
 	char		path[MAXPGPATH];
 	int			fd;
 	ReplicationSlotOnDisk cp;
 	bool		was_dirty;
+	bool		flush_lsn_changed;
 
 	/* first check whether there's something to write out */
 	SpinLockAcquire(&slot->mutex);
 	was_dirty = slot->dirty;
 	slot->just_dirtied = false;
+	flush_lsn_changed = (slot->data.confirmed_flush != slot->last_saved_confirmed_flush);
 	SpinLockRelease(&slot->mutex);
 
-	/* and don't do anything if there's nothing to write */
-	if (!was_dirty)
+	/* Don't do anything if there's nothing to write. See ReplicationSlot. */
+	if (!was_dirty &&
+		!(is_shutdown && SlotIsLogical(slot) && flush_lsn_changed))
 		return;
 
 	LWLockAcquire(&slot->io_in_progress_lock, LW_EXCLUSIVE);
@@ -1834,11 +1839,12 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 
 	/*
 	 * Successfully wrote, unset dirty bit, unless somebody dirtied again
-	 * already.
+	 * already and remember the confirmed_flush LSN value.
 	 */
 	SpinLockAcquire(&slot->mutex);
 	if (!slot->just_dirtied)
 		slot->dirty = false;
+	slot->last_saved_confirmed_flush = slot->data.confirmed_flush;
 	SpinLockRelease(&slot->mutex);
 
 	LWLockRelease(&slot->io_in_progress_lock);
@@ -2035,6 +2041,7 @@ RestoreSlotFromDisk(const char *name)
 		/* initialize in memory state */
 		slot->effective_xmin = cp.slotdata.xmin;
 		slot->effective_catalog_xmin = cp.slotdata.catalog_xmin;
+		slot->last_saved_confirmed_flush =  cp.slotdata.confirmed_flush;
 
 		slot->candidate_catalog_xmin = InvalidTransactionId;
 		slot->candidate_xmin_lsn = InvalidXLogRecPtr;
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index a7d16a37a3..9146d6ede8 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -83,6 +83,17 @@ typedef struct ReplicationSlot
 	XLogRecPtr	candidate_xmin_lsn;
 	XLogRecPtr	candidate_restart_valid;
 	XLogRecPtr	candidate_restart_lsn;
+
+	/*
+	 * We won't ensure that the slot is persisted after the confirmed_flush LSN
+	 * is updated as that could lead to frequent writes.  However, we need to
+	 * ensure that we do persist the slots at the time of shutdown whose
+	 * confirmed_flush LSN is changed since we last saved the slot to disk.
+	 * This will help in avoiding retreat of the confirmed_flush LSN after
+	 * restart.  This variable is used to track the last saved confirmed_flush
+	 * LSN value.
+	 */
+	XLogRecPtr	last_saved_confirmed_flush;
 } ReplicationSlot;
 
 #define SlotIsPhysical(slot) ((slot)->data.database == InvalidOid)
@@ -146,7 +157,7 @@ extern void ReplicationSlotNameForTablesync(Oid suboid, Oid relid, char *syncslo
 extern void ReplicationSlotDropAtPubNode(WalReceiverConn *wrconn, char *slotname, bool missing_ok);
 
 extern void StartupReplicationSlots(void);
-extern void CheckPointReplicationSlots(void);
+extern void CheckPointReplicationSlots(bool is_shutdown);
 
 extern void CheckSlotRequirements(void);
 extern void CheckSlotPermissions(void);
diff --git a/src/test/subscription/meson.build b/src/test/subscription/meson.build
index bd673a9d68..cdd2f8ba47 100644
--- a/src/test/subscription/meson.build
+++ b/src/test/subscription/meson.build
@@ -40,6 +40,7 @@ tests += {
       't/031_column_list.pl',
       't/032_subscribe_use_index.pl',
       't/033_run_as_table_owner.pl',
+      't/034_always_persist.pl',
       't/100_bugs.pl',
     ],
   },
diff --git a/src/test/subscription/t/034_always_persist.pl b/src/test/subscription/t/034_always_persist.pl
new file mode 100644
index 0000000000..7879c2d985
--- /dev/null
+++ b/src/test/subscription/t/034_always_persist.pl
@@ -0,0 +1,100 @@
+
+# Copyright (c) 2023, PostgreSQL Global Development Group
+
+# Test logical replication slots are always persisted to disk during a shutdown
+# checkpoint.
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+sub compare_confirmed_flush
+{
+	my ($node, $confirmed_flush_from_log) = @_;
+
+	# Fetch Latest checkpoint location from the control file.
+	my ($stdout, $stderr) = run_command([ 'pg_controldata', $node->data_dir ]);
+	my @control_data = split("\n", $stdout);
+	my $latest_checkpoint = undef;
+	foreach (@control_data)
+	{
+		if ($_ =~ /^Latest checkpoint location:\s*(.*)$/mg)
+		{
+			$latest_checkpoint = $1;
+			last;
+		}
+	}
+	die "Latest checkpoint location not found in control file\n"
+	  unless defined($latest_checkpoint);
+
+	# Is it same as the value read from log?
+	ok($latest_checkpoint eq $confirmed_flush_from_log,
+		"Check that the slot's confirmed_flush LSN is the same as the latest_checkpoint location");
+
+	return;
+}
+
+# Initialize publisher node
+my $node_publisher = PostgreSQL::Test::Cluster->new('pub');
+$node_publisher->init(allows_streaming => 'logical');
+# Avoid checkpoint during the test, otherwise, the latest checkpoint location
+# change.
+$node_publisher->append_conf('postgresql.conf', q{
+checkpoint_timeout = 1h
+});
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = PostgreSQL::Test::Cluster->new('sub');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create table
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tbl (id int)");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tbl (id int)");
+
+# Insert some data
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tbl VALUES (generate_series(1, 5));"
+);
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION pub FOR ALL TABLES");
+$node_subscriber->safe_psql('postgres',
+	"CREATE SUBSCRIPTION sub CONNECTION '$publisher_connstr' PUBLICATION pub"
+);
+
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'sub');
+
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM test_tbl"
+);
+
+is($result, qq(5), "check initial copy was done");
+
+my $offset = -s $node_publisher->logfile;
+
+# Restart the publisher to ensure that the slot will be persisted if required.
+$node_publisher->restart();
+
+# Wait until the walsender creates decoding context
+$node_publisher->wait_for_log(
+	qr/Streaming transactions committing after ([A-F0-9]+\/[A-F0-9]+), reading WAL from ([A-F0-9]+\/[A-F0-9]+)./,
+	$offset
+);
+
+# Extract confirmed_flush from the logfile
+my $log_contents = slurp_file($node_publisher->logfile, $offset);
+$log_contents =~
+	qr/Streaming transactions committing after ([A-F0-9]+\/[A-F0-9]+), reading WAL from ([A-F0-9]+\/[A-F0-9]+)./
+	or die "could not get confirmed_flush_lsn";
+
+# Ensure that the slot's confirmed_flush LSN is the same as the
+# latest_checkpoint location.
+compare_confirmed_flush($node_publisher, $1);
+
+done_testing();
-- 
2.28.0.windows.1

#18

vignesh21@gmail.com

over 2 years ago

In reply to: Amit Kapila (#17)

1 attachment(s)

Re: persist logical slots to disk during shutdown checkpoint

On Mon, 28 Aug 2023 at 18:56, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Aug 24, 2023 at 11:44 AM vignesh C <vignesh21@gmail.com> wrote:

The patch looks mostly good to me. I have made minor changes which are
as follows: (a) removed the autovacuum =off and
wal_receiver_status_interval = 0 setting as those doesn't seem to be
required for the test; (b) changed a few comments and variable names
in the code and test;

Shall we change the test file name from always_persist to
save_logical_slots_shutdown and move to recovery/t/ as this test is
about verification after the restart of the server?

That makes sense. The attached v6 version has the changes for the
same, apart from this I have also fixed a) pgindent issues b) perltidy
issues c) one variable change (flush_lsn_changed to
confirmed_flush_has_changed) d) corrected few comments in the test
file. Thanks to Peter for providing few offline comments.

Regards,
Vignesh

Attachments:

v6-0001-Persist-logical-slots-to-disk-during-a-shutdown-c.patchtext/x-patch; charset=US-ASCII; name=v6-0001-Persist-logical-slots-to-disk-during-a-shutdown-c.patchDownload

From 7cb9d7c874397cacbb4fdd6c411f1d93570265f8 Mon Sep 17 00:00:00 2001
From: Julien Rouhaud <julien.rouhaud@free.fr>
Date: Fri, 14 Apr 2023 13:49:09 +0800
Subject: [PATCH v6] Persist logical slots to disk during a shutdown checkpoint
 if required.

It's entirely possible for a logical slot to have a confirmed_flush LSN
higher than the last value saved on disk while not being marked as dirty.
Currently, it is not a major problem but a later patch adding support for
the upgrade of slots relies on that value being properly persisted to disk.

It can also help with avoiding processing the same transactions again in
some boundary cases after the clean shutdown and restart. Say, we process
some transactions for which we didn't send anything downstream (the
changes got filtered) but the confirm_flush LSN is updated due to
keepalives. As we don't flush the latest value of confirm_flush LSN,
it may lead to processing the same changes again.

Author: Julien Rouhaud, Vignesh C, Kuroda Hayato based on suggestions by
Ashutosh Bapat
Reviewed-by: Amit Kapila, Peter Smith
Discussion: http://postgr.es/m/CAA4eK1JzJagMmb_E8D4au=GYQkxox0AfNBm1FbP7sy7t4YWXPQ@mail.gmail.com
Discussion: http://postgr.es/m/TYAPR01MB58664C81887B3AF2EB6B16E3F5939@TYAPR01MB5866.jpnprd01.prod.outlook.com
---
 src/backend/access/transam/xlog.c             |   2 +-
 src/backend/replication/slot.c                |  29 +++--
 src/include/replication/slot.h                |  13 ++-
 src/test/recovery/meson.build                 |   1 +
 .../t/038_save_logical_slots_shutdown.pl      | 101 ++++++++++++++++++
 5 files changed, 133 insertions(+), 13 deletions(-)
 create mode 100644 src/test/recovery/t/038_save_logical_slots_shutdown.pl

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f6f8adc72a..f26c8d18a6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7039,7 +7039,7 @@ static void
 CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 {
 	CheckPointRelationMap();
-	CheckPointReplicationSlots();
+	CheckPointReplicationSlots(flags & CHECKPOINT_IS_SHUTDOWN);
 	CheckPointSnapBuild();
 	CheckPointLogicalRewriteHeap();
 	CheckPointReplicationOrigin();
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index bb09c4010f..c075f76317 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -109,7 +109,8 @@ static void ReplicationSlotDropPtr(ReplicationSlot *slot);
 /* internal persistency functions */
 static void RestoreSlotFromDisk(const char *name);
 static void CreateSlotOnDisk(ReplicationSlot *slot);
-static void SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel);
+static void SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel,
+						   bool is_shutdown);
 
 /*
  * Report shared-memory space needed by ReplicationSlotsShmemInit.
@@ -321,6 +322,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->candidate_xmin_lsn = InvalidXLogRecPtr;
 	slot->candidate_restart_valid = InvalidXLogRecPtr;
 	slot->candidate_restart_lsn = InvalidXLogRecPtr;
+	slot->last_saved_confirmed_flush = InvalidXLogRecPtr;
 
 	/*
 	 * Create the slot on disk.  We haven't actually marked the slot allocated
@@ -783,7 +785,7 @@ ReplicationSlotSave(void)
 	Assert(MyReplicationSlot != NULL);
 
 	sprintf(path, "pg_replslot/%s", NameStr(MyReplicationSlot->data.name));
-	SaveSlotToPath(MyReplicationSlot, path, ERROR);
+	SaveSlotToPath(MyReplicationSlot, path, ERROR, false);
 }
 
 /*
@@ -1572,11 +1574,10 @@ restart:
 /*
  * Flush all replication slots to disk.
  *
- * This needn't actually be part of a checkpoint, but it's a convenient
- * location.
+ * is_shutdown is true in case of a shutdown checkpoint.
  */
 void
-CheckPointReplicationSlots(void)
+CheckPointReplicationSlots(bool is_shutdown)
 {
 	int			i;
 
@@ -1601,7 +1602,7 @@ CheckPointReplicationSlots(void)
 
 		/* save the slot to disk, locking is handled in SaveSlotToPath() */
 		sprintf(path, "pg_replslot/%s", NameStr(s->data.name));
-		SaveSlotToPath(s, path, LOG);
+		SaveSlotToPath(s, path, LOG, is_shutdown);
 	}
 	LWLockRelease(ReplicationSlotAllocationLock);
 }
@@ -1707,7 +1708,7 @@ CreateSlotOnDisk(ReplicationSlot *slot)
 
 	/* Write the actual state file. */
 	slot->dirty = true;			/* signal that we really need to write */
-	SaveSlotToPath(slot, tmppath, ERROR);
+	SaveSlotToPath(slot, tmppath, ERROR, false);
 
 	/* Rename the directory into place. */
 	if (rename(tmppath, path) != 0)
@@ -1733,22 +1734,26 @@ CreateSlotOnDisk(ReplicationSlot *slot)
  * Shared functionality between saving and creating a replication slot.
  */
 static void
-SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
+SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel,
+			   bool is_shutdown)
 {
 	char		tmppath[MAXPGPATH];
 	char		path[MAXPGPATH];
 	int			fd;
 	ReplicationSlotOnDisk cp;
 	bool		was_dirty;
+	bool		confirmed_flush_has_changed;
 
 	/* first check whether there's something to write out */
 	SpinLockAcquire(&slot->mutex);
 	was_dirty = slot->dirty;
 	slot->just_dirtied = false;
+	confirmed_flush_has_changed = (slot->data.confirmed_flush != slot->last_saved_confirmed_flush);
 	SpinLockRelease(&slot->mutex);
 
-	/* and don't do anything if there's nothing to write */
-	if (!was_dirty)
+	/* Don't do anything if there's nothing to write. See ReplicationSlot. */
+	if (!was_dirty &&
+		!(is_shutdown && SlotIsLogical(slot) && confirmed_flush_has_changed))
 		return;
 
 	LWLockAcquire(&slot->io_in_progress_lock, LW_EXCLUSIVE);
@@ -1873,11 +1878,12 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 
 	/*
 	 * Successfully wrote, unset dirty bit, unless somebody dirtied again
-	 * already.
+	 * already and remember the confirmed_flush LSN value.
 	 */
 	SpinLockAcquire(&slot->mutex);
 	if (!slot->just_dirtied)
 		slot->dirty = false;
+	slot->last_saved_confirmed_flush = slot->data.confirmed_flush;
 	SpinLockRelease(&slot->mutex);
 
 	LWLockRelease(&slot->io_in_progress_lock);
@@ -2074,6 +2080,7 @@ RestoreSlotFromDisk(const char *name)
 		/* initialize in memory state */
 		slot->effective_xmin = cp.slotdata.xmin;
 		slot->effective_catalog_xmin = cp.slotdata.catalog_xmin;
+		slot->last_saved_confirmed_flush = cp.slotdata.confirmed_flush;
 
 		slot->candidate_catalog_xmin = InvalidTransactionId;
 		slot->candidate_xmin_lsn = InvalidXLogRecPtr;
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index a8a89dc784..448fb8cf51 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -178,6 +178,17 @@ typedef struct ReplicationSlot
 	XLogRecPtr	candidate_xmin_lsn;
 	XLogRecPtr	candidate_restart_valid;
 	XLogRecPtr	candidate_restart_lsn;
+
+	/*
+	 * We won't ensure that the slot is persisted after the confirmed_flush
+	 * LSN is updated as that could lead to frequent writes.  However, we need
+	 * to ensure that we do persist the slots at the time of shutdown whose
+	 * confirmed_flush LSN is changed since we last saved the slot to disk.
+	 * This will help in avoiding retreat of the confirmed_flush LSN after
+	 * restart.  This variable is used to track the last saved confirmed_flush
+	 * LSN value.
+	 */
+	XLogRecPtr	last_saved_confirmed_flush;
 } ReplicationSlot;
 
 #define SlotIsPhysical(slot) ((slot)->data.database == InvalidOid)
@@ -241,7 +252,7 @@ extern void ReplicationSlotNameForTablesync(Oid suboid, Oid relid, char *syncslo
 extern void ReplicationSlotDropAtPubNode(WalReceiverConn *wrconn, char *slotname, bool missing_ok);
 
 extern void StartupReplicationSlots(void);
-extern void CheckPointReplicationSlots(void);
+extern void CheckPointReplicationSlots(bool is_shutdown);
 
 extern void CheckSlotRequirements(void);
 extern void CheckSlotPermissions(void);
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index e7328e4894..646d6ffde4 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -43,6 +43,7 @@ tests += {
       't/035_standby_logical_decoding.pl',
       't/036_truncated_dropped.pl',
       't/037_invalid_database.pl',
+      't/038_save_logical_slots_shutdown.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/038_save_logical_slots_shutdown.pl b/src/test/recovery/t/038_save_logical_slots_shutdown.pl
new file mode 100644
index 0000000000..6e114e9b29
--- /dev/null
+++ b/src/test/recovery/t/038_save_logical_slots_shutdown.pl
@@ -0,0 +1,101 @@
+
+# Copyright (c) 2023, PostgreSQL Global Development Group
+
+# Test logical replication slots are always persisted to disk during a shutdown
+# checkpoint.
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+sub compare_confirmed_flush
+{
+	my ($node, $confirmed_flush_from_log) = @_;
+
+	# Fetch Latest checkpoint location from the control file
+	my ($stdout, $stderr) =
+	  run_command([ 'pg_controldata', $node->data_dir ]);
+	my @control_data      = split("\n", $stdout);
+	my $latest_checkpoint = undef;
+	foreach (@control_data)
+	{
+		if ($_ =~ /^Latest checkpoint location:\s*(.*)$/mg)
+		{
+			$latest_checkpoint = $1;
+			last;
+		}
+	}
+	die "Latest checkpoint location not found in control file\n"
+	  unless defined($latest_checkpoint);
+
+	# Is it same as the value read from log?
+	ok( $latest_checkpoint eq $confirmed_flush_from_log,
+		"Check that the slot's confirmed_flush LSN is the same as the latest_checkpoint location"
+	);
+
+	return;
+}
+
+# Initialize publisher node
+my $node_publisher = PostgreSQL::Test::Cluster->new('pub');
+$node_publisher->init(allows_streaming => 'logical');
+# Avoid checkpoint during the test, otherwise, the latest checkpoint location
+# will change.
+$node_publisher->append_conf(
+	'postgresql.conf', q{
+checkpoint_timeout = 1h
+});
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = PostgreSQL::Test::Cluster->new('sub');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create tables
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tbl (id int)");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tbl (id int)");
+
+# Insert some data
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tbl VALUES (generate_series(1, 5));");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION pub FOR ALL TABLES");
+$node_subscriber->safe_psql('postgres',
+	"CREATE SUBSCRIPTION sub CONNECTION '$publisher_connstr' PUBLICATION pub"
+);
+
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'sub');
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tbl");
+
+is($result, qq(5), "check initial copy was done");
+
+my $offset = -s $node_publisher->logfile;
+
+# Restart the publisher to ensure that the slot will be persisted if required
+$node_publisher->restart();
+
+# Wait until the walsender creates decoding context
+$node_publisher->wait_for_log(
+	qr/Streaming transactions committing after ([A-F0-9]+\/[A-F0-9]+), reading WAL from ([A-F0-9]+\/[A-F0-9]+)./,
+	$offset);
+
+# Extract confirmed_flush from the logfile
+my $log_contents = slurp_file($node_publisher->logfile, $offset);
+$log_contents =~
+  qr/Streaming transactions committing after ([A-F0-9]+\/[A-F0-9]+), reading WAL from ([A-F0-9]+\/[A-F0-9]+)./
+  or die "could not get confirmed_flush_lsn";
+
+# Ensure that the slot's confirmed_flush LSN is the same as the
+# latest_checkpoint location.
+compare_confirmed_flush($node_publisher, $1);
+
+done_testing();
-- 
2.34.1

#19

amit.kapila16@gmail.com

over 2 years ago

In reply to: vignesh C (#18)

Re: persist logical slots to disk during shutdown checkpoint

On Tue, Aug 29, 2023 at 10:16 AM vignesh C <vignesh21@gmail.com> wrote:

That makes sense. The attached v6 version has the changes for the
same, apart from this I have also fixed a) pgindent issues b) perltidy
issues c) one variable change (flush_lsn_changed to
confirmed_flush_has_changed) d) corrected few comments in the test
file. Thanks to Peter for providing few offline comments.

The latest version looks good to me. Julien, Ashutosh, and others,
unless you have more comments or suggestions, I would like to push
this in a day or two.

--
With Regards,
Amit Kapila.

#20

ashutosh.bapat.oss@gmail.com

over 2 years ago

In reply to: Amit Kapila (#19)

Re: persist logical slots to disk during shutdown checkpoint

On Tue, Aug 29, 2023 at 2:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Aug 29, 2023 at 10:16 AM vignesh C <vignesh21@gmail.com> wrote:

That makes sense. The attached v6 version has the changes for the
same, apart from this I have also fixed a) pgindent issues b) perltidy
issues c) one variable change (flush_lsn_changed to
confirmed_flush_has_changed) d) corrected few comments in the test
file. Thanks to Peter for providing few offline comments.

The latest version looks good to me. Julien, Ashutosh, and others,
unless you have more comments or suggestions, I would like to push
this in a day or two.

I am looking at it. If you can wait till the end of the week, that
will be great.

--
Best Wishes,
Ashutosh Bapat

#21

Julien Rouhaud

rjuju123@gmail.com

over 2 years ago

In reply to: Amit Kapila (#19)

Re: persist logical slots to disk during shutdown checkpoint

Hi,

On Tue, Aug 29, 2023 at 02:21:15PM +0530, Amit Kapila wrote:

On Tue, Aug 29, 2023 at 10:16 AM vignesh C <vignesh21@gmail.com> wrote:

That makes sense. The attached v6 version has the changes for the
same, apart from this I have also fixed a) pgindent issues b) perltidy
issues c) one variable change (flush_lsn_changed to
confirmed_flush_has_changed) d) corrected few comments in the test
file. Thanks to Peter for providing few offline comments.

The latest version looks good to me. Julien, Ashutosh, and others,
unless you have more comments or suggestions, I would like to push
this in a day or two.

Unfortunately I'm currently swamped with some internal escalations so I
couldn't keep up closely with the latest activity here.

I think I recall that you wanted to
change the timing at which logical slots are shutdown, I'm assuming that this
change won't lead to always have a difference between the LSN and latest
persisted LSN being different? Otherwise saving the latest persisted LSN to
try to avoid persisting again all logical slots on shutdown seems reasonable to
me.

#22

amit.kapila16@gmail.com

over 2 years ago

In reply to: Julien Rouhaud (#21)

Re: persist logical slots to disk during shutdown checkpoint

On Wed, Aug 30, 2023 at 9:03 AM Julien Rouhaud <rjuju123@gmail.com> wrote:

On Tue, Aug 29, 2023 at 02:21:15PM +0530, Amit Kapila wrote:

On Tue, Aug 29, 2023 at 10:16 AM vignesh C <vignesh21@gmail.com> wrote:

That makes sense. The attached v6 version has the changes for the
same, apart from this I have also fixed a) pgindent issues b) perltidy
issues c) one variable change (flush_lsn_changed to
confirmed_flush_has_changed) d) corrected few comments in the test
file. Thanks to Peter for providing few offline comments.

The latest version looks good to me. Julien, Ashutosh, and others,
unless you have more comments or suggestions, I would like to push
this in a day or two.

Unfortunately I'm currently swamped with some internal escalations so I
couldn't keep up closely with the latest activity here.

I think I recall that you wanted to
change the timing at which logical slots are shutdown, I'm assuming that this
change won't lead to always have a difference between the LSN and latest
persisted LSN being different?

I think here by LSN you are referring to confirmed_flush LSN. If so,
this doesn't create any new difference between the values for the
confirmed_flush LSN in memory and in disk. We just remember the last
persisted value to avoid writes of slots at shutdown time.

Otherwise saving the latest persisted LSN to
try to avoid persisting again all logical slots on shutdown seems reasonable to
me.

Thanks for responding.

--
With Regards,
Amit Kapila.

#23

ashutosh.bapat.oss@gmail.com

over 2 years ago

In reply to: Ashutosh Bapat (#20)

Re: persist logical slots to disk during shutdown checkpoint

On Tue, Aug 29, 2023 at 5:40 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

I am looking at it. If you can wait till the end of the week, that
will be great.

     /*
      * Successfully wrote, unset dirty bit, unless somebody dirtied again
-     * already.
+     * already and remember the confirmed_flush LSN value.
      */
     SpinLockAcquire(&slot->mutex);
     if (!slot->just_dirtied)
         slot->dirty = false;
+    slot->last_saved_confirmed_flush = slot->data.confirmed_flush;

If the slot->data.confirmed_flush gets updated concurrently between copying it
to be written to the disk and when it's written to last_saved_confirmed_flush,
we will miss one update. I think we need to update last_saved_confirmed_flush
based on the cp.slotdata.confirmed_flush rather than
slot->data.confirmed_flush.

We are setting last_saved_confirmed_flush for all types of slots but using it
only when the slot is logical. Should we also set it only for logical slots?

/* first check whether there's something to write out */
SpinLockAcquire(&slot->mutex);
was_dirty = slot->dirty;
slot->just_dirtied = false;
+ confirmed_flush_has_changed = (slot->data.confirmed_flush !=
slot->last_saved_confirmed_flush);

The confirmed_flush LSN should always move forward, otherwise there may not be
enough WAL retained for the slot to work. I am wondering whether we should take
an opportunity to make sure
Assert(slot->data.confirmed_flush <= slot->last_saved_confirmed_flush)

-    /* and don't do anything if there's nothing to write */
-    if (!was_dirty)
+    /* Don't do anything if there's nothing to write. See ReplicationSlot. */
+    if (!was_dirty &&
+        !(is_shutdown && SlotIsLogical(slot) && confirmed_flush_has_changed))

Rather than complicating this condition, I wonder whether it's better to just
set was_dirty = true when is_shutdown && SlotIsLogical(slot) &&
confirmed_flush_has_changed) or even slot->dirty = true. See also the note at
the end of the email.

+
+    /*
+     * We won't ensure that the slot is persisted after the confirmed_flush
+     * LSN is updated as that could lead to frequent writes.  However, we need
+     * to ensure that we do persist the slots at the time of shutdown whose
+     * confirmed_flush LSN is changed since we last saved the slot to disk.
+     * This will help in avoiding retreat of the confirmed_flush LSN after
+     * restart.  This variable is used to track the last saved confirmed_flush
+     * LSN value.
+     */

This comment makes more sense in SaveSlotToPath() than here. We may decide to
use last_saved_confirmed_flush for something else in future.
+
+sub compare_confirmed_flush
+{
+    my ($node, $confirmed_flush_from_log) = @_;
+
+    # Fetch Latest checkpoint location from the control file
+    my ($stdout, $stderr) =
+      run_command([ 'pg_controldata', $node->data_dir ]);
+    my @control_data      = split("\n", $stdout);
+    my $latest_checkpoint = undef;
+    foreach (@control_data)
+    {
+        if ($_ =~ /^Latest checkpoint location:\s*(.*)$/mg)
+        {
+            $latest_checkpoint = $1;
+            last;
+        }
+    }
+    die "Latest checkpoint location not found in control file\n"
+      unless defined($latest_checkpoint);
+
+    # Is it same as the value read from log?
+    ok( $latest_checkpoint eq $confirmed_flush_from_log,
+        "Check that the slot's confirmed_flush LSN is the same as the
latest_checkpoint location"
+    );

This function assumes that the subscriber will receive and confirm WAL upto
checkpoint location and publisher's WAL sender will update it in the slot.
Where is the code to ensure that? Does the WAL sender process wait for
checkpoint
LSN to be confirmed when shutting down?
+
+# Restart the publisher to ensure that the slot will be persisted if required
+$node_publisher->restart();

Can we add this test comparing LSNs after publisher restart, to an existing
test itself - like basic replication. That's the only extra thing that this
test does beyond usual replication stuff.

+
+# Wait until the walsender creates decoding context
+$node_publisher->wait_for_log(
+    qr/Streaming transactions committing after
([A-F0-9]+\/[A-F0-9]+), reading WAL from ([A-F0-9]+\/[A-F0-9]+)./,
+    $offset);
+
+# Extract confirmed_flush from the logfile
+my $log_contents = slurp_file($node_publisher->logfile, $offset);
+$log_contents =~
+  qr/Streaming transactions committing after ([A-F0-9]+\/[A-F0-9]+),
reading WAL from ([A-F0-9]+\/[A-F0-9]+)./
+  or die "could not get confirmed_flush_lsn";

Why are we figuring out the LSN from the log file? Is it not available from
pg_replication_slots view? If we do so, I think this test will fail is the slot
gets written after the restart because of concurrent activity on the publisher
(like autovacuum, or other things that cause empty transaction to be
replicated) and subscriber. A very rare chance but not 0 probability one. I
think we should shut down subscriber, restart publisher and then make this
check based on the contents of the replication slot instead of server log.
Shutting down subscriber will ensure that the subscriber won't send any new
confirmed flush location to the publisher after restart.

All the places which call ReplicationSlotSave() mark the slot as dirty. All
the places where SaveSlotToPath() is called, the slot is marked dirty except
when calling from CheckPointReplicationSlots(). So I am wondering whether we
should be marking the slot dirty in CheckPointReplicationSlots() and avoid
passing down is_shutdown flag to SaveSlotToPath().

Unrelated to this patch, I noticed that all the callers of SaveSlotToPath()
have the same code to craft replication slot file path. I wonder if that needs
to be macro'ised or added to some common function or to be pushed into
SaveSlotToPath() itself to make sure that any changes to the path in future are
consistent for all callers of SaveSlotToPath(). Interestingly slot->data.name
is accessed without a lock here. Name of the slot does not change after
creation so this isn't a problem right now. But generally against the principle
of accessing data protected by a mutex.

--
Best Wishes,
Ashutosh Bapat

#24

amit.kapila16@gmail.com

over 2 years ago

In reply to: Ashutosh Bapat (#23)

Re: persist logical slots to disk during shutdown checkpoint

On Wed, Aug 30, 2023 at 6:33 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

On Tue, Aug 29, 2023 at 5:40 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

I am looking at it. If you can wait till the end of the week, that
will be great.
/*
* Successfully wrote, unset dirty bit, unless somebody dirtied again
-     * already.
+     * already and remember the confirmed_flush LSN value.
*/
SpinLockAcquire(&slot->mutex);
if (!slot->just_dirtied)
slot->dirty = false;
+    slot->last_saved_confirmed_flush = slot->data.confirmed_flush;
If the slot->data.confirmed_flush gets updated concurrently between copying it
to be written to the disk and when it's written to last_saved_confirmed_flush,
we will miss one update. I think we need to update last_saved_confirmed_flush
based on the cp.slotdata.confirmed_flush rather than
slot->data.confirmed_flush.

Yeah, this appears to be a problem.

We are setting last_saved_confirmed_flush for all types of slots but using it
only when the slot is logical. Should we also set it only for logical slots?

We can do that but not sure if there is any advantage of it other than
adding extra condition. BTW, won't even confirmed_flush LSN be used
only for logical slots?

/* first check whether there's something to write out */
SpinLockAcquire(&slot->mutex);
was_dirty = slot->dirty;
slot->just_dirtied = false;
+ confirmed_flush_has_changed = (slot->data.confirmed_flush !=
slot->last_saved_confirmed_flush);

The confirmed_flush LSN should always move forward, otherwise there may not be
enough WAL retained for the slot to work. I am wondering whether we should take
an opportunity to make sure
Assert(slot->data.confirmed_flush <= slot->last_saved_confirmed_flush)

Theoretically, what you are saying makes sense to me but we don't have
such a protection while updating the confirmed_flush LSN. It would be
better to first add such a protection for confirmed_flush LSN update
as a separate patch.

-    /* and don't do anything if there's nothing to write */
-    if (!was_dirty)
+    /* Don't do anything if there's nothing to write. See ReplicationSlot. */
+    if (!was_dirty &&
+        !(is_shutdown && SlotIsLogical(slot) && confirmed_flush_has_changed))
Rather than complicating this condition, I wonder whether it's better to just
set was_dirty = true when is_shutdown && SlotIsLogical(slot) &&
confirmed_flush_has_changed) or even slot->dirty = true.

I think it is better to keep the slot's dirty property separate. But
we can introduce another variable that can be the result of both
was_dirty and other checks together, however, that doesn't seem much
better than the current code.

+
+    /*
+     * We won't ensure that the slot is persisted after the confirmed_flush
+     * LSN is updated as that could lead to frequent writes.  However, we need
+     * to ensure that we do persist the slots at the time of shutdown whose
+     * confirmed_flush LSN is changed since we last saved the slot to disk.
+     * This will help in avoiding retreat of the confirmed_flush LSN after
+     * restart.  This variable is used to track the last saved confirmed_flush
+     * LSN value.
+     */

This comment makes more sense in SaveSlotToPath() than here. We may decide to
use last_saved_confirmed_flush for something else in future.

I have kept it here because it contains some information that is not
specific to SaveSlotToPath. So, it seems easier to follow the whole
theory if we keep it at the central place in the structure and then
add the reference wherever required but I am fine if you and others
feel strongly about moving this to SaveSlotToPath().

+
+sub compare_confirmed_flush
+{
+    my ($node, $confirmed_flush_from_log) = @_;
+
+    # Fetch Latest checkpoint location from the control file
+    my ($stdout, $stderr) =
+      run_command([ 'pg_controldata', $node->data_dir ]);
+    my @control_data      = split("\n", $stdout);
+    my $latest_checkpoint = undef;
+    foreach (@control_data)
+    {
+        if ($_ =~ /^Latest checkpoint location:\s*(.*)$/mg)
+        {
+            $latest_checkpoint = $1;
+            last;
+        }
+    }
+    die "Latest checkpoint location not found in control file\n"
+      unless defined($latest_checkpoint);
+
+    # Is it same as the value read from log?
+    ok( $latest_checkpoint eq $confirmed_flush_from_log,
+        "Check that the slot's confirmed_flush LSN is the same as the
latest_checkpoint location"
+    );

This function assumes that the subscriber will receive and confirm WAL upto
checkpoint location and publisher's WAL sender will update it in the slot.
Where is the code to ensure that? Does the WAL sender process wait for
checkpoint
LSN to be confirmed when shutting down?

Note, that we need to compare if all the WAL before the
shutdown_checkpoint WAL record is sent. Before the clean shutdown, we
do ensure that all the pending WAL is confirmed back. See the use of
WalSndDone() in WalSndLoop().

+
+# Restart the publisher to ensure that the slot will be persisted if required
+$node_publisher->restart();
Can we add this test comparing LSNs after publisher restart, to an existing
test itself - like basic replication. That's the only extra thing that this
test does beyond usual replication stuff.

As this is a test after the restart of the server, I thought to keep
it with recovery tests. However, I think once the upgrade (of
publisher nodes) patch is ready, we should keep this test with those
tests or somehow merge it with those tests but till that patch is
ready, let's keep this as a separate test.

+
+# Wait until the walsender creates decoding context
+$node_publisher->wait_for_log(
+    qr/Streaming transactions committing after
([A-F0-9]+\/[A-F0-9]+), reading WAL from ([A-F0-9]+\/[A-F0-9]+)./,
+    $offset);
+
+# Extract confirmed_flush from the logfile
+my $log_contents = slurp_file($node_publisher->logfile, $offset);
+$log_contents =~
+  qr/Streaming transactions committing after ([A-F0-9]+\/[A-F0-9]+),
reading WAL from ([A-F0-9]+\/[A-F0-9]+)./
+  or die "could not get confirmed_flush_lsn";
Why are we figuring out the LSN from the log file? Is it not available from
pg_replication_slots view? If we do so, I think this test will fail is the slot
gets written after the restart because of concurrent activity on the publisher
(like autovacuum, or other things that cause empty transaction to be
replicated) and subscriber. A very rare chance but not 0 probability one.

Yes, that is a possibility.

I
think we should shut down subscriber, restart publisher and then make this
check based on the contents of the replication slot instead of server log.
Shutting down subscriber will ensure that the subscriber won't send any new
confirmed flush location to the publisher after restart.

But if we shutdown the subscriber before the publisher there is no
guarantee that the publisher has sent all outstanding logs up to the
shutdown checkpoint record (i.e., the latest record). Such a guarantee
can only be there if we do a clean shutdown of the publisher before
the subscriber.

All the places which call ReplicationSlotSave() mark the slot as dirty. All
the places where SaveSlotToPath() is called, the slot is marked dirty except
when calling from CheckPointReplicationSlots(). So I am wondering whether we
should be marking the slot dirty in CheckPointReplicationSlots() and avoid
passing down is_shutdown flag to SaveSlotToPath().

I feel that will add another spinlock acquire/release pair without
much benefit. Sure, it may not be performance-sensitive but still
adding another pair of lock/release doesn't seem like a better idea.

--
With Regards,
Amit Kapila.

#25

ashutosh.bapat.oss@gmail.com

over 2 years ago

In reply to: Amit Kapila (#24)

Re: persist logical slots to disk during shutdown checkpoint

On Thu, Aug 31, 2023 at 12:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

+
+    /*
+     * We won't ensure that the slot is persisted after the confirmed_flush
+     * LSN is updated as that could lead to frequent writes.  However, we need
+     * to ensure that we do persist the slots at the time of shutdown whose
+     * confirmed_flush LSN is changed since we last saved the slot to disk.
+     * This will help in avoiding retreat of the confirmed_flush LSN after
+     * restart.  This variable is used to track the last saved confirmed_flush
+     * LSN value.
+     */
This comment makes more sense in SaveSlotToPath() than here. We may decide to
use last_saved_confirmed_flush for something else in future.
I have kept it here because it contains some information that is not
specific to SaveSlotToPath. So, it seems easier to follow the whole
theory if we keep it at the central place in the structure and then
add the reference wherever required but I am fine if you and others
feel strongly about moving this to SaveSlotToPath().

Saving slot to disk happens only in SaveSlotToPath, so except the last
sentence rest of the comment makes sense in SaveSlotToPath().

This function assumes that the subscriber will receive and confirm WAL upto
checkpoint location and publisher's WAL sender will update it in the slot.
Where is the code to ensure that? Does the WAL sender process wait for
checkpoint
LSN to be confirmed when shutting down?

Note, that we need to compare if all the WAL before the
shutdown_checkpoint WAL record is sent. Before the clean shutdown, we
do ensure that all the pending WAL is confirmed back. See the use of
WalSndDone() in WalSndLoop().

Ok. Thanks for pointing that out to me.

I
think we should shut down subscriber, restart publisher and then make this
check based on the contents of the replication slot instead of server log.
Shutting down subscriber will ensure that the subscriber won't send any new
confirmed flush location to the publisher after restart.

But if we shutdown the subscriber before the publisher there is no
guarantee that the publisher has sent all outstanding logs up to the
shutdown checkpoint record (i.e., the latest record). Such a guarantee
can only be there if we do a clean shutdown of the publisher before
the subscriber.

So the sequence is shutdown publisher node, shutdown subscriber node,
start publisher node and carry out the checks.

All the places which call ReplicationSlotSave() mark the slot as dirty. All
the places where SaveSlotToPath() is called, the slot is marked dirty except
when calling from CheckPointReplicationSlots(). So I am wondering whether we
should be marking the slot dirty in CheckPointReplicationSlots() and avoid
passing down is_shutdown flag to SaveSlotToPath().

I feel that will add another spinlock acquire/release pair without
much benefit. Sure, it may not be performance-sensitive but still
adding another pair of lock/release doesn't seem like a better idea.

We call ReplicatioinSlotMarkDirty() followed by ReplicationSlotSave()
at all the places, even those which are more frequent than this. So I
think it's better to stick to that protocol rather than adding a new
flag.

--
Best Wishes,
Ashutosh Bapat

#26

amit.kapila16@gmail.com

over 2 years ago

In reply to: Ashutosh Bapat (#25)

Re: persist logical slots to disk during shutdown checkpoint

On Thu, Aug 31, 2023 at 12:25 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

On Thu, Aug 31, 2023 at 12:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
+
+    /*
+     * We won't ensure that the slot is persisted after the confirmed_flush
+     * LSN is updated as that could lead to frequent writes.  However, we need
+     * to ensure that we do persist the slots at the time of shutdown whose
+     * confirmed_flush LSN is changed since we last saved the slot to disk.
+     * This will help in avoiding retreat of the confirmed_flush LSN after
+     * restart.  This variable is used to track the last saved confirmed_flush
+     * LSN value.
+     */
This comment makes more sense in SaveSlotToPath() than here. We may decide to
use last_saved_confirmed_flush for something else in future.
I have kept it here because it contains some information that is not
specific to SaveSlotToPath. So, it seems easier to follow the whole
theory if we keep it at the central place in the structure and then
add the reference wherever required but I am fine if you and others
feel strongly about moving this to SaveSlotToPath().
Saving slot to disk happens only in SaveSlotToPath, so except the last
sentence rest of the comment makes sense in SaveSlotToPath().

This function assumes that the subscriber will receive and confirm WAL upto
checkpoint location and publisher's WAL sender will update it in the slot.
Where is the code to ensure that? Does the WAL sender process wait for
checkpoint
LSN to be confirmed when shutting down?

Note, that we need to compare if all the WAL before the
shutdown_checkpoint WAL record is sent. Before the clean shutdown, we
do ensure that all the pending WAL is confirmed back. See the use of
WalSndDone() in WalSndLoop().

Ok. Thanks for pointing that out to me.

I
think we should shut down subscriber, restart publisher and then make this
check based on the contents of the replication slot instead of server log.
Shutting down subscriber will ensure that the subscriber won't send any new
confirmed flush location to the publisher after restart.

But if we shutdown the subscriber before the publisher there is no
guarantee that the publisher has sent all outstanding logs up to the
shutdown checkpoint record (i.e., the latest record). Such a guarantee
can only be there if we do a clean shutdown of the publisher before
the subscriber.

So the sequence is shutdown publisher node, shutdown subscriber node,
start publisher node and carry out the checks.

This can probably work but I still prefer the current approach as that
will be closer to the ideal values on the disk instead of comparison
with a later in-memory value of confirmed_flush LSN. Ideally, if we
would have a tool like pg_replslotdata which can read the on-disk
state of slots that would be better but missing that, the current one
sounds like the next best possibility. Do you see any problem with the
current approach of test?

BTW, I think we can keep autovacuum = off for this test just to avoid
any extra record generation even though that doesn't matter for the
purpose of test.

All the places which call ReplicationSlotSave() mark the slot as dirty. All
the places where SaveSlotToPath() is called, the slot is marked dirty except
when calling from CheckPointReplicationSlots(). So I am wondering whether we
should be marking the slot dirty in CheckPointReplicationSlots() and avoid
passing down is_shutdown flag to SaveSlotToPath().

I feel that will add another spinlock acquire/release pair without
much benefit. Sure, it may not be performance-sensitive but still
adding another pair of lock/release doesn't seem like a better idea.

We call ReplicatioinSlotMarkDirty() followed by ReplicationSlotSave()
at all the places, even those which are more frequent than this.

All but one. Normally, the idea of marking dirty is to indicate that
we will actually write/flush the contents at a later point (except
when required for correctness) as even indicated in the comments atop
ReplicatioinSlotMarkDirty(). However, I see your point that we use
that protocol at all the current places including CreateSlotOnDisk().
So, we can probably do it here as well.

--
With Regards,
Amit Kapila.

#27

ashutosh.bapat.oss@gmail.com

over 2 years ago

In reply to: Amit Kapila (#26)

Re: persist logical slots to disk during shutdown checkpoint

On Thu, Aug 31, 2023 at 2:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I
think we should shut down subscriber, restart publisher and then make this
check based on the contents of the replication slot instead of server log.
Shutting down subscriber will ensure that the subscriber won't send any new
confirmed flush location to the publisher after restart.

But if we shutdown the subscriber before the publisher there is no
guarantee that the publisher has sent all outstanding logs up to the
shutdown checkpoint record (i.e., the latest record). Such a guarantee
can only be there if we do a clean shutdown of the publisher before
the subscriber.

So the sequence is shutdown publisher node, shutdown subscriber node,
start publisher node and carry out the checks.

This can probably work but I still prefer the current approach as that
will be closer to the ideal values on the disk instead of comparison
with a later in-memory value of confirmed_flush LSN. Ideally, if we
would have a tool like pg_replslotdata which can read the on-disk
state of slots that would be better but missing that, the current one
sounds like the next best possibility. Do you see any problem with the
current approach of test?

+ qr/Streaming transactions committing after ([A-F0-9]+\/[A-F0-9]+),
reading WAL from ([A-F0-9]+\/[A-F0-9]+)./

I don't think the LSN reported in this message is guaranteed to be the
confirmed_flush LSN of the slot. It's usually confirmed_flush but not
always. It's the LSN that snapshot builder computes based on factors
including confirmed_flush. There's a chance that this test will fail
sometimes because of this behaviour. Reading directly from
replication slot is better that this. pg_replslotdata might help if we
read replication slot content between shutdown and restart of
publisher.

BTW, I think we can keep autovacuum = off for this test just to avoid
any extra record generation even though that doesn't matter for the
purpose of test.

Autovacuum is one thing, but we can't guarantee the absence of any
concurrent activity forever.

All the places which call ReplicationSlotSave() mark the slot as dirty. All
the places where SaveSlotToPath() is called, the slot is marked dirty except
when calling from CheckPointReplicationSlots(). So I am wondering whether we
should be marking the slot dirty in CheckPointReplicationSlots() and avoid
passing down is_shutdown flag to SaveSlotToPath().

I feel that will add another spinlock acquire/release pair without
much benefit. Sure, it may not be performance-sensitive but still
adding another pair of lock/release doesn't seem like a better idea.

We call ReplicatioinSlotMarkDirty() followed by ReplicationSlotSave()
at all the places, even those which are more frequent than this.

All but one. Normally, the idea of marking dirty is to indicate that
we will actually write/flush the contents at a later point (except
when required for correctness) as even indicated in the comments atop
ReplicatioinSlotMarkDirty(). However, I see your point that we use
that protocol at all the current places including CreateSlotOnDisk().
So, we can probably do it here as well.

yes

I didn't see this entry in commitfest. Since we are discussing it and
the next CF is about to begin, probably it's good to add one there.
--
Best Wishes,
Ashutosh Bapat

#28

amit.kapila16@gmail.com

over 2 years ago

In reply to: Ashutosh Bapat (#27)

Re: persist logical slots to disk during shutdown checkpoint

On Thu, Aug 31, 2023 at 6:12 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

On Thu, Aug 31, 2023 at 2:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I
think we should shut down subscriber, restart publisher and then make this
check based on the contents of the replication slot instead of server log.
Shutting down subscriber will ensure that the subscriber won't send any new
confirmed flush location to the publisher after restart.

But if we shutdown the subscriber before the publisher there is no
guarantee that the publisher has sent all outstanding logs up to the
shutdown checkpoint record (i.e., the latest record). Such a guarantee
can only be there if we do a clean shutdown of the publisher before
the subscriber.

So the sequence is shutdown publisher node, shutdown subscriber node,
start publisher node and carry out the checks.

This can probably work but I still prefer the current approach as that
will be closer to the ideal values on the disk instead of comparison
with a later in-memory value of confirmed_flush LSN. Ideally, if we
would have a tool like pg_replslotdata which can read the on-disk
state of slots that would be better but missing that, the current one
sounds like the next best possibility. Do you see any problem with the
current approach of test?

+ qr/Streaming transactions committing after ([A-F0-9]+\/[A-F0-9]+),
reading WAL from ([A-F0-9]+\/[A-F0-9]+)./

I don't think the LSN reported in this message is guaranteed to be the
confirmed_flush LSN of the slot. It's usually confirmed_flush but not
always. It's the LSN that snapshot builder computes based on factors
including confirmed_flush. There's a chance that this test will fail
sometimes because of this behaviour.

I think I am missing something here because as per my understanding,
the LOG referred by the test is generated in CreateDecodingContext()
before which we shouldn't be changing the slot's confirmed_flush LSN.
The LOG [1]errdetail("Streaming transactions committing after %X/%X, reading WAL from %X/%X.", LSN_FORMAT_ARGS(slot->data.confirmed_flush), refers to the slot's persistent value for confirmed_flush,
so how it could be different from what the test is expecting.

[1]: errdetail("Streaming transactions committing after %X/%X, reading WAL from %X/%X.", LSN_FORMAT_ARGS(slot->data.confirmed_flush),
errdetail("Streaming transactions committing after %X/%X, reading WAL
from %X/%X.",
LSN_FORMAT_ARGS(slot->data.confirmed_flush),

--
With Regards,
Amit Kapila.

#29

amit.kapila16@gmail.com

over 2 years ago

In reply to: Ashutosh Bapat (#27)

Re: persist logical slots to disk during shutdown checkpoint

On Thu, Aug 31, 2023 at 6:12 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

On Thu, Aug 31, 2023 at 2:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

All but one. Normally, the idea of marking dirty is to indicate that
we will actually write/flush the contents at a later point (except
when required for correctness) as even indicated in the comments atop
ReplicatioinSlotMarkDirty(). However, I see your point that we use
that protocol at all the current places including CreateSlotOnDisk().
So, we can probably do it here as well.

yes

I think we should also ensure that slots are not invalidated
(slot.data.invalidated != RS_INVAL_NONE) before marking them dirty
because we don't allow decoding from such slots, so we shouldn't
include those.

--
With Regards,
Amit Kapila.

#30

vignesh21@gmail.com

over 2 years ago

In reply to: Amit Kapila (#29)

1 attachment(s)

Re: persist logical slots to disk during shutdown checkpoint

On Fri, 1 Sept 2023 at 10:06, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Aug 31, 2023 at 6:12 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

On Thu, Aug 31, 2023 at 2:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

All but one. Normally, the idea of marking dirty is to indicate that
we will actually write/flush the contents at a later point (except
when required for correctness) as even indicated in the comments atop
ReplicatioinSlotMarkDirty(). However, I see your point that we use
that protocol at all the current places including CreateSlotOnDisk().
So, we can probably do it here as well.

yes

I think we should also ensure that slots are not invalidated
(slot.data.invalidated != RS_INVAL_NONE) before marking them dirty
because we don't allow decoding from such slots, so we shouldn't
include those.

Added this check.

Apart from this I have also fixed the following issues that were
agreed on: a) Setting slots to dirty in CheckPointReplicationSlots
instead of setting it in SaveSlotToPath b) The comments were moved
from ReplicationSlot and moved to CheckPointReplicationSlots c) Tests
will be run in autovacuum = off d) updating last_saved_confirmed_flush
based on cp.slotdata.confirmed_flush rather than
slot->data.confirmed_flush.
I have also added the commitfest entry for this at [1]https://commitfest.postgresql.org/44/4536/.

Thanks to Ashutosh/Amit for the feedback.
Attached v7 version patch has the changes for the same.
[1]: https://commitfest.postgresql.org/44/4536/

Regards,
Vignesh

Attachments:

v7-0001-Persist-logical-slots-to-disk-during-a-shutdown-c.patchapplication/x-patch; name=v7-0001-Persist-logical-slots-to-disk-during-a-shutdown-c.patchDownload

From 328f500c8d2a1980415c889abcdc398741bebe10 Mon Sep 17 00:00:00 2001
From: Julien Rouhaud <julien.rouhaud@free.fr>
Date: Fri, 14 Apr 2023 13:49:09 +0800
Subject: [PATCH v7] Persist logical slots to disk during a shutdown checkpoint
 if required.

It's entirely possible for a logical slot to have a confirmed_flush LSN
higher than the last value saved on disk while not being marked as dirty.
Currently, it is not a major problem but a later patch adding support for
the upgrade of slots relies on that value being properly persisted to disk.

It can also help with avoiding processing the same transactions again in
some boundary cases after the clean shutdown and restart. Say, we process
some transactions for which we didn't send anything downstream (the
changes got filtered) but the confirm_flush LSN is updated due to
keepalives. As we don't flush the latest value of confirm_flush LSN,
it may lead to processing the same changes again.

Author: Julien Rouhaud, Vignesh C, Kuroda Hayato based on suggestions by
Ashutosh Bapat
Reviewed-by: Amit Kapila, Peter Smith
Discussion: http://postgr.es/m/CAA4eK1JzJagMmb_E8D4au=GYQkxox0AfNBm1FbP7sy7t4YWXPQ@mail.gmail.com
Discussion: http://postgr.es/m/TYAPR01MB58664C81887B3AF2EB6B16E3F5939@TYAPR01MB5866.jpnprd01.prod.outlook.com
---
 src/backend/access/transam/xlog.c             |   2 +-
 src/backend/replication/slot.c                |  28 ++++-
 src/include/replication/slot.h                |   5 +-
 src/test/recovery/meson.build                 |   1 +
 .../t/038_save_logical_slots_shutdown.pl      | 102 ++++++++++++++++++
 5 files changed, 132 insertions(+), 6 deletions(-)
 create mode 100644 src/test/recovery/t/038_save_logical_slots_shutdown.pl

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f6f8adc72a..f26c8d18a6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7039,7 +7039,7 @@ static void
 CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 {
 	CheckPointRelationMap();
-	CheckPointReplicationSlots();
+	CheckPointReplicationSlots(flags & CHECKPOINT_IS_SHUTDOWN);
 	CheckPointSnapBuild();
 	CheckPointLogicalRewriteHeap();
 	CheckPointReplicationOrigin();
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index bb09c4010f..185a58de4c 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -321,6 +321,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->candidate_xmin_lsn = InvalidXLogRecPtr;
 	slot->candidate_restart_valid = InvalidXLogRecPtr;
 	slot->candidate_restart_lsn = InvalidXLogRecPtr;
+	slot->last_saved_confirmed_flush = InvalidXLogRecPtr;
 
 	/*
 	 * Create the slot on disk.  We haven't actually marked the slot allocated
@@ -1572,11 +1573,10 @@ restart:
 /*
  * Flush all replication slots to disk.
  *
- * This needn't actually be part of a checkpoint, but it's a convenient
- * location.
+ * is_shutdown is true in case of a shutdown checkpoint.
  */
 void
-CheckPointReplicationSlots(void)
+CheckPointReplicationSlots(bool is_shutdown)
 {
 	int			i;
 
@@ -1601,6 +1601,24 @@ CheckPointReplicationSlots(void)
 
 		/* save the slot to disk, locking is handled in SaveSlotToPath() */
 		sprintf(path, "pg_replslot/%s", NameStr(s->data.name));
+
+		/*
+		 * We won't ensure that the slot is persisted after the
+		 * confirmed_flush LSN is updated as that could lead to frequent
+		 * writes.  However, we need to ensure that we do persist the slots at
+		 * the time of shutdown whose confirmed_flush LSN is changed since we
+		 * last saved the slot to disk. This will help in avoiding retreat of
+		 * the confirmed_flush LSN after restart.
+		 */
+		if (is_shutdown && SlotIsLogical(s))
+		{
+			SpinLockAcquire(&s->mutex);
+			if (s->data.invalidated == RS_INVAL_NONE &&
+				s->data.confirmed_flush != s->last_saved_confirmed_flush)
+				s->dirty = true;
+			SpinLockRelease(&s->mutex);
+		}
+
 		SaveSlotToPath(s, path, LOG);
 	}
 	LWLockRelease(ReplicationSlotAllocationLock);
@@ -1873,11 +1891,12 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 
 	/*
 	 * Successfully wrote, unset dirty bit, unless somebody dirtied again
-	 * already.
+	 * already and remember the confirmed_flush LSN value.
 	 */
 	SpinLockAcquire(&slot->mutex);
 	if (!slot->just_dirtied)
 		slot->dirty = false;
+	slot->last_saved_confirmed_flush = cp.slotdata.confirmed_flush;
 	SpinLockRelease(&slot->mutex);
 
 	LWLockRelease(&slot->io_in_progress_lock);
@@ -2074,6 +2093,7 @@ RestoreSlotFromDisk(const char *name)
 		/* initialize in memory state */
 		slot->effective_xmin = cp.slotdata.xmin;
 		slot->effective_catalog_xmin = cp.slotdata.catalog_xmin;
+		slot->last_saved_confirmed_flush = cp.slotdata.confirmed_flush;
 
 		slot->candidate_catalog_xmin = InvalidTransactionId;
 		slot->candidate_xmin_lsn = InvalidXLogRecPtr;
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index a8a89dc784..da8978342a 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -178,6 +178,9 @@ typedef struct ReplicationSlot
 	XLogRecPtr	candidate_xmin_lsn;
 	XLogRecPtr	candidate_restart_valid;
 	XLogRecPtr	candidate_restart_lsn;
+
+	/* This is used to track the last saved confirmed_flush LSN value */
+	XLogRecPtr	last_saved_confirmed_flush;
 } ReplicationSlot;
 
 #define SlotIsPhysical(slot) ((slot)->data.database == InvalidOid)
@@ -241,7 +244,7 @@ extern void ReplicationSlotNameForTablesync(Oid suboid, Oid relid, char *syncslo
 extern void ReplicationSlotDropAtPubNode(WalReceiverConn *wrconn, char *slotname, bool missing_ok);
 
 extern void StartupReplicationSlots(void);
-extern void CheckPointReplicationSlots(void);
+extern void CheckPointReplicationSlots(bool is_shutdown);
 
 extern void CheckSlotRequirements(void);
 extern void CheckSlotPermissions(void);
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index e7328e4894..646d6ffde4 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -43,6 +43,7 @@ tests += {
       't/035_standby_logical_decoding.pl',
       't/036_truncated_dropped.pl',
       't/037_invalid_database.pl',
+      't/038_save_logical_slots_shutdown.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/038_save_logical_slots_shutdown.pl b/src/test/recovery/t/038_save_logical_slots_shutdown.pl
new file mode 100644
index 0000000000..224a840a61
--- /dev/null
+++ b/src/test/recovery/t/038_save_logical_slots_shutdown.pl
@@ -0,0 +1,102 @@
+
+# Copyright (c) 2023, PostgreSQL Global Development Group
+
+# Test logical replication slots are always persisted to disk during a shutdown
+# checkpoint.
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+sub compare_confirmed_flush
+{
+	my ($node, $confirmed_flush_from_log) = @_;
+
+	# Fetch Latest checkpoint location from the control file
+	my ($stdout, $stderr) =
+	  run_command([ 'pg_controldata', $node->data_dir ]);
+	my @control_data      = split("\n", $stdout);
+	my $latest_checkpoint = undef;
+	foreach (@control_data)
+	{
+		if ($_ =~ /^Latest checkpoint location:\s*(.*)$/mg)
+		{
+			$latest_checkpoint = $1;
+			last;
+		}
+	}
+	die "Latest checkpoint location not found in control file\n"
+	  unless defined($latest_checkpoint);
+
+	# Is it same as the value read from log?
+	ok( $latest_checkpoint eq $confirmed_flush_from_log,
+		"Check that the slot's confirmed_flush LSN is the same as the latest_checkpoint location"
+	);
+
+	return;
+}
+
+# Initialize publisher node
+my $node_publisher = PostgreSQL::Test::Cluster->new('pub');
+$node_publisher->init(allows_streaming => 'logical');
+# Avoid checkpoint during the test, otherwise, the latest checkpoint location
+# will change.
+$node_publisher->append_conf(
+	'postgresql.conf', q{
+checkpoint_timeout = 1h
+autovacuum = off
+});
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = PostgreSQL::Test::Cluster->new('sub');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create tables
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tbl (id int)");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tbl (id int)");
+
+# Insert some data
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tbl VALUES (generate_series(1, 5));");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION pub FOR ALL TABLES");
+$node_subscriber->safe_psql('postgres',
+	"CREATE SUBSCRIPTION sub CONNECTION '$publisher_connstr' PUBLICATION pub"
+);
+
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'sub');
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tbl");
+
+is($result, qq(5), "check initial copy was done");
+
+my $offset = -s $node_publisher->logfile;
+
+# Restart the publisher to ensure that the slot will be persisted if required
+$node_publisher->restart();
+
+# Wait until the walsender creates decoding context
+$node_publisher->wait_for_log(
+	qr/Streaming transactions committing after ([A-F0-9]+\/[A-F0-9]+), reading WAL from ([A-F0-9]+\/[A-F0-9]+)./,
+	$offset);
+
+# Extract confirmed_flush from the logfile
+my $log_contents = slurp_file($node_publisher->logfile, $offset);
+$log_contents =~
+  qr/Streaming transactions committing after ([A-F0-9]+\/[A-F0-9]+), reading WAL from ([A-F0-9]+\/[A-F0-9]+)./
+  or die "could not get confirmed_flush_lsn";
+
+# Ensure that the slot's confirmed_flush LSN is the same as the
+# latest_checkpoint location.
+compare_confirmed_flush($node_publisher, $1);
+
+done_testing();
-- 
2.34.1

#31

ashutosh.bapat.oss@gmail.com

over 2 years ago

In reply to: Amit Kapila (#28)

Re: persist logical slots to disk during shutdown checkpoint

On Thu, Aug 31, 2023 at 7:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Aug 31, 2023 at 6:12 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

On Thu, Aug 31, 2023 at 2:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I
think we should shut down subscriber, restart publisher and then make this
check based on the contents of the replication slot instead of server log.
Shutting down subscriber will ensure that the subscriber won't send any new
confirmed flush location to the publisher after restart.

But if we shutdown the subscriber before the publisher there is no
guarantee that the publisher has sent all outstanding logs up to the
shutdown checkpoint record (i.e., the latest record). Such a guarantee
can only be there if we do a clean shutdown of the publisher before
the subscriber.

So the sequence is shutdown publisher node, shutdown subscriber node,
start publisher node and carry out the checks.

This can probably work but I still prefer the current approach as that
will be closer to the ideal values on the disk instead of comparison
with a later in-memory value of confirmed_flush LSN. Ideally, if we
would have a tool like pg_replslotdata which can read the on-disk
state of slots that would be better but missing that, the current one
sounds like the next best possibility. Do you see any problem with the
current approach of test?

+ qr/Streaming transactions committing after ([A-F0-9]+\/[A-F0-9]+),
reading WAL from ([A-F0-9]+\/[A-F0-9]+)./

I don't think the LSN reported in this message is guaranteed to be the
confirmed_flush LSN of the slot. It's usually confirmed_flush but not
always. It's the LSN that snapshot builder computes based on factors
including confirmed_flush. There's a chance that this test will fail
sometimes because of this behaviour.

I think I am missing something here because as per my understanding,
the LOG referred by the test is generated in CreateDecodingContext()
before which we shouldn't be changing the slot's confirmed_flush LSN.
The LOG [1] refers to the slot's persistent value for confirmed_flush,
so how it could be different from what the test is expecting.

[1]
errdetail("Streaming transactions committing after %X/%X, reading WAL
from %X/%X.",
LSN_FORMAT_ARGS(slot->data.confirmed_flush),

I was afraid that we may move confirmed_flush while creating the
snapshot builder when creating the decoding context. But I don't see
any code doing that. So may be we are safe. But if the log message
changes, this test would fail - depending upon the log message looks a
bit fragile, esp. when we have a way to access the data directly
reliably.

--
Best Wishes,
Ashutosh Bapat

#32

amit.kapila16@gmail.com

over 2 years ago

In reply to: Ashutosh Bapat (#31)

Re: persist logical slots to disk during shutdown checkpoint

On Fri, Sep 1, 2023 at 1:11 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

On Thu, Aug 31, 2023 at 7:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Aug 31, 2023 at 6:12 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

+ qr/Streaming transactions committing after ([A-F0-9]+\/[A-F0-9]+),
reading WAL from ([A-F0-9]+\/[A-F0-9]+)./

I don't think the LSN reported in this message is guaranteed to be the
confirmed_flush LSN of the slot. It's usually confirmed_flush but not
always. It's the LSN that snapshot builder computes based on factors
including confirmed_flush. There's a chance that this test will fail
sometimes because of this behaviour.

I think I am missing something here because as per my understanding,
the LOG referred by the test is generated in CreateDecodingContext()
before which we shouldn't be changing the slot's confirmed_flush LSN.
The LOG [1] refers to the slot's persistent value for confirmed_flush,
so how it could be different from what the test is expecting.

[1]
errdetail("Streaming transactions committing after %X/%X, reading WAL
from %X/%X.",
LSN_FORMAT_ARGS(slot->data.confirmed_flush),

I was afraid that we may move confirmed_flush while creating the
snapshot builder when creating the decoding context. But I don't see
any code doing that. So may be we are safe.

We are safe in that respect. As far as I understand there is no reason
to be worried.

But if the log message

changes, this test would fail - depending upon the log message looks a
bit fragile, esp. when we have a way to access the data directly
reliably.

This message is there from the very begining (b89e1510) and I can't
forsee a reason to change such a message. But even if we change, we
can always change the test output or test accordingly, if required. I
think it is a matter of preference to which way we can write the test,
so let's not argue too much on this. I find current way slightly more
reliable but we can change it if we see any problem.

--
With Regards,
Amit Kapila.

#33

amit.kapila16@gmail.com

over 2 years ago

In reply to: vignesh C (#30)

Re: persist logical slots to disk during shutdown checkpoint

On Fri, Sep 1, 2023 at 10:50 AM vignesh C <vignesh21@gmail.com> wrote:

On Fri, 1 Sept 2023 at 10:06, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Aug 31, 2023 at 6:12 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

On Thu, Aug 31, 2023 at 2:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

All but one. Normally, the idea of marking dirty is to indicate that
we will actually write/flush the contents at a later point (except
when required for correctness) as even indicated in the comments atop
ReplicatioinSlotMarkDirty(). However, I see your point that we use
that protocol at all the current places including CreateSlotOnDisk().
So, we can probably do it here as well.

yes

I think we should also ensure that slots are not invalidated
(slot.data.invalidated != RS_INVAL_NONE) before marking them dirty
because we don't allow decoding from such slots, so we shouldn't
include those.

Added this check.

Apart from this I have also fixed the following issues that were
agreed on: a) Setting slots to dirty in CheckPointReplicationSlots
instead of setting it in SaveSlotToPath

+ if (is_shutdown && SlotIsLogical(s))
+ {
+ SpinLockAcquire(&s->mutex);
+ if (s->data.invalidated == RS_INVAL_NONE &&
+ s->data.confirmed_flush != s->last_saved_confirmed_flush)
+ s->dirty = true;

I think it is better to use ReplicationSlotMarkDirty() as that would
be consistent with all other usages.

--
With Regards,
Amit Kapila.

#34

vignesh21@gmail.com

over 2 years ago

In reply to: Amit Kapila (#33)

Re: persist logical slots to disk during shutdown checkpoint

On Mon, 4 Sept 2023 at 15:20, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Sep 1, 2023 at 10:50 AM vignesh C <vignesh21@gmail.com> wrote:

On Fri, 1 Sept 2023 at 10:06, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Aug 31, 2023 at 6:12 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

On Thu, Aug 31, 2023 at 2:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

All but one. Normally, the idea of marking dirty is to indicate that
we will actually write/flush the contents at a later point (except
when required for correctness) as even indicated in the comments atop
ReplicatioinSlotMarkDirty(). However, I see your point that we use
that protocol at all the current places including CreateSlotOnDisk().
So, we can probably do it here as well.

yes

I think we should also ensure that slots are not invalidated
(slot.data.invalidated != RS_INVAL_NONE) before marking them dirty
because we don't allow decoding from such slots, so we shouldn't
include those.

Added this check.

Apart from this I have also fixed the following issues that were
agreed on: a) Setting slots to dirty in CheckPointReplicationSlots
instead of setting it in SaveSlotToPath
+ if (is_shutdown && SlotIsLogical(s))
+ {
+ SpinLockAcquire(&s->mutex);
+ if (s->data.invalidated == RS_INVAL_NONE &&
+ s->data.confirmed_flush != s->last_saved_confirmed_flush)
+ s->dirty = true;
I think it is better to use ReplicationSlotMarkDirty() as that would
be consistent with all other usages.

ReplicationSlotMarkDirty works only on MyReplicationSlot whereas
CheckpointReplicationSlots loops through all the slots and marks the
appropriate slot as dirty, we might have to change
ReplicationSlotMarkDirty to take the slot as input parameter and all
caller should pass MyReplicationSlot. Another thing is we have already
taken spin lock to access last_confirmed_flush_lsn from
CheckpointReplicationSlots, we could set dirty flag here itself, else
we will have to release the lock and call ReplicationSlotMarkDirty
which will take lock again. Instead shall we set just_dirtied also in
CheckpointReplicationSlots?
Thoughts?

Regards,
Vignesh

#35

Zhijie Hou (Fujitsu)

houzj.fnst@fujitsu.com

over 2 years ago

In reply to: vignesh C (#34)

RE: persist logical slots to disk during shutdown checkpoint

On Monday, September 4, 2023 6:15 PM vignesh C <vignesh21@gmail.com> wrote:

On Mon, 4 Sept 2023 at 15:20, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Sep 1, 2023 at 10:50 AM vignesh C <vignesh21@gmail.com> wrote:

On Fri, 1 Sept 2023 at 10:06, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Aug 31, 2023 at 6:12 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

On Thu, Aug 31, 2023 at 2:52 PM Amit Kapila

<amit.kapila16@gmail.com> wrote:

All but one. Normally, the idea of marking dirty is to
indicate that we will actually write/flush the contents at a
later point (except when required for correctness) as even
indicated in the comments atop ReplicatioinSlotMarkDirty().
However, I see your point that we use that protocol at all the current

places including CreateSlotOnDisk().
So, we can probably do it here as well.

yes

I think we should also ensure that slots are not invalidated
(slot.data.invalidated != RS_INVAL_NONE) before marking them dirty
because we don't allow decoding from such slots, so we shouldn't
include those.

Added this check.

Apart from this I have also fixed the following issues that were
agreed on: a) Setting slots to dirty in CheckPointReplicationSlots
instead of setting it in SaveSlotToPath
+ if (is_shutdown && SlotIsLogical(s)) { SpinLockAcquire(&s->mutex);
+ if (s->data.invalidated == RS_INVAL_NONE &&
+ s->data.confirmed_flush != s->last_saved_confirmed_flush) dirty =
+ s->true;
I think it is better to use ReplicationSlotMarkDirty() as that would
be consistent with all other usages.
ReplicationSlotMarkDirty works only on MyReplicationSlot whereas
CheckpointReplicationSlots loops through all the slots and marks the
appropriate slot as dirty, we might have to change ReplicationSlotMarkDirty to
take the slot as input parameter and all caller should pass MyReplicationSlot.

Personally, I feel if we want to centralize the code of marking dirty into a
function, we can introduce a new static function MarkSlotDirty(slot) to mark
passed slot dirty and let ReplicationSlotMarkDirty and
CheckpointReplicationSlots call it. Like:

void
ReplicationSlotMarkDirty(void)
{
MarkSlotDirty(MyReplicationSlot);
}

+static void
+MarkSlotDirty(ReplicationSlot *slot)
+{
+	Assert(slot != NULL);
+
+	SpinLockAcquire(&slot->mutex);
+	slot->just_dirtied = true;
+	slot->dirty = true;
+	SpinLockRelease(&slot->mutex);
+}

This is somewhat similar to the relation between ReplicationSlotSave(serialize
my backend's replications slot) and SaveSlotToPath(save the passed slot).

Another thing is we have already taken spin lock to access
last_confirmed_flush_lsn from CheckpointReplicationSlots, we could set dirty
flag here itself, else we will have to release the lock and call
ReplicationSlotMarkDirty which will take lock again.

Yes, this is unavoidable, but maybe it's not a big problem as
we only do it at shutdown.

Instead shall we set just_dirtied also in CheckpointReplicationSlots?
Thoughts?

I agree we'd better set just_dirtied to true to ensure we will serialize slot
info here, because if some other processes just serialized the slot, the dirty
flag will be reset to false if we don't set just_dirtied to true in
CheckpointReplicationSlots(), this race condition may not exists for now, but
seems better to completely forbid it by setting just_dirtied.

Best Regards,
Hou zj

#36

amit.kapila16@gmail.com

over 2 years ago

In reply to: Zhijie Hou (Fujitsu) (#35)

Re: persist logical slots to disk during shutdown checkpoint

On Tue, Sep 5, 2023 at 7:54 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Monday, September 4, 2023 6:15 PM vignesh C <vignesh21@gmail.com> wrote:
On Mon, 4 Sept 2023 at 15:20, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Sep 1, 2023 at 10:50 AM vignesh C <vignesh21@gmail.com> wrote:

On Fri, 1 Sept 2023 at 10:06, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Aug 31, 2023 at 6:12 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

On Thu, Aug 31, 2023 at 2:52 PM Amit Kapila

<amit.kapila16@gmail.com> wrote:

All but one. Normally, the idea of marking dirty is to
indicate that we will actually write/flush the contents at a
later point (except when required for correctness) as even
indicated in the comments atop ReplicatioinSlotMarkDirty().
However, I see your point that we use that protocol at all the current

places including CreateSlotOnDisk().
So, we can probably do it here as well.

yes

I think we should also ensure that slots are not invalidated
(slot.data.invalidated != RS_INVAL_NONE) before marking them dirty
because we don't allow decoding from such slots, so we shouldn't
include those.

Added this check.

Apart from this I have also fixed the following issues that were
agreed on: a) Setting slots to dirty in CheckPointReplicationSlots
instead of setting it in SaveSlotToPath
+ if (is_shutdown && SlotIsLogical(s)) { SpinLockAcquire(&s->mutex);
+ if (s->data.invalidated == RS_INVAL_NONE &&
+ s->data.confirmed_flush != s->last_saved_confirmed_flush) dirty =
+ s->true;
I think it is better to use ReplicationSlotMarkDirty() as that would
be consistent with all other usages.
ReplicationSlotMarkDirty works only on MyReplicationSlot whereas
CheckpointReplicationSlots loops through all the slots and marks the
appropriate slot as dirty, we might have to change ReplicationSlotMarkDirty to
take the slot as input parameter and all caller should pass MyReplicationSlot.
Personally, I feel if we want to centralize the code of marking dirty into a
function, we can introduce a new static function MarkSlotDirty(slot) to mark
passed slot dirty and let ReplicationSlotMarkDirty and
CheckpointReplicationSlots call it. Like:

void
ReplicationSlotMarkDirty(void)
{
MarkSlotDirty(MyReplicationSlot);
}
+static void
+MarkSlotDirty(ReplicationSlot *slot)
+{
+       Assert(slot != NULL);
+
+       SpinLockAcquire(&slot->mutex);
+       slot->just_dirtied = true;
+       slot->dirty = true;
+       SpinLockRelease(&slot->mutex);
+}
This is somewhat similar to the relation between ReplicationSlotSave(serialize
my backend's replications slot) and SaveSlotToPath(save the passed slot).

Another thing is we have already taken spin lock to access
last_confirmed_flush_lsn from CheckpointReplicationSlots, we could set dirty
flag here itself, else we will have to release the lock and call
ReplicationSlotMarkDirty which will take lock again.

Yes, this is unavoidable, but maybe it's not a big problem as
we only do it at shutdown.

True but still it doesn't look elegant. I also thought about having a
probably inline function that marks both just_dirty and dirty fields.
However, that requires us to assert that the caller has already
acquired a spinlock. I see a macro SpinLockFree() that might help but
it didn't seem to be used anywhere in the code so not sure if we can
rely on it.

Instead shall we set just_dirtied also in CheckpointReplicationSlots?
Thoughts?

I agree we'd better set just_dirtied to true to ensure we will serialize slot
info here, because if some other processes just serialized the slot, the dirty
flag will be reset to false if we don't set just_dirtied to true in
CheckpointReplicationSlots(), this race condition may not exists for now, but
seems better to completely forbid it by setting just_dirtied.

Agreed, and it is better to close any such possibility because we
can't say with certainty about manual slots. This seems better than
the other ideas we discussed.

--
With Regards,
Amit Kapila.

#37

dilipbalaut@gmail.com

over 2 years ago

In reply to: vignesh C (#30)

Re: persist logical slots to disk during shutdown checkpoint

On Fri, Sep 1, 2023 at 12:16 PM vignesh C <vignesh21@gmail.com> wrote:

On Fri, 1 Sept 2023 at 10:06, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Aug 31, 2023 at 6:12 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

On Thu, Aug 31, 2023 at 2:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

All but one. Normally, the idea of marking dirty is to indicate that
we will actually write/flush the contents at a later point (except
when required for correctness) as even indicated in the comments atop
ReplicatioinSlotMarkDirty(). However, I see your point that we use
that protocol at all the current places including CreateSlotOnDisk().
So, we can probably do it here as well.

yes

I think we should also ensure that slots are not invalidated
(slot.data.invalidated != RS_INVAL_NONE) before marking them dirty
because we don't allow decoding from such slots, so we shouldn't
include those.

Added this check.

Apart from this I have also fixed the following issues that were
agreed on: a) Setting slots to dirty in CheckPointReplicationSlots
instead of setting it in SaveSlotToPath b) The comments were moved
from ReplicationSlot and moved to CheckPointReplicationSlots c) Tests
will be run in autovacuum = off d) updating last_saved_confirmed_flush
based on cp.slotdata.confirmed_flush rather than
slot->data.confirmed_flush.
I have also added the commitfest entry for this at [1].

The overall idea looks fine to me

+
+ /*
+ * We won't ensure that the slot is persisted after the
+ * confirmed_flush LSN is updated as that could lead to frequent
+ * writes.  However, we need to ensure that we do persist the slots at
+ * the time of shutdown whose confirmed_flush LSN is changed since we
+ * last saved the slot to disk. This will help in avoiding retreat of
+ * the confirmed_flush LSN after restart.
+ */
+ if (is_shutdown && SlotIsLogical(s))
+ {
+ SpinLockAcquire(&s->mutex);
+ if (s->data.invalidated == RS_INVAL_NONE &&
+ s->data.confirmed_flush != s->last_saved_confirmed_flush)
+ s->dirty = true;
+ SpinLockRelease(&s->mutex);
+ }

The comments don't mention anything about why we are just flushing at
the shutdown checkpoint. I mean the checkpoint is not that frequent
and we already perform a lot of I/O during checkpoints so isn't it
wise to flush during every checkpoint. We may argue that there is no
extra advantage of that as we are not optimizing for crash recovery
but OTOH there is no reason for not doing so for other checkpoints or
we are worried about the concurrency with parallel walsender running
during non shutdown checkpoint if so then better we explain that as
well? If it is already discussed in the thread and we have a
conclusion on this then maybe we can mention this in comments?

/*
  * Flush all replication slots to disk.
  *
- * This needn't actually be part of a checkpoint, but it's a convenient
- * location.
+ * is_shutdown is true in case of a shutdown checkpoint.
  */
 void
-CheckPointReplicationSlots(void)
+CheckPointReplicationSlots(bool is_shutdown)

It seems we have removed two lines from the function header comments,
is this intentional or accidental?

Other than this patch LGTM.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#38

amit.kapila16@gmail.com

over 2 years ago

In reply to: Dilip Kumar (#37)

Re: persist logical slots to disk during shutdown checkpoint

On Tue, Sep 5, 2023 at 9:10 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Sep 1, 2023 at 12:16 PM vignesh C <vignesh21@gmail.com> wrote:

On Fri, 1 Sept 2023 at 10:06, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Aug 31, 2023 at 6:12 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

On Thu, Aug 31, 2023 at 2:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

All but one. Normally, the idea of marking dirty is to indicate that
we will actually write/flush the contents at a later point (except
when required for correctness) as even indicated in the comments atop
ReplicatioinSlotMarkDirty(). However, I see your point that we use
that protocol at all the current places including CreateSlotOnDisk().
So, we can probably do it here as well.

yes

I think we should also ensure that slots are not invalidated
(slot.data.invalidated != RS_INVAL_NONE) before marking them dirty
because we don't allow decoding from such slots, so we shouldn't
include those.

Added this check.

Apart from this I have also fixed the following issues that were
agreed on: a) Setting slots to dirty in CheckPointReplicationSlots
instead of setting it in SaveSlotToPath b) The comments were moved
from ReplicationSlot and moved to CheckPointReplicationSlots c) Tests
will be run in autovacuum = off d) updating last_saved_confirmed_flush
based on cp.slotdata.confirmed_flush rather than
slot->data.confirmed_flush.
I have also added the commitfest entry for this at [1].

The overall idea looks fine to me
+
+ /*
+ * We won't ensure that the slot is persisted after the
+ * confirmed_flush LSN is updated as that could lead to frequent
+ * writes.  However, we need to ensure that we do persist the slots at
+ * the time of shutdown whose confirmed_flush LSN is changed since we
+ * last saved the slot to disk. This will help in avoiding retreat of
+ * the confirmed_flush LSN after restart.
+ */
+ if (is_shutdown && SlotIsLogical(s))
+ {
+ SpinLockAcquire(&s->mutex);
+ if (s->data.invalidated == RS_INVAL_NONE &&
+ s->data.confirmed_flush != s->last_saved_confirmed_flush)
+ s->dirty = true;
+ SpinLockRelease(&s->mutex);
+ }
The comments don't mention anything about why we are just flushing at
the shutdown checkpoint. I mean the checkpoint is not that frequent
and we already perform a lot of I/O during checkpoints so isn't it
wise to flush during every checkpoint. We may argue that there is no
extra advantage of that as we are not optimizing for crash recovery
but OTOH there is no reason for not doing so for other checkpoints or
we are worried about the concurrency with parallel walsender running
during non shutdown checkpoint if so then better we explain that as
well? If it is already discussed in the thread and we have a
conclusion on this then maybe we can mention this in comments?

The point is that at the time of non-shutdown checkpoints, it is not
clear that there is an extra advantage but we will definitely add
extra I/O for this. Because at other times, we will already be saving
the slot from time to time as the replication makes progress. And, we
also need to flush such slots during shutdown for correctness for some
use cases like upgrades. We can probably add something like: "At other
times, the walsender keeps saving the slot from time to time as the
replication progresses, so there is no clear advantage of flushing
additional slots at the time of checkpoint". Will that work for you?
Having said that, I am not opposed to doing it for non-shutdown
checkpoints if one makes a separate case for it.

--
With Regards,
Amit Kapila.

#39

dilipbalaut@gmail.com

over 2 years ago

In reply to: Amit Kapila (#38)

Re: persist logical slots to disk during shutdown checkpoint

On Tue, Sep 5, 2023 at 9:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Sep 5, 2023 at 9:10 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

The comments don't mention anything about why we are just flushing at
the shutdown checkpoint. I mean the checkpoint is not that frequent
and we already perform a lot of I/O during checkpoints so isn't it
wise to flush during every checkpoint. We may argue that there is no
extra advantage of that as we are not optimizing for crash recovery
but OTOH there is no reason for not doing so for other checkpoints or
we are worried about the concurrency with parallel walsender running
during non shutdown checkpoint if so then better we explain that as
well? If it is already discussed in the thread and we have a
conclusion on this then maybe we can mention this in comments?

The point is that at the time of non-shutdown checkpoints, it is not
clear that there is an extra advantage but we will definitely add
extra I/O for this. Because at other times, we will already be saving
the slot from time to time as the replication makes progress. And, we
also need to flush such slots during shutdown for correctness for some
use cases like upgrades. We can probably add something like: "At other
times, the walsender keeps saving the slot from time to time as the
replication progresses, so there is no clear advantage of flushing
additional slots at the time of checkpoint". Will that work for you?

Yeah that comments will work out, my only concern was because we added
an explicit condition that it should be synced only during shutdown
checkpoint so better comments also explicitly explains the reason.
Anyway I am fine with either way whether we sync at the shutdown
checkpoint or all the checkpoint. Because I/O for slot sync during
checkpoint time should not be a real worry and with that if we can
avoid additional code with extra conditions then it's better because
such code branches will be frequently hit and I think for testability
pov we prefer to add code in common path unless there is some overhead
or it is specifically meant for that branch only.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#40

vignesh21@gmail.com

over 2 years ago

In reply to: Dilip Kumar (#37)

1 attachment(s)

Re: persist logical slots to disk during shutdown checkpoint

On Tue, 5 Sept 2023 at 09:10, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Sep 1, 2023 at 12:16 PM vignesh C <vignesh21@gmail.com> wrote:

On Fri, 1 Sept 2023 at 10:06, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Aug 31, 2023 at 6:12 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

On Thu, Aug 31, 2023 at 2:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

All but one. Normally, the idea of marking dirty is to indicate that
we will actually write/flush the contents at a later point (except
when required for correctness) as even indicated in the comments atop
ReplicatioinSlotMarkDirty(). However, I see your point that we use
that protocol at all the current places including CreateSlotOnDisk().
So, we can probably do it here as well.

yes

I think we should also ensure that slots are not invalidated
(slot.data.invalidated != RS_INVAL_NONE) before marking them dirty
because we don't allow decoding from such slots, so we shouldn't
include those.

Added this check.

Apart from this I have also fixed the following issues that were
agreed on: a) Setting slots to dirty in CheckPointReplicationSlots
instead of setting it in SaveSlotToPath b) The comments were moved
from ReplicationSlot and moved to CheckPointReplicationSlots c) Tests
will be run in autovacuum = off d) updating last_saved_confirmed_flush
based on cp.slotdata.confirmed_flush rather than
slot->data.confirmed_flush.
I have also added the commitfest entry for this at [1].

The overall idea looks fine to me
+
+ /*
+ * We won't ensure that the slot is persisted after the
+ * confirmed_flush LSN is updated as that could lead to frequent
+ * writes.  However, we need to ensure that we do persist the slots at
+ * the time of shutdown whose confirmed_flush LSN is changed since we
+ * last saved the slot to disk. This will help in avoiding retreat of
+ * the confirmed_flush LSN after restart.
+ */
+ if (is_shutdown && SlotIsLogical(s))
+ {
+ SpinLockAcquire(&s->mutex);
+ if (s->data.invalidated == RS_INVAL_NONE &&
+ s->data.confirmed_flush != s->last_saved_confirmed_flush)
+ s->dirty = true;
+ SpinLockRelease(&s->mutex);
+ }
The comments don't mention anything about why we are just flushing at
the shutdown checkpoint. I mean the checkpoint is not that frequent
and we already perform a lot of I/O during checkpoints so isn't it
wise to flush during every checkpoint. We may argue that there is no
extra advantage of that as we are not optimizing for crash recovery
but OTOH there is no reason for not doing so for other checkpoints or
we are worried about the concurrency with parallel walsender running
during non shutdown checkpoint if so then better we explain that as
well? If it is already discussed in the thread and we have a
conclusion on this then maybe we can mention this in comments?

I felt it is better to do this only during the shutdown checkpoint as
in other cases it is being saved periodically as and when the
replication happens. Added comments for the same.

/*
* Flush all replication slots to disk.
*
- * This needn't actually be part of a checkpoint, but it's a convenient
- * location.
+ * is_shutdown is true in case of a shutdown checkpoint.
*/
void
-CheckPointReplicationSlots(void)
+CheckPointReplicationSlots(bool is_shutdown)

It seems we have removed two lines from the function header comments,
is this intentional or accidental?

Modified.
The updated v8 version patch has the changes for the same.

Regards,
Vignesh

Attachments:

v8-0001-Persist-logical-slots-to-disk-during-a-shutdown-c.patchtext/x-patch; charset=US-ASCII; name=v8-0001-Persist-logical-slots-to-disk-during-a-shutdown-c.patchDownload

From 1e498fb2f13cebe49e8e187c664caf71da2aed5e Mon Sep 17 00:00:00 2001
From: Julien Rouhaud <julien.rouhaud@free.fr>
Date: Fri, 14 Apr 2023 13:49:09 +0800
Subject: [PATCH v8] Persist logical slots to disk during a shutdown checkpoint
 if required.

It's entirely possible for a logical slot to have a confirmed_flush LSN
higher than the last value saved on disk while not being marked as dirty.
Currently, it is not a major problem but a later patch adding support for
the upgrade of slots relies on that value being properly persisted to disk.

It can also help avoid processing the same transactions again in some
boundary cases after the clean shutdown and restart. Say, we process some
transactions for which we didn't send anything downstream (the changes got
filtered) but the confirm_flush LSN is updated due to keepalives. As we
don't flush the latest value of confirm_flush LSN, it may lead to
processing the same changes again without this patch.

Author: Vignesh C, Julien Rouhaud, Kuroda Hayato based on suggestions by
Ashutosh Bapat
Reviewed-by: Amit Kapila, Peter Smith, Ashutosh Bapat
Discussion: http://postgr.es/m/CAA4eK1JzJagMmb_E8D4au=GYQkxox0AfNBm1FbP7sy7t4YWXPQ@mail.gmail.com
Discussion: http://postgr.es/m/TYAPR01MB58664C81887B3AF2EB6B16E3F5939@TYAPR01MB5866.jpnprd01.prod.outlook.com
---
 src/backend/access/transam/xlog.c             |   2 +-
 src/backend/replication/slot.c                |  34 +++++-
 src/include/replication/slot.h                |   5 +-
 src/test/recovery/meson.build                 |   1 +
 .../t/038_save_logical_slots_shutdown.pl      | 102 ++++++++++++++++++
 5 files changed, 139 insertions(+), 5 deletions(-)
 create mode 100644 src/test/recovery/t/038_save_logical_slots_shutdown.pl

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f6f8adc72a..f26c8d18a6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7039,7 +7039,7 @@ static void
 CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 {
 	CheckPointRelationMap();
-	CheckPointReplicationSlots();
+	CheckPointReplicationSlots(flags & CHECKPOINT_IS_SHUTDOWN);
 	CheckPointSnapBuild();
 	CheckPointLogicalRewriteHeap();
 	CheckPointReplicationOrigin();
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index bb09c4010f..6559a61753 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -321,6 +321,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->candidate_xmin_lsn = InvalidXLogRecPtr;
 	slot->candidate_restart_valid = InvalidXLogRecPtr;
 	slot->candidate_restart_lsn = InvalidXLogRecPtr;
+	slot->last_saved_confirmed_flush = InvalidXLogRecPtr;
 
 	/*
 	 * Create the slot on disk.  We haven't actually marked the slot allocated
@@ -1573,10 +1574,10 @@ restart:
  * Flush all replication slots to disk.
  *
  * This needn't actually be part of a checkpoint, but it's a convenient
- * location.
+ * location. is_shutdown is true in case of a shutdown checkpoint.
  */
 void
-CheckPointReplicationSlots(void)
+CheckPointReplicationSlots(bool is_shutdown)
 {
 	int			i;
 
@@ -1601,6 +1602,31 @@ CheckPointReplicationSlots(void)
 
 		/* save the slot to disk, locking is handled in SaveSlotToPath() */
 		sprintf(path, "pg_replslot/%s", NameStr(s->data.name));
+
+		/*
+		 * We won't ensure that the slot is persisted after the
+		 * confirmed_flush LSN is updated as that could lead to frequent
+		 * writes.  However, we need to ensure that we do persist the slots at
+		 * the time of shutdown whose confirmed_flush LSN is changed since we
+		 * last saved the slot to disk. This will help in avoiding retreat of
+		 * the confirmed_flush LSN after restart. At other times, the walsender
+		 * keeps saving the slot from time to time as the replication
+		 * progresses, so there is no clear advantage of flushing additional
+		 * slots at the time of checkpoint.
+		 */
+		if (is_shutdown && SlotIsLogical(s))
+		{
+			SpinLockAcquire(&s->mutex);
+			if (s->data.invalidated == RS_INVAL_NONE &&
+				s->data.confirmed_flush != s->last_saved_confirmed_flush)
+			{
+				s->just_dirtied = true;
+				s->dirty = true;
+			}
+
+			SpinLockRelease(&s->mutex);
+		}
+
 		SaveSlotToPath(s, path, LOG);
 	}
 	LWLockRelease(ReplicationSlotAllocationLock);
@@ -1873,11 +1899,12 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 
 	/*
 	 * Successfully wrote, unset dirty bit, unless somebody dirtied again
-	 * already.
+	 * already and remember the confirmed_flush LSN value.
 	 */
 	SpinLockAcquire(&slot->mutex);
 	if (!slot->just_dirtied)
 		slot->dirty = false;
+	slot->last_saved_confirmed_flush = cp.slotdata.confirmed_flush;
 	SpinLockRelease(&slot->mutex);
 
 	LWLockRelease(&slot->io_in_progress_lock);
@@ -2074,6 +2101,7 @@ RestoreSlotFromDisk(const char *name)
 		/* initialize in memory state */
 		slot->effective_xmin = cp.slotdata.xmin;
 		slot->effective_catalog_xmin = cp.slotdata.catalog_xmin;
+		slot->last_saved_confirmed_flush = cp.slotdata.confirmed_flush;
 
 		slot->candidate_catalog_xmin = InvalidTransactionId;
 		slot->candidate_xmin_lsn = InvalidXLogRecPtr;
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index a8a89dc784..da8978342a 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -178,6 +178,9 @@ typedef struct ReplicationSlot
 	XLogRecPtr	candidate_xmin_lsn;
 	XLogRecPtr	candidate_restart_valid;
 	XLogRecPtr	candidate_restart_lsn;
+
+	/* This is used to track the last saved confirmed_flush LSN value */
+	XLogRecPtr	last_saved_confirmed_flush;
 } ReplicationSlot;
 
 #define SlotIsPhysical(slot) ((slot)->data.database == InvalidOid)
@@ -241,7 +244,7 @@ extern void ReplicationSlotNameForTablesync(Oid suboid, Oid relid, char *syncslo
 extern void ReplicationSlotDropAtPubNode(WalReceiverConn *wrconn, char *slotname, bool missing_ok);
 
 extern void StartupReplicationSlots(void);
-extern void CheckPointReplicationSlots(void);
+extern void CheckPointReplicationSlots(bool is_shutdown);
 
 extern void CheckSlotRequirements(void);
 extern void CheckSlotPermissions(void);
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index e7328e4894..646d6ffde4 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -43,6 +43,7 @@ tests += {
       't/035_standby_logical_decoding.pl',
       't/036_truncated_dropped.pl',
       't/037_invalid_database.pl',
+      't/038_save_logical_slots_shutdown.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/038_save_logical_slots_shutdown.pl b/src/test/recovery/t/038_save_logical_slots_shutdown.pl
new file mode 100644
index 0000000000..224a840a61
--- /dev/null
+++ b/src/test/recovery/t/038_save_logical_slots_shutdown.pl
@@ -0,0 +1,102 @@
+
+# Copyright (c) 2023, PostgreSQL Global Development Group
+
+# Test logical replication slots are always persisted to disk during a shutdown
+# checkpoint.
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+sub compare_confirmed_flush
+{
+	my ($node, $confirmed_flush_from_log) = @_;
+
+	# Fetch Latest checkpoint location from the control file
+	my ($stdout, $stderr) =
+	  run_command([ 'pg_controldata', $node->data_dir ]);
+	my @control_data      = split("\n", $stdout);
+	my $latest_checkpoint = undef;
+	foreach (@control_data)
+	{
+		if ($_ =~ /^Latest checkpoint location:\s*(.*)$/mg)
+		{
+			$latest_checkpoint = $1;
+			last;
+		}
+	}
+	die "Latest checkpoint location not found in control file\n"
+	  unless defined($latest_checkpoint);
+
+	# Is it same as the value read from log?
+	ok( $latest_checkpoint eq $confirmed_flush_from_log,
+		"Check that the slot's confirmed_flush LSN is the same as the latest_checkpoint location"
+	);
+
+	return;
+}
+
+# Initialize publisher node
+my $node_publisher = PostgreSQL::Test::Cluster->new('pub');
+$node_publisher->init(allows_streaming => 'logical');
+# Avoid checkpoint during the test, otherwise, the latest checkpoint location
+# will change.
+$node_publisher->append_conf(
+	'postgresql.conf', q{
+checkpoint_timeout = 1h
+autovacuum = off
+});
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = PostgreSQL::Test::Cluster->new('sub');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create tables
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tbl (id int)");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tbl (id int)");
+
+# Insert some data
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tbl VALUES (generate_series(1, 5));");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION pub FOR ALL TABLES");
+$node_subscriber->safe_psql('postgres',
+	"CREATE SUBSCRIPTION sub CONNECTION '$publisher_connstr' PUBLICATION pub"
+);
+
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'sub');
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tbl");
+
+is($result, qq(5), "check initial copy was done");
+
+my $offset = -s $node_publisher->logfile;
+
+# Restart the publisher to ensure that the slot will be persisted if required
+$node_publisher->restart();
+
+# Wait until the walsender creates decoding context
+$node_publisher->wait_for_log(
+	qr/Streaming transactions committing after ([A-F0-9]+\/[A-F0-9]+), reading WAL from ([A-F0-9]+\/[A-F0-9]+)./,
+	$offset);
+
+# Extract confirmed_flush from the logfile
+my $log_contents = slurp_file($node_publisher->logfile, $offset);
+$log_contents =~
+  qr/Streaming transactions committing after ([A-F0-9]+\/[A-F0-9]+), reading WAL from ([A-F0-9]+\/[A-F0-9]+)./
+  or die "could not get confirmed_flush_lsn";
+
+# Ensure that the slot's confirmed_flush LSN is the same as the
+# latest_checkpoint location.
+compare_confirmed_flush($node_publisher, $1);
+
+done_testing();
-- 
2.34.1

#41

dilipbalaut@gmail.com

over 2 years ago

In reply to: Amit Kapila (#36)

Re: persist logical slots to disk during shutdown checkpoint

On Tue, Sep 5, 2023 at 10:12 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Sep 5, 2023 at 7:54 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
On Monday, September 4, 2023 6:15 PM vignesh C <vignesh21@gmail.com> wrote:
On Mon, 4 Sept 2023 at 15:20, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Sep 1, 2023 at 10:50 AM vignesh C <vignesh21@gmail.com> wrote:

On Fri, 1 Sept 2023 at 10:06, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Aug 31, 2023 at 6:12 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

On Thu, Aug 31, 2023 at 2:52 PM Amit Kapila

<amit.kapila16@gmail.com> wrote:

All but one. Normally, the idea of marking dirty is to
indicate that we will actually write/flush the contents at a
later point (except when required for correctness) as even
indicated in the comments atop ReplicatioinSlotMarkDirty().
However, I see your point that we use that protocol at all the current

places including CreateSlotOnDisk().
So, we can probably do it here as well.

yes

I think we should also ensure that slots are not invalidated
(slot.data.invalidated != RS_INVAL_NONE) before marking them dirty
because we don't allow decoding from such slots, so we shouldn't
include those.

Added this check.

Apart from this I have also fixed the following issues that were
agreed on: a) Setting slots to dirty in CheckPointReplicationSlots
instead of setting it in SaveSlotToPath
+ if (is_shutdown && SlotIsLogical(s)) { SpinLockAcquire(&s->mutex);
+ if (s->data.invalidated == RS_INVAL_NONE &&
+ s->data.confirmed_flush != s->last_saved_confirmed_flush) dirty =
+ s->true;
I think it is better to use ReplicationSlotMarkDirty() as that would
be consistent with all other usages.
ReplicationSlotMarkDirty works only on MyReplicationSlot whereas
CheckpointReplicationSlots loops through all the slots and marks the
appropriate slot as dirty, we might have to change ReplicationSlotMarkDirty to
take the slot as input parameter and all caller should pass MyReplicationSlot.
Personally, I feel if we want to centralize the code of marking dirty into a
function, we can introduce a new static function MarkSlotDirty(slot) to mark
passed slot dirty and let ReplicationSlotMarkDirty and
CheckpointReplicationSlots call it. Like:

void
ReplicationSlotMarkDirty(void)
{
MarkSlotDirty(MyReplicationSlot);
}
+static void
+MarkSlotDirty(ReplicationSlot *slot)
+{
+       Assert(slot != NULL);
+
+       SpinLockAcquire(&slot->mutex);
+       slot->just_dirtied = true;
+       slot->dirty = true;
+       SpinLockRelease(&slot->mutex);
+}
This is somewhat similar to the relation between ReplicationSlotSave(serialize
my backend's replications slot) and SaveSlotToPath(save the passed slot).

Another thing is we have already taken spin lock to access
last_confirmed_flush_lsn from CheckpointReplicationSlots, we could set dirty
flag here itself, else we will have to release the lock and call
ReplicationSlotMarkDirty which will take lock again.

Yes, this is unavoidable, but maybe it's not a big problem as
we only do it at shutdown.
True but still it doesn't look elegant. I also thought about having a
probably inline function that marks both just_dirty and dirty fields.
However, that requires us to assert that the caller has already
acquired a spinlock. I see a macro SpinLockFree() that might help but
it didn't seem to be used anywhere in the code so not sure if we can
rely on it.

Can't we just have code like this? I mean we will have to make
ReplicationSlotMarkDirty take slot as an argument or have another
version which takes slot as an argument and that would be called by us
as well as by ReplicationSlotMarkDirty(). I mean why do we need these
checks (s-(data.invalidated == RS_INVAL_NONE &&
s->data.confirmed_flush != s->last_saved_confirmed_flush) under the
mutex? Walsender is shutdown so confirmed flush LSN can not move
concurrently and slot can not be invalidated as well because that is
done by checkpointer and we are in checkpointer?

+ if (is_shutdown && SlotIsLogical(s))
+ {
+ if (s->data.invalidated == RS_INVAL_NONE &&
+ s->data.confirmed_flush != s->last_saved_confirmed_flush)
+ {
+ ReplicationSlotMarkDirty(s);
+ }
+
+ SpinLockRelease(&s->mutex);
+ }

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#42

Zhijie Hou (Fujitsu)

houzj.fnst@fujitsu.com

over 2 years ago

In reply to: Dilip Kumar (#41)

RE: persist logical slots to disk during shutdown checkpoint

On Tuesday, September 5, 2023 4:15 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Hi,

On Tue, Sep 5, 2023 at 10:12 AM Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Tue, Sep 5, 2023 at 7:54 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Monday, September 4, 2023 6:15 PM vignesh C

<vignesh21@gmail.com> wrote:

On Mon, 4 Sept 2023 at 15:20, Amit Kapila <amit.kapila16@gmail.com>

wrote:

On Fri, Sep 1, 2023 at 10:50 AM vignesh C <vignesh21@gmail.com>

wrote:

On Fri, 1 Sept 2023 at 10:06, Amit Kapila <amit.kapila16@gmail.com>

wrote:
I think we should also ensure that slots are not invalidated
(slot.data.invalidated != RS_INVAL_NONE) before marking them
dirty because we don't allow decoding from such slots, so we
shouldn't include those.

Added this check.

Apart from this I have also fixed the following issues that
were agreed on: a) Setting slots to dirty in
CheckPointReplicationSlots instead of setting it in
SaveSlotToPath
+ if (is_shutdown && SlotIsLogical(s)) {
+ SpinLockAcquire(&s->mutex); if (s->data.invalidated ==
+ RS_INVAL_NONE &&
+ s->data.confirmed_flush != s->last_saved_confirmed_flush)
+ s->dirty = true;
I think it is better to use ReplicationSlotMarkDirty() as that
would be consistent with all other usages.
ReplicationSlotMarkDirty works only on MyReplicationSlot whereas
CheckpointReplicationSlots loops through all the slots and marks
the appropriate slot as dirty, we might have to change
ReplicationSlotMarkDirty to take the slot as input parameter and all caller
should pass MyReplicationSlot.
Personally, I feel if we want to centralize the code of marking
dirty into a function, we can introduce a new static function
MarkSlotDirty(slot) to mark passed slot dirty and let
ReplicationSlotMarkDirty and CheckpointReplicationSlots call it. Like:

void
ReplicationSlotMarkDirty(void)
{
MarkSlotDirty(MyReplicationSlot); }
+static void
+MarkSlotDirty(ReplicationSlot *slot) {
+       Assert(slot != NULL);
+
+       SpinLockAcquire(&slot->mutex);
+       slot->just_dirtied = true;
+       slot->dirty = true;
+       SpinLockRelease(&slot->mutex); }
This is somewhat similar to the relation between
ReplicationSlotSave(serialize my backend's replications slot) and
SaveSlotToPath(save the passed slot).

Another thing is we have already taken spin lock to access
last_confirmed_flush_lsn from CheckpointReplicationSlots, we could
set dirty flag here itself, else we will have to release the lock
and call ReplicationSlotMarkDirty which will take lock again.

Yes, this is unavoidable, but maybe it's not a big problem as we
only do it at shutdown.

True but still it doesn't look elegant. I also thought about having a
probably inline function that marks both just_dirty and dirty fields.
However, that requires us to assert that the caller has already
acquired a spinlock. I see a macro SpinLockFree() that might help but
it didn't seem to be used anywhere in the code so not sure if we can
rely on it.

Can't we just have code like this? I mean we will have to make
ReplicationSlotMarkDirty take slot as an argument or have another version
which takes slot as an argument and that would be called by us as well as by
ReplicationSlotMarkDirty(). I mean why do we need these checks
(s-(data.invalidated == RS_INVAL_NONE &&
s->data.confirmed_flush != s->last_saved_confirmed_flush) under the
mutex? Walsender is shutdown so confirmed flush LSN can not move
concurrently and slot can not be invalidated as well because that is done by
checkpointer and we are in checkpointer?

I agree with your analysis that the lock may be unnecessary for now and the
code will work, but I personally feel we'd better take the spinlock.

Firstly, considering our discussion on the potential extension of persisting
the slot for online checkpoints in the future, we anyway need the lock at that
time, so taking the lock here could avoid overlooking the need to update it
later. And the lock also won't cause any performance or concurrency issue.

Additionally, if we don't take the lock, we rely on the assumption that the
walsender will exit before the shutdown checkpoint, currently, that's true for
logical walsender, but physical walsender can exit later than checkpointer. So,
I am slight woirred that if we change the logical walsender's exit timing in
the future, the assumption may not hold.

Besides, for non-built-in logical replication, if someone creates their own
walsender or other processes to send the changes and the process doesn't exit
before the shutdown checkpoint, it may also be a problem. Although I don't have
exsiting examples about these extensions, but I feel taking the lock would make
it more robust.

Best Regards,
Hou zj

#43

dilipbalaut@gmail.com

over 2 years ago

In reply to: Zhijie Hou (Fujitsu) (#42)

Re: persist logical slots to disk during shutdown checkpoint

On Tue, Sep 5, 2023 at 5:04 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

Can't we just have code like this? I mean we will have to make
ReplicationSlotMarkDirty take slot as an argument or have another version
which takes slot as an argument and that would be called by us as well as by
ReplicationSlotMarkDirty(). I mean why do we need these checks
(s-(data.invalidated == RS_INVAL_NONE &&
s->data.confirmed_flush != s->last_saved_confirmed_flush) under the
mutex? Walsender is shutdown so confirmed flush LSN can not move
concurrently and slot can not be invalidated as well because that is done by
checkpointer and we are in checkpointer?

I agree with your analysis that the lock may be unnecessary for now and the
code will work, but I personally feel we'd better take the spinlock.

Firstly, considering our discussion on the potential extension of persisting
the slot for online checkpoints in the future, we anyway need the lock at that
time, so taking the lock here could avoid overlooking the need to update it
later. And the lock also won't cause any performance or concurrency issue.

If we think that we might plan to persist on the online checkpoint as
well then better to do it now, because this is not a extension of the
feature instead we are thinking that it is wise to just persist on the
shutdown checkpoint and I think that's what the conclusion at this
point and if thats the conclusion then no point to right code in
assumption that we will change our conclusion in future.

Additionally, if we don't take the lock, we rely on the assumption that the
walsender will exit before the shutdown checkpoint, currently, that's true for
logical walsender, but physical walsender can exit later than checkpointer. So,
I am slight woirred that if we change the logical walsender's exit timing in
the future, the assumption may not hold.

Besides, for non-built-in logical replication, if someone creates their own
walsender or other processes to send the changes and the process doesn't exit
before the shutdown checkpoint, it may also be a problem. Although I don't have
exsiting examples about these extensions, but I feel taking the lock would make
it more robust.

I think our all logic is based on that the walsender is existed
already. If not then even if you check under the mutex that the
confirmed flush LSN is not changed then it can changed right after you
release the lock and then we will not be flushing the latest update of
the confirmed flush lsn to the disk and our logic of comparing
checkpoint.redo with the confirmed flush lsn might not work?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#44

amit.kapila16@gmail.com

over 2 years ago

In reply to: Dilip Kumar (#43)

Re: persist logical slots to disk during shutdown checkpoint

On Tue, Sep 5, 2023 at 6:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Sep 5, 2023 at 5:04 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

Can't we just have code like this? I mean we will have to make
ReplicationSlotMarkDirty take slot as an argument or have another version
which takes slot as an argument and that would be called by us as well as by
ReplicationSlotMarkDirty(). I mean why do we need these checks
(s-(data.invalidated == RS_INVAL_NONE &&
s->data.confirmed_flush != s->last_saved_confirmed_flush) under the
mutex? Walsender is shutdown so confirmed flush LSN can not move
concurrently and slot can not be invalidated as well because that is done by
checkpointer and we are in checkpointer?

...

Additionally, if we don't take the lock, we rely on the assumption that the
walsender will exit before the shutdown checkpoint, currently, that's true for
logical walsender, but physical walsender can exit later than checkpointer. So,
I am slight woirred that if we change the logical walsender's exit timing in
the future, the assumption may not hold.

Besides, for non-built-in logical replication, if someone creates their own
walsender or other processes to send the changes and the process doesn't exit
before the shutdown checkpoint, it may also be a problem. Although I don't have
exsiting examples about these extensions, but I feel taking the lock would make
it more robust.

I think our all logic is based on that the walsender is existed
already. If not then even if you check under the mutex that the
confirmed flush LSN is not changed then it can changed right after you
release the lock and then we will not be flushing the latest update of
the confirmed flush lsn to the disk and our logic of comparing
checkpoint.redo with the confirmed flush lsn might not work?

Right, it can change and in that case, the check related to
confirm_flush LSN will fail during the upgrade. However, the point is
that if we don't take spinlock, we need to properly write comments on
why unlike in other places it is safe here to check these values
without spinlock. We can do that but I feel we have to be careful for
all future usages of these variables, so, having spinlock makes them
follow the normal coding pattern which I feel makes it more robust.
Yes, marking dirty via common function also has merits but personally,
I find it better to follow the normal coding practice of checking the
required fields under spinlock. The other possibility is to first
check if we need to mark the slot dirty under spinlock, then release
the spinlock, and then call the common MarkDirty function, but again
that will be under the assumption that these flags won't change.

--
With Regards,
Amit Kapila.

#45

dilipbalaut@gmail.com

over 2 years ago

In reply to: Amit Kapila (#44)

Re: persist logical slots to disk during shutdown checkpoint

On Wed, Sep 6, 2023 at 9:47 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Sep 5, 2023 at 6:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Right, it can change and in that case, the check related to
confirm_flush LSN will fail during the upgrade. However, the point is
that if we don't take spinlock, we need to properly write comments on
why unlike in other places it is safe here to check these values
without spinlock.

I agree with that, but now also it is not true that we are alway
reading this under the spin lock for example[1]StartLogicalReplication { /* * Report the location after which we'll send out further commits as the * current sentPtr. */ sentPtr = MyReplicationSlot->data.confirmed_flush; }[2]LogicalIncreaseRestartDecodingForSlot { /* candidates are already valid with the current flush position, apply */ if (updated_lsn) LogicalConfirmReceivedLocation(slot->data.confirmed_flush); }, we can see we are
reading this without spin lock.
[1]: StartLogicalReplication { /* * Report the location after which we'll send out further commits as the * current sentPtr. */ sentPtr = MyReplicationSlot->data.confirmed_flush; }
StartLogicalReplication
{
/*
* Report the location after which we'll send out further commits as the
* current sentPtr.
*/
sentPtr = MyReplicationSlot->data.confirmed_flush;
}
[2]: LogicalIncreaseRestartDecodingForSlot { /* candidates are already valid with the current flush position, apply */ if (updated_lsn) LogicalConfirmReceivedLocation(slot->data.confirmed_flush); }
LogicalIncreaseRestartDecodingForSlot
{
/* candidates are already valid with the current flush position, apply */
if (updated_lsn)
LogicalConfirmReceivedLocation(slot->data.confirmed_flush);
}

We can do that but I feel we have to be careful for

all future usages of these variables, so, having spinlock makes them
follow the normal coding pattern which I feel makes it more robust.
Yes, marking dirty via common function also has merits but personally,
I find it better to follow the normal coding practice of checking the
required fields under spinlock. The other possibility is to first
check if we need to mark the slot dirty under spinlock, then release
the spinlock, and then call the common MarkDirty function, but again
that will be under the assumption that these flags won't change.

Thats true, but we are already making the assumption because now also
we are taking the spinlock and taking a decision of marking the slot
dirty. And after that we are releasing the spin lock and if we do not
have guarantee that it can not concurrently change the many things can
go wrong no?

Anyway said that, I do not have any strong objection against what we
are doing now. There were discussion around making the code so that
it can use common function and I was suggesting how it could be
achieved but I am not against the current way either.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#46

amit.kapila16@gmail.com

over 2 years ago

In reply to: Dilip Kumar (#45)

Re: persist logical slots to disk during shutdown checkpoint

On Wed, Sep 6, 2023 at 9:57 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Sep 6, 2023 at 9:47 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Sep 5, 2023 at 6:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Right, it can change and in that case, the check related to
confirm_flush LSN will fail during the upgrade. However, the point is
that if we don't take spinlock, we need to properly write comments on
why unlike in other places it is safe here to check these values
without spinlock.

I agree with that, but now also it is not true that we are alway
reading this under the spin lock for example[1][2], we can see we are
reading this without spin lock.
[1]
StartLogicalReplication
{
/*
* Report the location after which we'll send out further commits as the
* current sentPtr.
*/
sentPtr = MyReplicationSlot->data.confirmed_flush;
}
[2]
LogicalIncreaseRestartDecodingForSlot
{
/* candidates are already valid with the current flush position, apply */
if (updated_lsn)
LogicalConfirmReceivedLocation(slot->data.confirmed_flush);
}

These are accessed only in walsender and confirmed_flush is always
updated by walsender. So, this is clearly okay.

We can do that but I feel we have to be careful for

all future usages of these variables, so, having spinlock makes them
follow the normal coding pattern which I feel makes it more robust.
Yes, marking dirty via common function also has merits but personally,
I find it better to follow the normal coding practice of checking the
required fields under spinlock. The other possibility is to first
check if we need to mark the slot dirty under spinlock, then release
the spinlock, and then call the common MarkDirty function, but again
that will be under the assumption that these flags won't change.

Thats true, but we are already making the assumption because now also
we are taking the spinlock and taking a decision of marking the slot
dirty. And after that we are releasing the spin lock and if we do not
have guarantee that it can not concurrently change the many things can
go wrong no?

Also, note that invalidated field could be updated by startup process
but that is only possible on standby, so it is safe but again that
would be another assumption.

Anyway said that, I do not have any strong objection against what we
are doing now. There were discussion around making the code so that
it can use common function and I was suggesting how it could be
achieved but I am not against the current way either.

Okay, thanks for looking into it.

--
With Regards,
Amit Kapila.

#47

dilipbalaut@gmail.com

over 2 years ago

In reply to: Amit Kapila (#46)

Re: persist logical slots to disk during shutdown checkpoint

On Wed, Sep 6, 2023 at 12:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Sep 6, 2023 at 9:57 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Sep 6, 2023 at 9:47 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Sep 5, 2023 at 6:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Right, it can change and in that case, the check related to
confirm_flush LSN will fail during the upgrade. However, the point is
that if we don't take spinlock, we need to properly write comments on
why unlike in other places it is safe here to check these values
without spinlock.

I agree with that, but now also it is not true that we are alway
reading this under the spin lock for example[1][2], we can see we are
reading this without spin lock.
[1]
StartLogicalReplication
{
/*
* Report the location after which we'll send out further commits as the
* current sentPtr.
*/
sentPtr = MyReplicationSlot->data.confirmed_flush;
}
[2]
LogicalIncreaseRestartDecodingForSlot
{
/* candidates are already valid with the current flush position, apply */
if (updated_lsn)
LogicalConfirmReceivedLocation(slot->data.confirmed_flush);
}

These are accessed only in walsender and confirmed_flush is always
updated by walsender. So, this is clearly okay.

Hmm, that's a valid point.

We can do that but I feel we have to be careful for

all future usages of these variables, so, having spinlock makes them
follow the normal coding pattern which I feel makes it more robust.
Yes, marking dirty via common function also has merits but personally,
I find it better to follow the normal coding practice of checking the
required fields under spinlock. The other possibility is to first
check if we need to mark the slot dirty under spinlock, then release
the spinlock, and then call the common MarkDirty function, but again
that will be under the assumption that these flags won't change.

Thats true, but we are already making the assumption because now also
we are taking the spinlock and taking a decision of marking the slot
dirty. And after that we are releasing the spin lock and if we do not
have guarantee that it can not concurrently change the many things can
go wrong no?

Also, note that invalidated field could be updated by startup process
but that is only possible on standby, so it is safe but again that
would be another assumption.

Okay, so I also agree to go with the current patch. Because as you
said above if we access this without a spin lock outside walsender
then we will be making a new exception and I agree with that decision
of not making the new exception.

Other than that the patch LGTM.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#48

vignesh21@gmail.com

over 2 years ago

In reply to: Dilip Kumar (#47)

1 attachment(s)

Re: persist logical slots to disk during shutdown checkpoint

On Wed, 6 Sept 2023 at 12:09, Dilip Kumar <dilipbalaut@gmail.com> wrote:

Other than that the patch LGTM.

pgindent reported that the new comments added need to be re-adjusted,
here is an updated patch for the same.
I also verified the following: a) patch applies neatly on HEAD b) make
check-world passes c) pgindent looks good d) pgperltiy was fine e)
meson test runs were successful. Also checked that CFBot run was fine
for the last patch.

Regards,
Vignesh

Attachments:

v9-0001-Persist-logical-slots-to-disk-during-a-shutdown-c.patchtext/x-patch; charset=US-ASCII; name=v9-0001-Persist-logical-slots-to-disk-during-a-shutdown-c.patchDownload

From d748cd6d78c2ae5fc38552166572834876381336 Mon Sep 17 00:00:00 2001
From: Julien Rouhaud <julien.rouhaud@free.fr>
Date: Fri, 14 Apr 2023 13:49:09 +0800
Subject: [PATCH v9] Persist logical slots to disk during a shutdown checkpoint
 if required.

It's entirely possible for a logical slot to have a confirmed_flush LSN
higher than the last value saved on disk while not being marked as dirty.
Currently, it is not a major problem but a later patch adding support for
the upgrade of slots relies on that value being properly persisted to disk.

It can also help avoid processing the same transactions again in some
boundary cases after the clean shutdown and restart. Say, we process some
transactions for which we didn't send anything downstream (the changes got
filtered) but the confirm_flush LSN is updated due to keepalives. As we
don't flush the latest value of confirm_flush LSN, it may lead to
processing the same changes again without this patch.

Author: Vignesh C, Julien Rouhaud, Kuroda Hayato based on suggestions by
Ashutosh Bapat
Reviewed-by: Amit Kapila, Peter Smith, Ashutosh Bapat
Discussion: http://postgr.es/m/CAA4eK1JzJagMmb_E8D4au=GYQkxox0AfNBm1FbP7sy7t4YWXPQ@mail.gmail.com
Discussion: http://postgr.es/m/TYAPR01MB58664C81887B3AF2EB6B16E3F5939@TYAPR01MB5866.jpnprd01.prod.outlook.com
---
 src/backend/access/transam/xlog.c             |   2 +-
 src/backend/replication/slot.c                |  34 +++++-
 src/include/replication/slot.h                |   5 +-
 src/test/recovery/meson.build                 |   1 +
 .../t/038_save_logical_slots_shutdown.pl      | 102 ++++++++++++++++++
 5 files changed, 139 insertions(+), 5 deletions(-)
 create mode 100644 src/test/recovery/t/038_save_logical_slots_shutdown.pl

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f6f8adc72a..f26c8d18a6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7039,7 +7039,7 @@ static void
 CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 {
 	CheckPointRelationMap();
-	CheckPointReplicationSlots();
+	CheckPointReplicationSlots(flags & CHECKPOINT_IS_SHUTDOWN);
 	CheckPointSnapBuild();
 	CheckPointLogicalRewriteHeap();
 	CheckPointReplicationOrigin();
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index bb09c4010f..2d22b4e956 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -321,6 +321,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->candidate_xmin_lsn = InvalidXLogRecPtr;
 	slot->candidate_restart_valid = InvalidXLogRecPtr;
 	slot->candidate_restart_lsn = InvalidXLogRecPtr;
+	slot->last_saved_confirmed_flush = InvalidXLogRecPtr;
 
 	/*
 	 * Create the slot on disk.  We haven't actually marked the slot allocated
@@ -1573,10 +1574,10 @@ restart:
  * Flush all replication slots to disk.
  *
  * This needn't actually be part of a checkpoint, but it's a convenient
- * location.
+ * location. is_shutdown is true in case of a shutdown checkpoint.
  */
 void
-CheckPointReplicationSlots(void)
+CheckPointReplicationSlots(bool is_shutdown)
 {
 	int			i;
 
@@ -1601,6 +1602,31 @@ CheckPointReplicationSlots(void)
 
 		/* save the slot to disk, locking is handled in SaveSlotToPath() */
 		sprintf(path, "pg_replslot/%s", NameStr(s->data.name));
+
+		/*
+		 * We won't ensure that the slot is persisted after the
+		 * confirmed_flush LSN is updated as that could lead to frequent
+		 * writes.  However, we need to ensure that we do persist the slots at
+		 * the time of shutdown whose confirmed_flush LSN is changed since we
+		 * last saved the slot to disk. This will help in avoiding retreat of
+		 * the confirmed_flush LSN after restart. At other times, the
+		 * walsender keeps saving the slot from time to time as the
+		 * replication progresses, so there is no clear advantage of flushing
+		 * additional slots at the time of checkpoint.
+		 */
+		if (is_shutdown && SlotIsLogical(s))
+		{
+			SpinLockAcquire(&s->mutex);
+			if (s->data.invalidated == RS_INVAL_NONE &&
+				s->data.confirmed_flush != s->last_saved_confirmed_flush)
+			{
+				s->just_dirtied = true;
+				s->dirty = true;
+			}
+
+			SpinLockRelease(&s->mutex);
+		}
+
 		SaveSlotToPath(s, path, LOG);
 	}
 	LWLockRelease(ReplicationSlotAllocationLock);
@@ -1873,11 +1899,12 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 
 	/*
 	 * Successfully wrote, unset dirty bit, unless somebody dirtied again
-	 * already.
+	 * already and remember the confirmed_flush LSN value.
 	 */
 	SpinLockAcquire(&slot->mutex);
 	if (!slot->just_dirtied)
 		slot->dirty = false;
+	slot->last_saved_confirmed_flush = cp.slotdata.confirmed_flush;
 	SpinLockRelease(&slot->mutex);
 
 	LWLockRelease(&slot->io_in_progress_lock);
@@ -2074,6 +2101,7 @@ RestoreSlotFromDisk(const char *name)
 		/* initialize in memory state */
 		slot->effective_xmin = cp.slotdata.xmin;
 		slot->effective_catalog_xmin = cp.slotdata.catalog_xmin;
+		slot->last_saved_confirmed_flush = cp.slotdata.confirmed_flush;
 
 		slot->candidate_catalog_xmin = InvalidTransactionId;
 		slot->candidate_xmin_lsn = InvalidXLogRecPtr;
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index a8a89dc784..da8978342a 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -178,6 +178,9 @@ typedef struct ReplicationSlot
 	XLogRecPtr	candidate_xmin_lsn;
 	XLogRecPtr	candidate_restart_valid;
 	XLogRecPtr	candidate_restart_lsn;
+
+	/* This is used to track the last saved confirmed_flush LSN value */
+	XLogRecPtr	last_saved_confirmed_flush;
 } ReplicationSlot;
 
 #define SlotIsPhysical(slot) ((slot)->data.database == InvalidOid)
@@ -241,7 +244,7 @@ extern void ReplicationSlotNameForTablesync(Oid suboid, Oid relid, char *syncslo
 extern void ReplicationSlotDropAtPubNode(WalReceiverConn *wrconn, char *slotname, bool missing_ok);
 
 extern void StartupReplicationSlots(void);
-extern void CheckPointReplicationSlots(void);
+extern void CheckPointReplicationSlots(bool is_shutdown);
 
 extern void CheckSlotRequirements(void);
 extern void CheckSlotPermissions(void);
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index e7328e4894..646d6ffde4 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -43,6 +43,7 @@ tests += {
       't/035_standby_logical_decoding.pl',
       't/036_truncated_dropped.pl',
       't/037_invalid_database.pl',
+      't/038_save_logical_slots_shutdown.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/038_save_logical_slots_shutdown.pl b/src/test/recovery/t/038_save_logical_slots_shutdown.pl
new file mode 100644
index 0000000000..224a840a61
--- /dev/null
+++ b/src/test/recovery/t/038_save_logical_slots_shutdown.pl
@@ -0,0 +1,102 @@
+
+# Copyright (c) 2023, PostgreSQL Global Development Group
+
+# Test logical replication slots are always persisted to disk during a shutdown
+# checkpoint.
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+sub compare_confirmed_flush
+{
+	my ($node, $confirmed_flush_from_log) = @_;
+
+	# Fetch Latest checkpoint location from the control file
+	my ($stdout, $stderr) =
+	  run_command([ 'pg_controldata', $node->data_dir ]);
+	my @control_data      = split("\n", $stdout);
+	my $latest_checkpoint = undef;
+	foreach (@control_data)
+	{
+		if ($_ =~ /^Latest checkpoint location:\s*(.*)$/mg)
+		{
+			$latest_checkpoint = $1;
+			last;
+		}
+	}
+	die "Latest checkpoint location not found in control file\n"
+	  unless defined($latest_checkpoint);
+
+	# Is it same as the value read from log?
+	ok( $latest_checkpoint eq $confirmed_flush_from_log,
+		"Check that the slot's confirmed_flush LSN is the same as the latest_checkpoint location"
+	);
+
+	return;
+}
+
+# Initialize publisher node
+my $node_publisher = PostgreSQL::Test::Cluster->new('pub');
+$node_publisher->init(allows_streaming => 'logical');
+# Avoid checkpoint during the test, otherwise, the latest checkpoint location
+# will change.
+$node_publisher->append_conf(
+	'postgresql.conf', q{
+checkpoint_timeout = 1h
+autovacuum = off
+});
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = PostgreSQL::Test::Cluster->new('sub');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create tables
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tbl (id int)");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tbl (id int)");
+
+# Insert some data
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tbl VALUES (generate_series(1, 5));");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION pub FOR ALL TABLES");
+$node_subscriber->safe_psql('postgres',
+	"CREATE SUBSCRIPTION sub CONNECTION '$publisher_connstr' PUBLICATION pub"
+);
+
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'sub');
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tbl");
+
+is($result, qq(5), "check initial copy was done");
+
+my $offset = -s $node_publisher->logfile;
+
+# Restart the publisher to ensure that the slot will be persisted if required
+$node_publisher->restart();
+
+# Wait until the walsender creates decoding context
+$node_publisher->wait_for_log(
+	qr/Streaming transactions committing after ([A-F0-9]+\/[A-F0-9]+), reading WAL from ([A-F0-9]+\/[A-F0-9]+)./,
+	$offset);
+
+# Extract confirmed_flush from the logfile
+my $log_contents = slurp_file($node_publisher->logfile, $offset);
+$log_contents =~
+  qr/Streaming transactions committing after ([A-F0-9]+\/[A-F0-9]+), reading WAL from ([A-F0-9]+\/[A-F0-9]+)./
+  or die "could not get confirmed_flush_lsn";
+
+# Ensure that the slot's confirmed_flush LSN is the same as the
+# latest_checkpoint location.
+compare_confirmed_flush($node_publisher, $1);
+
+done_testing();
-- 
2.34.1

#49

amit.kapila16@gmail.com

over 2 years ago

In reply to: vignesh C (#48)

Re: persist logical slots to disk during shutdown checkpoint

On Thu, Sep 7, 2023 at 10:13 AM vignesh C <vignesh21@gmail.com> wrote:

On Wed, 6 Sept 2023 at 12:09, Dilip Kumar <dilipbalaut@gmail.com> wrote:

Other than that the patch LGTM.

pgindent reported that the new comments added need to be re-adjusted,
here is an updated patch for the same.

Thanks, the patch looks good to me as well.

--
With Regards,
Amit Kapila.

#50

michael@paquier.xyz

over 2 years ago

In reply to: Amit Kapila (#49)

Re: persist logical slots to disk during shutdown checkpoint

On Thu, Sep 07, 2023 at 11:56:28AM +0530, Amit Kapila wrote:

Thanks, the patch looks good to me as well.

+   /* This is used to track the last saved confirmed_flush LSN value */
+   XLogRecPtr  last_saved_confirmed_flush;

This does not feel sufficient, as the comment explaining what this
variable does uses the same terms as the variable name (aka it is the
last save of the confirmed_lsn). Why it it here and why it is useful?
In which context and/or code paths is it used? Okay, there are some
explanations when saving a slot, restoring a slot or when a checkpoint
processes slots, but it seems important to me to document more things
in ReplicationSlot where this is defined.

(Just passing by, I have not checked the patch logic in details but
that looks under-documented to me.)
--
Michael

#51

[1]: /messages/by-id/CAExHW5uXq_CU80XJtmWbPJinRjJx54kbQJ9DT=UFySKXpjVwrw@mail.gmail.com

amit.kapila16@gmail.com

over 2 years ago

In reply to: Michael Paquier (#50)

Re: persist logical slots to disk during shutdown checkpoint

On Thu, Sep 7, 2023 at 1:18 PM Michael Paquier <michael@paquier.xyz> wrote:

On Thu, Sep 07, 2023 at 11:56:28AM +0530, Amit Kapila wrote:

Thanks, the patch looks good to me as well.
+   /* This is used to track the last saved confirmed_flush LSN value */
+   XLogRecPtr  last_saved_confirmed_flush;
This does not feel sufficient, as the comment explaining what this
variable does uses the same terms as the variable name (aka it is the
last save of the confirmed_lsn). Why it it here and why it is useful?
In which context and/or code paths is it used? Okay, there are some
explanations when saving a slot, restoring a slot or when a checkpoint
processes slots, but it seems important to me to document more things
in ReplicationSlot where this is defined.

Hmm, this is quite debatable as different people feel differently
about this. The patch author kept it where it is now but in one of my
revisions, I rewrote and added it in the ReplicationSlot. Then
Ashutosh argued that it is better to keep it near where we are saving
the slot (aka where the patch has) [1]/messages/by-id/CAExHW5uXq_CU80XJtmWbPJinRjJx54kbQJ9DT=UFySKXpjVwrw@mail.gmail.com. Anyway, as I also preferred
the core part of the theory about this variable to be in
ReplicationSlot, so I'll move it there before commit unless someone
argues against it or has any other comments.

--
With Regards,
Amit Kapila.

#52

ashutosh.bapat.oss@gmail.com

over 2 years ago

In reply to: Amit Kapila (#51)

Re: persist logical slots to disk during shutdown checkpoint

On Thu, Sep 7, 2023 at 1:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Sep 7, 2023 at 1:18 PM Michael Paquier <michael@paquier.xyz> wrote:
On Thu, Sep 07, 2023 at 11:56:28AM +0530, Amit Kapila wrote:

Thanks, the patch looks good to me as well.
+   /* This is used to track the last saved confirmed_flush LSN value */
+   XLogRecPtr  last_saved_confirmed_flush;
This does not feel sufficient, as the comment explaining what this
variable does uses the same terms as the variable name (aka it is the
last save of the confirmed_lsn). Why it it here and why it is useful?
In which context and/or code paths is it used? Okay, there are some
explanations when saving a slot, restoring a slot or when a checkpoint
processes slots, but it seems important to me to document more things
in ReplicationSlot where this is defined.
Hmm, this is quite debatable as different people feel differently
about this. The patch author kept it where it is now but in one of my
revisions, I rewrote and added it in the ReplicationSlot. Then
Ashutosh argued that it is better to keep it near where we are saving
the slot (aka where the patch has) [1]. Anyway, as I also preferred
the core part of the theory about this variable to be in
ReplicationSlot, so I'll move it there before commit unless someone
argues against it or has any other comments.

If we want it to be in ReplicationSlot, I suggest we just say, - saves
last confirmed flush LSN to detect any divergence in the in-memory and
on-disk confirmed flush LSN cheaply.

When to detect that divergence and what to do when there is divergence
should be document at relevant places in the code. In future if we
expand the When and How we use this variable, the comment in
ReplicationSlot will be insufficient.

We follow this commenting style at several places e.g.
/* any outstanding modifications? */
bool just_dirtied;
bool dirty;

how and when these variables are used is commented upon in the relevant code.

  * This needn't actually be part of a checkpoint, but it's a convenient
- * location.
+ * location. is_shutdown is true in case of a shutdown checkpoint.

Relying on the first sentence, if we decide to not persist the
replication slot at the time of checkpoint, would that be OK? It
doesn't look like a convenience thing to me any more.

--
Best Wishes,
Ashutosh Bapat

#53

amit.kapila16@gmail.com

over 2 years ago

In reply to: Ashutosh Bapat (#52)

Re: persist logical slots to disk during shutdown checkpoint

On Thu, Sep 7, 2023 at 3:38 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

* This needn't actually be part of a checkpoint, but it's a convenient
- * location.
+ * location. is_shutdown is true in case of a shutdown checkpoint.
Relying on the first sentence, if we decide to not persist the
replication slot at the time of checkpoint, would that be OK? It
doesn't look like a convenience thing to me any more.

Instead of removing that comment, how about something like this: "This
needn't actually be part of a checkpoint except for shutdown
checkpoint, but it's a convenient location."?

--
With Regards,
Amit Kapila.

#54

ashutosh.bapat.oss@gmail.com

over 2 years ago

In reply to: Amit Kapila (#53)

Re: persist logical slots to disk during shutdown checkpoint

On Thu, Sep 7, 2023 at 4:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Sep 7, 2023 at 3:38 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
* This needn't actually be part of a checkpoint, but it's a convenient
- * location.
+ * location. is_shutdown is true in case of a shutdown checkpoint.
Relying on the first sentence, if we decide to not persist the
replication slot at the time of checkpoint, would that be OK? It
doesn't look like a convenience thing to me any more.
Instead of removing that comment, how about something like this: "This
needn't actually be part of a checkpoint except for shutdown
checkpoint, but it's a convenient location."?

I find the wording a bit awkward. My version would be "Checkpoint is a
convenient location to persist all the slots. But in a shutdown
checkpoint, indicated by is_shutdown = true, we also update
confirmed_flush." But please feel free to choose whichever version you
are comfortable with.

--
Best Wishes,
Ashutosh Bapat

#55

amit.kapila16@gmail.com

over 2 years ago

In reply to: Ashutosh Bapat (#54)

Re: persist logical slots to disk during shutdown checkpoint

On Thu, Sep 7, 2023 at 4:30 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

On Thu, Sep 7, 2023 at 4:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Sep 7, 2023 at 3:38 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
* This needn't actually be part of a checkpoint, but it's a convenient
- * location.
+ * location. is_shutdown is true in case of a shutdown checkpoint.
Relying on the first sentence, if we decide to not persist the
replication slot at the time of checkpoint, would that be OK? It
doesn't look like a convenience thing to me any more.
Instead of removing that comment, how about something like this: "This
needn't actually be part of a checkpoint except for shutdown
checkpoint, but it's a convenient location."?
I find the wording a bit awkward. My version would be "Checkpoint is a
convenient location to persist all the slots. But in a shutdown
checkpoint, indicated by is_shutdown = true, we also update
confirmed_flush." But please feel free to choose whichever version you
are comfortable with.

I think saying we also update confirmed_flush appears unclear to me.
So, I tried another version by changing the entire comment to:
"Normally, we can flush dirty replication slots at regular intervals
by any background process like bgwriter but checkpoint is a convenient
location to persist. Additionally, in case of a shutdown checkpoint,
we also identify the slots for which confirmed_flush has been updated
since the last time it persisted and flush them."

--
With Regards,
Amit Kapila.

#56

amit.kapila16@gmail.com

over 2 years ago

In reply to: Ashutosh Bapat (#52)

1 attachment(s)

Re: persist logical slots to disk during shutdown checkpoint

On Thu, Sep 7, 2023 at 3:38 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

On Thu, Sep 7, 2023 at 1:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Sep 7, 2023 at 1:18 PM Michael Paquier <michael@paquier.xyz> wrote:
On Thu, Sep 07, 2023 at 11:56:28AM +0530, Amit Kapila wrote:

Thanks, the patch looks good to me as well.
+   /* This is used to track the last saved confirmed_flush LSN value */
+   XLogRecPtr  last_saved_confirmed_flush;
This does not feel sufficient, as the comment explaining what this
variable does uses the same terms as the variable name (aka it is the
last save of the confirmed_lsn). Why it it here and why it is useful?
In which context and/or code paths is it used? Okay, there are some
explanations when saving a slot, restoring a slot or when a checkpoint
processes slots, but it seems important to me to document more things
in ReplicationSlot where this is defined.
Hmm, this is quite debatable as different people feel differently
about this. The patch author kept it where it is now but in one of my
revisions, I rewrote and added it in the ReplicationSlot. Then
Ashutosh argued that it is better to keep it near where we are saving
the slot (aka where the patch has) [1]. Anyway, as I also preferred
the core part of the theory about this variable to be in
ReplicationSlot, so I'll move it there before commit unless someone
argues against it or has any other comments.
If we want it to be in ReplicationSlot, I suggest we just say, - saves
last confirmed flush LSN to detect any divergence in the in-memory and
on-disk confirmed flush LSN cheaply.

I have added something on these lines and also changed the other
comment pointed out by you. In the passing, I made minor cosmetic
changes as well.

--
With Regards,
Amit Kapila.

Attachments:

v10-0001-Persist-logical-slots-to-disk-during-a-shutdown-.patchapplication/octet-stream; name=v10-0001-Persist-logical-slots-to-disk-during-a-shutdown-.patchDownload

From 70496bd3d16a71b06c38bb7dac528436a3c459cf Mon Sep 17 00:00:00 2001
From: Julien Rouhaud <julien.rouhaud@free.fr>
Date: Fri, 14 Apr 2023 13:49:09 +0800
Subject: [PATCH v10] Persist logical slots to disk during a shutdown
 checkpoint if required.

It's entirely possible for a logical slot to have a confirmed_flush LSN
higher than the last value saved on disk while not being marked as dirty.
Currently, it is not a major problem but a later patch adding support for
the upgrade of slots relies on that value being properly persisted to disk.

It can also help avoid processing the same transactions again in some
boundary cases after the clean shutdown and restart. Say, we process some
transactions for which we didn't send anything downstream (the changes got
filtered) but the confirm_flush LSN is updated due to keepalives. As we
don't flush the latest value of confirm_flush LSN, it may lead to
processing the same changes again without this patch.

Author: Vignesh C, Julien Rouhaud, Kuroda Hayato based on suggestions by
Ashutosh Bapat
Reviewed-by: Amit Kapila, Dilip Kumar, Ashutosh Bapat, Peter Smith
Discussion: http://postgr.es/m/CAA4eK1JzJagMmb_E8D4au=GYQkxox0AfNBm1FbP7sy7t4YWXPQ@mail.gmail.com
Discussion: http://postgr.es/m/TYAPR01MB58664C81887B3AF2EB6B16E3F5939@TYAPR01MB5866.jpnprd01.prod.outlook.com
---
 src/backend/access/transam/xlog.c             |   2 +-
 src/backend/replication/slot.c                |  38 ++++++-
 src/include/replication/slot.h                |   8 +-
 src/test/recovery/meson.build                 |   1 +
 .../t/038_save_logical_slots_shutdown.pl      | 102 ++++++++++++++++++
 5 files changed, 145 insertions(+), 6 deletions(-)
 create mode 100644 src/test/recovery/t/038_save_logical_slots_shutdown.pl

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f6f8adc72a..f26c8d18a6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7039,7 +7039,7 @@ static void
 CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 {
 	CheckPointRelationMap();
-	CheckPointReplicationSlots();
+	CheckPointReplicationSlots(flags & CHECKPOINT_IS_SHUTDOWN);
 	CheckPointSnapBuild();
 	CheckPointLogicalRewriteHeap();
 	CheckPointReplicationOrigin();
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index bb09c4010f..5a6d376cfa 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -321,6 +321,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->candidate_xmin_lsn = InvalidXLogRecPtr;
 	slot->candidate_restart_valid = InvalidXLogRecPtr;
 	slot->candidate_restart_lsn = InvalidXLogRecPtr;
+	slot->last_saved_confirmed_flush = InvalidXLogRecPtr;
 
 	/*
 	 * Create the slot on disk.  We haven't actually marked the slot allocated
@@ -1572,11 +1573,14 @@ restart:
 /*
  * Flush all replication slots to disk.
  *
- * This needn't actually be part of a checkpoint, but it's a convenient
- * location.
+ * We can flush dirty replication slots at regular intervals by any
+ * background process like bgwriter but checkpoint is a convenient location.
+ * Additionally, in case of a shutdown checkpoint, we also identify the
+ * slots for which confirmed_flush LSN has been updated since the last time
+ * it persisted and flush them.
  */
 void
-CheckPointReplicationSlots(void)
+CheckPointReplicationSlots(bool is_shutdown)
 {
 	int			i;
 
@@ -1601,6 +1605,30 @@ CheckPointReplicationSlots(void)
 
 		/* save the slot to disk, locking is handled in SaveSlotToPath() */
 		sprintf(path, "pg_replslot/%s", NameStr(s->data.name));
+
+		/*
+		 * We won't ensure that the slot is persisted after the
+		 * confirmed_flush LSN is updated as that could lead to frequent
+		 * writes. However, we need to ensure that we do persist the slots at
+		 * the time of shutdown whose confirmed_flush LSN is changed since we
+		 * last saved the slot to disk. This will help in avoiding retreat of
+		 * the confirmed_flush LSN after restart. At other times, the
+		 * walsender keeps saving the slot from time to time as the
+		 * replication progresses, so there is no clear advantage of flushing
+		 * additional slots at the time of checkpoint.
+		 */
+		if (is_shutdown && SlotIsLogical(s))
+		{
+			SpinLockAcquire(&s->mutex);
+			if (s->data.invalidated == RS_INVAL_NONE &&
+				s->data.confirmed_flush != s->last_saved_confirmed_flush)
+			{
+				s->just_dirtied = true;
+				s->dirty = true;
+			}
+			SpinLockRelease(&s->mutex);
+		}
+
 		SaveSlotToPath(s, path, LOG);
 	}
 	LWLockRelease(ReplicationSlotAllocationLock);
@@ -1873,11 +1901,12 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 
 	/*
 	 * Successfully wrote, unset dirty bit, unless somebody dirtied again
-	 * already.
+	 * already and remember the confirmed_flush LSN value.
 	 */
 	SpinLockAcquire(&slot->mutex);
 	if (!slot->just_dirtied)
 		slot->dirty = false;
+	slot->last_saved_confirmed_flush = cp.slotdata.confirmed_flush;
 	SpinLockRelease(&slot->mutex);
 
 	LWLockRelease(&slot->io_in_progress_lock);
@@ -2074,6 +2103,7 @@ RestoreSlotFromDisk(const char *name)
 		/* initialize in memory state */
 		slot->effective_xmin = cp.slotdata.xmin;
 		slot->effective_catalog_xmin = cp.slotdata.catalog_xmin;
+		slot->last_saved_confirmed_flush = cp.slotdata.confirmed_flush;
 
 		slot->candidate_catalog_xmin = InvalidTransactionId;
 		slot->candidate_xmin_lsn = InvalidXLogRecPtr;
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index a8a89dc784..4c4fe10e57 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -178,6 +178,12 @@ typedef struct ReplicationSlot
 	XLogRecPtr	candidate_xmin_lsn;
 	XLogRecPtr	candidate_restart_valid;
 	XLogRecPtr	candidate_restart_lsn;
+
+	/*
+	 * This is used to track the last persisted confirmed_flush LSN value to
+	 * detect any divergence in the in-memory and on-disk values for the same.
+	 */
+	XLogRecPtr	last_saved_confirmed_flush;
 } ReplicationSlot;
 
 #define SlotIsPhysical(slot) ((slot)->data.database == InvalidOid)
@@ -241,7 +247,7 @@ extern void ReplicationSlotNameForTablesync(Oid suboid, Oid relid, char *syncslo
 extern void ReplicationSlotDropAtPubNode(WalReceiverConn *wrconn, char *slotname, bool missing_ok);
 
 extern void StartupReplicationSlots(void);
-extern void CheckPointReplicationSlots(void);
+extern void CheckPointReplicationSlots(bool is_shutdown);
 
 extern void CheckSlotRequirements(void);
 extern void CheckSlotPermissions(void);
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index e7328e4894..646d6ffde4 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -43,6 +43,7 @@ tests += {
       't/035_standby_logical_decoding.pl',
       't/036_truncated_dropped.pl',
       't/037_invalid_database.pl',
+      't/038_save_logical_slots_shutdown.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/038_save_logical_slots_shutdown.pl b/src/test/recovery/t/038_save_logical_slots_shutdown.pl
new file mode 100644
index 0000000000..224a840a61
--- /dev/null
+++ b/src/test/recovery/t/038_save_logical_slots_shutdown.pl
@@ -0,0 +1,102 @@
+
+# Copyright (c) 2023, PostgreSQL Global Development Group
+
+# Test logical replication slots are always persisted to disk during a shutdown
+# checkpoint.
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+sub compare_confirmed_flush
+{
+	my ($node, $confirmed_flush_from_log) = @_;
+
+	# Fetch Latest checkpoint location from the control file
+	my ($stdout, $stderr) =
+	  run_command([ 'pg_controldata', $node->data_dir ]);
+	my @control_data      = split("\n", $stdout);
+	my $latest_checkpoint = undef;
+	foreach (@control_data)
+	{
+		if ($_ =~ /^Latest checkpoint location:\s*(.*)$/mg)
+		{
+			$latest_checkpoint = $1;
+			last;
+		}
+	}
+	die "Latest checkpoint location not found in control file\n"
+	  unless defined($latest_checkpoint);
+
+	# Is it same as the value read from log?
+	ok( $latest_checkpoint eq $confirmed_flush_from_log,
+		"Check that the slot's confirmed_flush LSN is the same as the latest_checkpoint location"
+	);
+
+	return;
+}
+
+# Initialize publisher node
+my $node_publisher = PostgreSQL::Test::Cluster->new('pub');
+$node_publisher->init(allows_streaming => 'logical');
+# Avoid checkpoint during the test, otherwise, the latest checkpoint location
+# will change.
+$node_publisher->append_conf(
+	'postgresql.conf', q{
+checkpoint_timeout = 1h
+autovacuum = off
+});
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = PostgreSQL::Test::Cluster->new('sub');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create tables
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tbl (id int)");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tbl (id int)");
+
+# Insert some data
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tbl VALUES (generate_series(1, 5));");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION pub FOR ALL TABLES");
+$node_subscriber->safe_psql('postgres',
+	"CREATE SUBSCRIPTION sub CONNECTION '$publisher_connstr' PUBLICATION pub"
+);
+
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'sub');
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tbl");
+
+is($result, qq(5), "check initial copy was done");
+
+my $offset = -s $node_publisher->logfile;
+
+# Restart the publisher to ensure that the slot will be persisted if required
+$node_publisher->restart();
+
+# Wait until the walsender creates decoding context
+$node_publisher->wait_for_log(
+	qr/Streaming transactions committing after ([A-F0-9]+\/[A-F0-9]+), reading WAL from ([A-F0-9]+\/[A-F0-9]+)./,
+	$offset);
+
+# Extract confirmed_flush from the logfile
+my $log_contents = slurp_file($node_publisher->logfile, $offset);
+$log_contents =~
+  qr/Streaming transactions committing after ([A-F0-9]+\/[A-F0-9]+), reading WAL from ([A-F0-9]+\/[A-F0-9]+)./
+  or die "could not get confirmed_flush_lsn";
+
+# Ensure that the slot's confirmed_flush LSN is the same as the
+# latest_checkpoint location.
+compare_confirmed_flush($node_publisher, $1);
+
+done_testing();
-- 
2.28.0.windows.1

#57

michael@paquier.xyz

over 2 years ago

In reply to: Amit Kapila (#56)

Re: persist logical slots to disk during shutdown checkpoint

On Fri, Sep 08, 2023 at 09:04:43AM +0530, Amit Kapila wrote:

I have added something on these lines and also changed the other
comment pointed out by you. In the passing, I made minor cosmetic
changes as well.

+ * We can flush dirty replication slots at regular intervals by any
+ * background process like bgwriter but checkpoint is a convenient location.

I don't see a need to refer to the bgwriter here. On the contrary, it
can be confusing as one could think that this flush happens in the
bgwriter, but that's not the case currently as only the checkpointer
does that.

+ * We won't ensure that the slot is persisted after the

s/persisted/flushed/? Or just refer to the "slot's data being
flushed", or refer to "the slot's data is made durable" instead? The
use of "persist" here is confusing, because a slot's persistence
refers to it as being a *candidate* for flush (compared to an
ephemeral slot), and it does not refer to the *fact* of flushing its
data to make sure that it survives a crash. In the context of this
patch, the LSN value tracked in the slot's in-memory data refers to
the last point where the slot's data has been flushed.

+    /*
+     * This is used to track the last persisted confirmed_flush LSN value to
+     * detect any divergence in the in-memory and on-disk values for the same.
+     */

"This value tracks is the last LSN where this slot's data has been
flushed to disk. This is used during a checkpoint shutdown to decide
if a logical slot's data should be forcibly flushed or not."

Hmm. WAL senders are shut down *after* the checkpointer and *after*
the shutdown checkpoint is finished (see PM_SHUTDOWN and
PM_SHUTDOWN_2) because we want the WAL senders to acknowledge the
checkpoint record before shutting down the primary. In order to limit
the number of records to work on after a restart, what this patch is
proposing is an improvement. Perhaps it would be better to document
that we don't care about the potential concurrent activity of logical
WAL senders in this case and that the LSN we are saving at is a best
effort, meaning that last_saved_confirmed_flush is just here to reduce
the damage on a follow-up restart? The comment added in
CheckPointReplicationSlots() goes in this direction, but perhaps this
potential concurrent activity should be mentioned?
--
Michael

#58

amit.kapila16@gmail.com

over 2 years ago

In reply to: Michael Paquier (#57)

Re: persist logical slots to disk during shutdown checkpoint

On Fri, Sep 8, 2023 at 10:08 AM Michael Paquier <michael@paquier.xyz> wrote:

+    /*
+     * This is used to track the last persisted confirmed_flush LSN value to
+     * detect any divergence in the in-memory and on-disk values for the same.
+     */
"This value tracks is the last LSN where this slot's data has been
flushed to disk.

This makes the comment vague as this sounds like we are saving a slot
corresponding to some LSN which is not the case. If you prefer this
tone then we can instead say: "This value tracks the last
confirmed_flush LSN flushed which is used during a checkpoint shutdown
to decide if a logical slot's data should be forcibly flushed or not."

This is used during a checkpoint shutdown to decide
if a logical slot's data should be forcibly flushed or not."

Hmm. WAL senders are shut down *after* the checkpointer and *after*
the shutdown checkpoint is finished (see PM_SHUTDOWN and
PM_SHUTDOWN_2) because we want the WAL senders to acknowledge the
checkpoint record before shutting down the primary.

As per my understanding, this is not true for logical walsenders. As
per code, while HandleCheckpointerInterrupts(), we call ShutdownXLOG()
which sends a signal to walsender to stop and waits for it to stop.
And only after that, did it write a shutdown checkpoint WAL record.
After getting the InitStopping signal, walsender sets got_STOPPING
flag. Then *logical* walsender ensures that it sends all the pending
WAL and exits. What you have quoted is probably true for physical
walsenders.

In order to limit
the number of records to work on after a restart, what this patch is
proposing is an improvement. Perhaps it would be better to document
that we don't care about the potential concurrent activity of logical
WAL senders in this case and that the LSN we are saving at is a best
effort, meaning that last_saved_confirmed_flush is just here to reduce
the damage on a follow-up restart?

Unless I am wrong, there shouldn't be any concurrent activity for
logical walsenders. IIRC, it is a mandatory requirement for logical
walsenders to stop before shutdown checkpointer to avoid panic error.
We do handle logical walsnders differently because they can generate
WAL during decoding.

The comment added in
CheckPointReplicationSlots() goes in this direction, but perhaps this
potential concurrent activity should be mentioned?

Sure, we can change it if required.

--
With Regards,
Amit Kapila.

#59

michael@paquier.xyz

over 2 years ago

In reply to: Amit Kapila (#58)

Re: persist logical slots to disk during shutdown checkpoint

On Fri, Sep 08, 2023 at 11:50:37AM +0530, Amit Kapila wrote:

On Fri, Sep 8, 2023 at 10:08 AM Michael Paquier <michael@paquier.xyz> wrote:
+    /*
+     * This is used to track the last persisted confirmed_flush LSN value to
+     * detect any divergence in the in-memory and on-disk values for the same.
+     */
"This value tracks is the last LSN where this slot's data has been
flushed to disk.
This makes the comment vague as this sounds like we are saving a slot
corresponding to some LSN which is not the case. If you prefer this
tone then we can instead say: "This value tracks the last
confirmed_flush LSN flushed which is used during a checkpoint shutdown
to decide if a logical slot's data should be forcibly flushed or not."

Okay, that looks like an improvement over the term "persisted".

This is used during a checkpoint shutdown to decide
if a logical slot's data should be forcibly flushed or not."

Hmm. WAL senders are shut down *after* the checkpointer and *after*
the shutdown checkpoint is finished (see PM_SHUTDOWN and
PM_SHUTDOWN_2) because we want the WAL senders to acknowledge the
checkpoint record before shutting down the primary.

As per my understanding, this is not true for logical walsenders. As
per code, while HandleCheckpointerInterrupts(), we call ShutdownXLOG()
which sends a signal to walsender to stop and waits for it to stop.
And only after that, did it write a shutdown checkpoint WAL record.
After getting the InitStopping signal, walsender sets got_STOPPING
flag. Then *logical* walsender ensures that it sends all the pending
WAL and exits. What you have quoted is probably true for physical
walsenders.

Hm, reminding me about this area.. This roots down to the handling of
WalSndCaughtUp in the send_data callback for logical or physical.
This is switched to true for logical WAL senders much earlier than
physical WAL senders, aka before the shutdown checkpoint begins in the
latter. What was itching me a bit is that the postmaster logic could
be made more solid. Logical and physical WAL senders are both marked
as BACKEND_TYPE_WALSND, but we don't actually check that the WAL
senders remaining at the end of PM_SHUTDOWN_2 are *not* connected to a
database. This would require a new BACKEND_TYPE_* perhaps, or perhaps
we're fine with the current state because we'll catch up problems in
the checkpointer if any WAL is generated while the shutdown checkpoint
is running anyway. Just something I got in mind, unrelated to this
patch.

In order to limit
the number of records to work on after a restart, what this patch is
proposing is an improvement. Perhaps it would be better to document
that we don't care about the potential concurrent activity of logical
WAL senders in this case and that the LSN we are saving at is a best
effort, meaning that last_saved_confirmed_flush is just here to reduce
the damage on a follow-up restart?

Unless I am wrong, there shouldn't be any concurrent activity for
logical walsenders. IIRC, it is a mandatory requirement for logical
walsenders to stop before shutdown checkpointer to avoid panic error.
We do handle logical walsnders differently because they can generate
WAL during decoding.

Yeah. See above.

The comment added in
CheckPointReplicationSlots() goes in this direction, but perhaps this
potential concurrent activity should be mentioned?

Sure, we can change it if required.

+ * We can flush dirty replication slots at regular intervals by any
+ * background process like bgwriter but checkpoint is a convenient location.

I still don't quite see a need to mention the bgwriter at all here..
That's just unrelated.

The comment block in CheckPointReplicationSlots() from v10 uses
"persist", but you mean "flush", I guess..
--
Michael

#60

amit.kapila16@gmail.com

over 2 years ago

In reply to: Michael Paquier (#59)

1 attachment(s)

Re: persist logical slots to disk during shutdown checkpoint

On Mon, Sep 11, 2023 at 12:08 PM Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Sep 08, 2023 at 11:50:37AM +0530, Amit Kapila wrote:
On Fri, Sep 8, 2023 at 10:08 AM Michael Paquier <michael@paquier.xyz> wrote:
+    /*
+     * This is used to track the last persisted confirmed_flush LSN value to
+     * detect any divergence in the in-memory and on-disk values for the same.
+     */
"This value tracks is the last LSN where this slot's data has been
flushed to disk.
This makes the comment vague as this sounds like we are saving a slot
corresponding to some LSN which is not the case. If you prefer this
tone then we can instead say: "This value tracks the last
confirmed_flush LSN flushed which is used during a checkpoint shutdown
to decide if a logical slot's data should be forcibly flushed or not."
Okay, that looks like an improvement over the term "persisted".

Changed accordingly.

This is used during a checkpoint shutdown to decide
if a logical slot's data should be forcibly flushed or not."

Hmm. WAL senders are shut down *after* the checkpointer and *after*
the shutdown checkpoint is finished (see PM_SHUTDOWN and
PM_SHUTDOWN_2) because we want the WAL senders to acknowledge the
checkpoint record before shutting down the primary.

As per my understanding, this is not true for logical walsenders. As
per code, while HandleCheckpointerInterrupts(), we call ShutdownXLOG()
which sends a signal to walsender to stop and waits for it to stop.
And only after that, did it write a shutdown checkpoint WAL record.
After getting the InitStopping signal, walsender sets got_STOPPING
flag. Then *logical* walsender ensures that it sends all the pending
WAL and exits. What you have quoted is probably true for physical
walsenders.

Hm, reminding me about this area.. This roots down to the handling of
WalSndCaughtUp in the send_data callback for logical or physical.
This is switched to true for logical WAL senders much earlier than
physical WAL senders, aka before the shutdown checkpoint begins in the
latter. What was itching me a bit is that the postmaster logic could
be made more solid. Logical and physical WAL senders are both marked
as BACKEND_TYPE_WALSND, but we don't actually check that the WAL
senders remaining at the end of PM_SHUTDOWN_2 are *not* connected to a
database. This would require a new BACKEND_TYPE_* perhaps, or perhaps
we're fine with the current state because we'll catch up problems in
the checkpointer if any WAL is generated while the shutdown checkpoint
is running anyway. Just something I got in mind, unrelated to this
patch.

I don't know if the difference is worth inventing a new BACKEND_TYPE_*
but if you think so then we can probably discuss this in a new thread.
I think we may want to improve some comments as a separate patch to
make this evident.

+ * We can flush dirty replication slots at regular intervals by any
+ * background process like bgwriter but checkpoint is a convenient location.
I still don't quite see a need to mention the bgwriter at all here..
That's just unrelated.

I don't disagree with it, so changed it in the attached patch.

The comment block in CheckPointReplicationSlots() from v10 uses
"persist", but you mean "flush", I guess..

This point is not very clear to me. Can you please quote the exact
comment if you think something needs to be changed?

--
With Regards,
Amit Kapila.

Attachments:

v11-0001-Persist-logical-slots-to-disk-during-a-shutdown-.patchapplication/octet-stream; name=v11-0001-Persist-logical-slots-to-disk-during-a-shutdown-.patchDownload

From 070bff5591f63fabe036de081ea6f2e3a795853e Mon Sep 17 00:00:00 2001
From: Julien Rouhaud <julien.rouhaud@free.fr>
Date: Fri, 14 Apr 2023 13:49:09 +0800
Subject: [PATCH v11] Persist logical slots to disk during a shutdown
 checkpoint if required.

It's entirely possible for a logical slot to have a confirmed_flush LSN
higher than the last value saved on disk while not being marked as dirty.
Currently, it is not a major problem but a later patch adding support for
the upgrade of slots relies on that value being properly persisted to disk.

It can also help avoid processing the same transactions again in some
boundary cases after the clean shutdown and restart. Say, we process some
transactions for which we didn't send anything downstream (the changes got
filtered) but the confirm_flush LSN is updated due to keepalives. As we
don't flush the latest value of confirm_flush LSN, it may lead to
processing the same changes again without this patch.

Author: Vignesh C, Julien Rouhaud, Kuroda Hayato based on suggestions by
Ashutosh Bapat
Reviewed-by: Amit Kapila, Dilip Kumar, Ashutosh Bapat, Michael Paquier, Peter Smith
Discussion: http://postgr.es/m/CAA4eK1JzJagMmb_E8D4au=GYQkxox0AfNBm1FbP7sy7t4YWXPQ@mail.gmail.com
Discussion: http://postgr.es/m/TYAPR01MB58664C81887B3AF2EB6B16E3F5939@TYAPR01MB5866.jpnprd01.prod.outlook.com
---
 src/backend/access/transam/xlog.c             |   2 +-
 src/backend/replication/slot.c                |  38 ++++++-
 src/include/replication/slot.h                |   9 +-
 src/test/recovery/meson.build                 |   1 +
 .../t/038_save_logical_slots_shutdown.pl      | 102 ++++++++++++++++++
 5 files changed, 146 insertions(+), 6 deletions(-)
 create mode 100644 src/test/recovery/t/038_save_logical_slots_shutdown.pl

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f6f8adc72a..f26c8d18a6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7039,7 +7039,7 @@ static void
 CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 {
 	CheckPointRelationMap();
-	CheckPointReplicationSlots();
+	CheckPointReplicationSlots(flags & CHECKPOINT_IS_SHUTDOWN);
 	CheckPointSnapBuild();
 	CheckPointLogicalRewriteHeap();
 	CheckPointReplicationOrigin();
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index bb09c4010f..63b80d321c 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -321,6 +321,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->candidate_xmin_lsn = InvalidXLogRecPtr;
 	slot->candidate_restart_valid = InvalidXLogRecPtr;
 	slot->candidate_restart_lsn = InvalidXLogRecPtr;
+	slot->last_saved_confirmed_flush = InvalidXLogRecPtr;
 
 	/*
 	 * Create the slot on disk.  We haven't actually marked the slot allocated
@@ -1572,11 +1573,14 @@ restart:
 /*
  * Flush all replication slots to disk.
  *
- * This needn't actually be part of a checkpoint, but it's a convenient
- * location.
+ * We can flush dirty replication slots at regular intervals by any
+ * background process but checkpoint is a convenient location. Additionally,
+ * in case of a shutdown checkpoint, we also identify the slots for which
+ * confirmed_flush LSN has been updated since the last time it persisted and
+ * flush them.
  */
 void
-CheckPointReplicationSlots(void)
+CheckPointReplicationSlots(bool is_shutdown)
 {
 	int			i;
 
@@ -1601,6 +1605,30 @@ CheckPointReplicationSlots(void)
 
 		/* save the slot to disk, locking is handled in SaveSlotToPath() */
 		sprintf(path, "pg_replslot/%s", NameStr(s->data.name));
+
+		/*
+		 * We won't ensure that the slot is persisted after the
+		 * confirmed_flush LSN is updated as that could lead to frequent
+		 * writes. However, we need to ensure that we do persist the slots at
+		 * the time of shutdown whose confirmed_flush LSN is changed since we
+		 * last saved the slot to disk. This will help in avoiding retreat of
+		 * the confirmed_flush LSN after restart. At other times, the
+		 * walsender keeps saving the slot from time to time as the
+		 * replication progresses, so there is no clear advantage of flushing
+		 * additional slots at the time of checkpoint.
+		 */
+		if (is_shutdown && SlotIsLogical(s))
+		{
+			SpinLockAcquire(&s->mutex);
+			if (s->data.invalidated == RS_INVAL_NONE &&
+				s->data.confirmed_flush != s->last_saved_confirmed_flush)
+			{
+				s->just_dirtied = true;
+				s->dirty = true;
+			}
+			SpinLockRelease(&s->mutex);
+		}
+
 		SaveSlotToPath(s, path, LOG);
 	}
 	LWLockRelease(ReplicationSlotAllocationLock);
@@ -1873,11 +1901,12 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 
 	/*
 	 * Successfully wrote, unset dirty bit, unless somebody dirtied again
-	 * already.
+	 * already and remember the confirmed_flush LSN value.
 	 */
 	SpinLockAcquire(&slot->mutex);
 	if (!slot->just_dirtied)
 		slot->dirty = false;
+	slot->last_saved_confirmed_flush = cp.slotdata.confirmed_flush;
 	SpinLockRelease(&slot->mutex);
 
 	LWLockRelease(&slot->io_in_progress_lock);
@@ -2074,6 +2103,7 @@ RestoreSlotFromDisk(const char *name)
 		/* initialize in memory state */
 		slot->effective_xmin = cp.slotdata.xmin;
 		slot->effective_catalog_xmin = cp.slotdata.catalog_xmin;
+		slot->last_saved_confirmed_flush = cp.slotdata.confirmed_flush;
 
 		slot->candidate_catalog_xmin = InvalidTransactionId;
 		slot->candidate_xmin_lsn = InvalidXLogRecPtr;
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index a8a89dc784..758ca79a81 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -178,6 +178,13 @@ typedef struct ReplicationSlot
 	XLogRecPtr	candidate_xmin_lsn;
 	XLogRecPtr	candidate_restart_valid;
 	XLogRecPtr	candidate_restart_lsn;
+
+	/*
+	 * This value tracks the last confirmed_flush LSN flushed which is used
+	 * during a shutdown checkpoint to decide if logical's slot data should be
+	 * forcibly flushed or not.
+	 */
+	XLogRecPtr	last_saved_confirmed_flush;
 } ReplicationSlot;
 
 #define SlotIsPhysical(slot) ((slot)->data.database == InvalidOid)
@@ -241,7 +248,7 @@ extern void ReplicationSlotNameForTablesync(Oid suboid, Oid relid, char *syncslo
 extern void ReplicationSlotDropAtPubNode(WalReceiverConn *wrconn, char *slotname, bool missing_ok);
 
 extern void StartupReplicationSlots(void);
-extern void CheckPointReplicationSlots(void);
+extern void CheckPointReplicationSlots(bool is_shutdown);
 
 extern void CheckSlotRequirements(void);
 extern void CheckSlotPermissions(void);
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index e7328e4894..646d6ffde4 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -43,6 +43,7 @@ tests += {
       't/035_standby_logical_decoding.pl',
       't/036_truncated_dropped.pl',
       't/037_invalid_database.pl',
+      't/038_save_logical_slots_shutdown.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/038_save_logical_slots_shutdown.pl b/src/test/recovery/t/038_save_logical_slots_shutdown.pl
new file mode 100644
index 0000000000..224a840a61
--- /dev/null
+++ b/src/test/recovery/t/038_save_logical_slots_shutdown.pl
@@ -0,0 +1,102 @@
+
+# Copyright (c) 2023, PostgreSQL Global Development Group
+
+# Test logical replication slots are always persisted to disk during a shutdown
+# checkpoint.
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+sub compare_confirmed_flush
+{
+	my ($node, $confirmed_flush_from_log) = @_;
+
+	# Fetch Latest checkpoint location from the control file
+	my ($stdout, $stderr) =
+	  run_command([ 'pg_controldata', $node->data_dir ]);
+	my @control_data      = split("\n", $stdout);
+	my $latest_checkpoint = undef;
+	foreach (@control_data)
+	{
+		if ($_ =~ /^Latest checkpoint location:\s*(.*)$/mg)
+		{
+			$latest_checkpoint = $1;
+			last;
+		}
+	}
+	die "Latest checkpoint location not found in control file\n"
+	  unless defined($latest_checkpoint);
+
+	# Is it same as the value read from log?
+	ok( $latest_checkpoint eq $confirmed_flush_from_log,
+		"Check that the slot's confirmed_flush LSN is the same as the latest_checkpoint location"
+	);
+
+	return;
+}
+
+# Initialize publisher node
+my $node_publisher = PostgreSQL::Test::Cluster->new('pub');
+$node_publisher->init(allows_streaming => 'logical');
+# Avoid checkpoint during the test, otherwise, the latest checkpoint location
+# will change.
+$node_publisher->append_conf(
+	'postgresql.conf', q{
+checkpoint_timeout = 1h
+autovacuum = off
+});
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = PostgreSQL::Test::Cluster->new('sub');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create tables
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tbl (id int)");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tbl (id int)");
+
+# Insert some data
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tbl VALUES (generate_series(1, 5));");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION pub FOR ALL TABLES");
+$node_subscriber->safe_psql('postgres',
+	"CREATE SUBSCRIPTION sub CONNECTION '$publisher_connstr' PUBLICATION pub"
+);
+
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'sub');
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tbl");
+
+is($result, qq(5), "check initial copy was done");
+
+my $offset = -s $node_publisher->logfile;
+
+# Restart the publisher to ensure that the slot will be persisted if required
+$node_publisher->restart();
+
+# Wait until the walsender creates decoding context
+$node_publisher->wait_for_log(
+	qr/Streaming transactions committing after ([A-F0-9]+\/[A-F0-9]+), reading WAL from ([A-F0-9]+\/[A-F0-9]+)./,
+	$offset);
+
+# Extract confirmed_flush from the logfile
+my $log_contents = slurp_file($node_publisher->logfile, $offset);
+$log_contents =~
+  qr/Streaming transactions committing after ([A-F0-9]+\/[A-F0-9]+), reading WAL from ([A-F0-9]+\/[A-F0-9]+)./
+  or die "could not get confirmed_flush_lsn";
+
+# Ensure that the slot's confirmed_flush LSN is the same as the
+# latest_checkpoint location.
+compare_confirmed_flush($node_publisher, $1);
+
+done_testing();
-- 
2.28.0.windows.1

#61

michael@paquier.xyz

over 2 years ago

In reply to: Amit Kapila (#60)

1 attachment(s)

Re: persist logical slots to disk during shutdown checkpoint

On Mon, Sep 11, 2023 at 02:49:49PM +0530, Amit Kapila wrote:

I don't know if the difference is worth inventing a new BACKEND_TYPE_*
but if you think so then we can probably discuss this in a new thread.
I think we may want to improve some comments as a separate patch to
make this evident.

The comments in postmaster.c could be improved, at least. There is no
need to discuss that here.

This point is not very clear to me. Can you please quote the exact
comment if you think something needs to be changed?

Hmm. Don't think that's it yet..

Please see the v11 attached, that rewords all the places of the patch
that need clarifications IMO. I've found that the comment additions
in CheckPointReplicationSlots() to be overcomplicated:
- The key point to force a flush of a slot if its confirmed_lsn has
moved ahead of the last LSN where it was saved is to make the follow
up restart more responsive.
- Not sure that there is any point to mention the other code paths in
the tree where ReplicationSlotSave() can be called, and a slot can be
saved in other processes than just WAL senders (like slot
manipulations in normal backends, for one). This was the last
sentence in v10.
- Persist is incorrect in this context in the tests, slot.c and
slot.h, as it should refer to the slot's data being flushed, saved or
just "made durable" because this is what the new last saved LSN is
here for. Persistence is a slot property, and does not refer to the
fact of flushing the data IMO.

+           if (s->data.invalidated == RS_INVAL_NONE &&
+               s->data.confirmed_flush != s->last_saved_confirmed_flush)

Actually this is incorrect, no? Shouldn't we make sure that the
confirmed_flush is strictly higher than the last saved LSN?
--
Michael

Attachments:

v11-0001-Flush-logical-slots-to-disk-during-a-shutdown-ch.patchtext/x-diff; charset=us-asciiDownload

From 0aa4c8245359b60001ed90de1193dbb8b2b6c91c Mon Sep 17 00:00:00 2001
From: Julien Rouhaud <julien.rouhaud@free.fr>
Date: Fri, 14 Apr 2023 13:49:09 +0800
Subject: [PATCH v11] Flush logical slots to disk during a shutdown checkpoint
 if required.

It's entirely possible for a logical slot to have a confirmed_flush LSN
higher than the last value saved on disk while not being marked as dirty.
Currently, it is not a major problem but a later patch adding support for
the upgrade of slots relies on that value being properly flushed to disk.

It can also help avoid processing the same transactions again in some
boundary cases after the clean shutdown and restart.  Say, we process
some transactions for which we didn't send anything downstream (the
changes got filtered) but the confirm_flush LSN is updated due to
keepalives.  As we don't flush the latest value of confirm_flush LSN, it
may lead to processing the same changes again without this patch.

The approach taken by this patch has been suggested by Ashutosh Bapat.

Author: Vignesh C, Julien Rouhaud, Kuroda Hayato
Reviewed-by: Amit Kapila, Dilip Kumar, Ashutosh Bapat, Peter Smith
Discussion: http://postgr.es/m/CAA4eK1JzJagMmb_E8D4au=GYQkxox0AfNBm1FbP7sy7t4YWXPQ@mail.gmail.com
Discussion: http://postgr.es/m/TYAPR01MB58664C81887B3AF2EB6B16E3F5939@TYAPR01MB5866.jpnprd01.prod.outlook.com
---
 src/include/replication/slot.h                |   8 +-
 src/backend/access/transam/xlog.c             |   2 +-
 src/backend/replication/slot.c                |  32 +++++-
 src/test/recovery/meson.build                 |   1 +
 .../t/038_save_logical_slots_shutdown.pl      | 102 ++++++++++++++++++
 5 files changed, 140 insertions(+), 5 deletions(-)
 create mode 100644 src/test/recovery/t/038_save_logical_slots_shutdown.pl

diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index a8a89dc784..5e60030234 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -178,6 +178,12 @@ typedef struct ReplicationSlot
 	XLogRecPtr	candidate_xmin_lsn;
 	XLogRecPtr	candidate_restart_valid;
 	XLogRecPtr	candidate_restart_lsn;
+
+	/*
+	 * LSN used to track the last confirmed_flush LSN where the slot's data
+	 * has been flushed to disk.
+	 */
+	XLogRecPtr	last_saved_confirmed_flush;
 } ReplicationSlot;
 
 #define SlotIsPhysical(slot) ((slot)->data.database == InvalidOid)
@@ -241,7 +247,7 @@ extern void ReplicationSlotNameForTablesync(Oid suboid, Oid relid, char *syncslo
 extern void ReplicationSlotDropAtPubNode(WalReceiverConn *wrconn, char *slotname, bool missing_ok);
 
 extern void StartupReplicationSlots(void);
-extern void CheckPointReplicationSlots(void);
+extern void CheckPointReplicationSlots(bool is_shutdown);
 
 extern void CheckSlotRequirements(void);
 extern void CheckSlotPermissions(void);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f6f8adc72a..f26c8d18a6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7039,7 +7039,7 @@ static void
 CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 {
 	CheckPointRelationMap();
-	CheckPointReplicationSlots();
+	CheckPointReplicationSlots(flags & CHECKPOINT_IS_SHUTDOWN);
 	CheckPointSnapBuild();
 	CheckPointLogicalRewriteHeap();
 	CheckPointReplicationOrigin();
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index bb09c4010f..a31c7867cf 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -321,6 +321,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->candidate_xmin_lsn = InvalidXLogRecPtr;
 	slot->candidate_restart_valid = InvalidXLogRecPtr;
 	slot->candidate_restart_lsn = InvalidXLogRecPtr;
+	slot->last_saved_confirmed_flush = InvalidXLogRecPtr;
 
 	/*
 	 * Create the slot on disk.  We haven't actually marked the slot allocated
@@ -1573,10 +1574,12 @@ restart:
  * Flush all replication slots to disk.
  *
  * This needn't actually be part of a checkpoint, but it's a convenient
- * location.
+ * location.  Additionally, in case of a shutdown checkpoint, we also identify
+ * the slots for which the confirmed_flush LSN has been updated since the last
+ * time it was saved and flush them.
  */
 void
-CheckPointReplicationSlots(void)
+CheckPointReplicationSlots(bool is_shutdown)
 {
 	int			i;
 
@@ -1601,6 +1604,27 @@ CheckPointReplicationSlots(void)
 
 		/* save the slot to disk, locking is handled in SaveSlotToPath() */
 		sprintf(path, "pg_replslot/%s", NameStr(s->data.name));
+
+		/*
+		 * Slot's data is not flushed each time the confirmed_flush LSN is
+		 * updated as that could lead to frequent writes.  However, we decide
+		 * to force a flush of all logical slot's data at the time of shutdown
+		 * if the confirmed_flush LSN is changed since we last flushed it to
+		 * disk.  This helps in avoiding an unnecessary retreat of the
+		 * confirmed_flush LSN after restart.
+		 */
+		if (is_shutdown && SlotIsLogical(s))
+		{
+			SpinLockAcquire(&s->mutex);
+			if (s->data.invalidated == RS_INVAL_NONE &&
+				s->data.confirmed_flush > s->last_saved_confirmed_flush)
+			{
+				s->just_dirtied = true;
+				s->dirty = true;
+			}
+			SpinLockRelease(&s->mutex);
+		}
+
 		SaveSlotToPath(s, path, LOG);
 	}
 	LWLockRelease(ReplicationSlotAllocationLock);
@@ -1873,11 +1897,12 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 
 	/*
 	 * Successfully wrote, unset dirty bit, unless somebody dirtied again
-	 * already.
+	 * already and remember the confirmed_flush LSN value.
 	 */
 	SpinLockAcquire(&slot->mutex);
 	if (!slot->just_dirtied)
 		slot->dirty = false;
+	slot->last_saved_confirmed_flush = cp.slotdata.confirmed_flush;
 	SpinLockRelease(&slot->mutex);
 
 	LWLockRelease(&slot->io_in_progress_lock);
@@ -2074,6 +2099,7 @@ RestoreSlotFromDisk(const char *name)
 		/* initialize in memory state */
 		slot->effective_xmin = cp.slotdata.xmin;
 		slot->effective_catalog_xmin = cp.slotdata.catalog_xmin;
+		slot->last_saved_confirmed_flush = cp.slotdata.confirmed_flush;
 
 		slot->candidate_catalog_xmin = InvalidTransactionId;
 		slot->candidate_xmin_lsn = InvalidXLogRecPtr;
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index e7328e4894..646d6ffde4 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -43,6 +43,7 @@ tests += {
       't/035_standby_logical_decoding.pl',
       't/036_truncated_dropped.pl',
       't/037_invalid_database.pl',
+      't/038_save_logical_slots_shutdown.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/038_save_logical_slots_shutdown.pl b/src/test/recovery/t/038_save_logical_slots_shutdown.pl
new file mode 100644
index 0000000000..de19829560
--- /dev/null
+++ b/src/test/recovery/t/038_save_logical_slots_shutdown.pl
@@ -0,0 +1,102 @@
+
+# Copyright (c) 2023, PostgreSQL Global Development Group
+
+# Test logical replication slots are always flushed to disk during a shutdown
+# checkpoint.
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+sub compare_confirmed_flush
+{
+	my ($node, $confirmed_flush_from_log) = @_;
+
+	# Fetch Latest checkpoint location from the control file
+	my ($stdout, $stderr) =
+	  run_command([ 'pg_controldata', $node->data_dir ]);
+	my @control_data = split("\n", $stdout);
+	my $latest_checkpoint = undef;
+	foreach (@control_data)
+	{
+		if ($_ =~ /^Latest checkpoint location:\s*(.*)$/mg)
+		{
+			$latest_checkpoint = $1;
+			last;
+		}
+	}
+	die "Latest checkpoint location not found in control file\n"
+	  unless defined($latest_checkpoint);
+
+	# Is it same as the value read from log?
+	ok( $latest_checkpoint eq $confirmed_flush_from_log,
+		"Check that the slot's confirmed_flush LSN is the same as the latest_checkpoint location"
+	);
+
+	return;
+}
+
+# Initialize publisher node
+my $node_publisher = PostgreSQL::Test::Cluster->new('pub');
+$node_publisher->init(allows_streaming => 'logical');
+# Avoid checkpoint during the test, otherwise, the latest checkpoint location
+# will change.
+$node_publisher->append_conf(
+	'postgresql.conf', q{
+checkpoint_timeout = 1h
+autovacuum = off
+});
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = PostgreSQL::Test::Cluster->new('sub');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create tables
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tbl (id int)");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tbl (id int)");
+
+# Insert some data
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tbl VALUES (generate_series(1, 5));");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION pub FOR ALL TABLES");
+$node_subscriber->safe_psql('postgres',
+	"CREATE SUBSCRIPTION sub CONNECTION '$publisher_connstr' PUBLICATION pub"
+);
+
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'sub');
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tbl");
+
+is($result, qq(5), "check initial copy was done");
+
+my $offset = -s $node_publisher->logfile;
+
+# Restart the publisher to ensure that the slot will be flushed if required
+$node_publisher->restart();
+
+# Wait until the walsender creates decoding context
+$node_publisher->wait_for_log(
+	qr/Streaming transactions committing after ([A-F0-9]+\/[A-F0-9]+), reading WAL from ([A-F0-9]+\/[A-F0-9]+)./,
+	$offset);
+
+# Extract confirmed_flush from the logfile
+my $log_contents = slurp_file($node_publisher->logfile, $offset);
+$log_contents =~
+  qr/Streaming transactions committing after ([A-F0-9]+\/[A-F0-9]+), reading WAL from ([A-F0-9]+\/[A-F0-9]+)./
+  or die "could not get confirmed_flush_lsn";
+
+# Ensure that the slot's confirmed_flush LSN is the same as the
+# latest_checkpoint location.
+compare_confirmed_flush($node_publisher, $1);
+
+done_testing();
-- 
2.40.1

#62

amit.kapila16@gmail.com

over 2 years ago

In reply to: Michael Paquier (#61)

Re: persist logical slots to disk during shutdown checkpoint

On Tue, Sep 12, 2023 at 10:55 AM Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Sep 11, 2023 at 02:49:49PM +0530, Amit Kapila wrote:

Please see the v11 attached, that rewords all the places of the patch
that need clarifications IMO. I've found that the comment additions
in CheckPointReplicationSlots() to be overcomplicated:
- The key point to force a flush of a slot if its confirmed_lsn has
moved ahead of the last LSN where it was saved is to make the follow
up restart more responsive.

I don't think it will become more responsive in any way, not sure what
made you think like that. The key idea is that after restart we want
to ensure that all the WAL data up to the shutdown checkpoint record
is sent to downstream. As mentioned in the commit message, this will
help in ensuring that upgrades don't miss any data and then there is
another small advantage as mentioned in the commit message.

- Not sure that there is any point to mention the other code paths in
the tree where ReplicationSlotSave() can be called, and a slot can be
saved in other processes than just WAL senders (like slot
manipulations in normal backends, for one). This was the last
sentence in v10.

The point was the earlier sentence is no longer true and keeping it as
it is could be wrong or at least misleading. For example, earlier it
is okay to say, "This needn't actually be part of a checkpoint, ..."
but now that is no longer true as we want to invoke this at the time
of shutdown checkpoint for correctness. If we want to be precise, we
can say, "It is convenient to flush dirty replication slots at the
time of checkpoint. Additionally, .."

+           if (s->data.invalidated == RS_INVAL_NONE &&
+               s->data.confirmed_flush != s->last_saved_confirmed_flush)
Actually this is incorrect, no? Shouldn't we make sure that the
confirmed_flush is strictly higher than the last saved LSN?

I can't see why it is incorrect. Do you see how (in what scenario) it
could go wrong? As per my understanding, confirmed_flush LSN will
always be greater than equal to last_saved_confirmed_flush but we
don't want to ensure that point here because we just want if the
latest value is not the same then we should mark the slot dirty and
flush it as that will be location we have ensured to update before
walsender shutdown. I think it is better to add an assert if you are
worried about any such case and we had thought of adding it as well
but then didn't do it because we don't have matching asserts to ensure
that we never assign prior LSN value to consfirmed_flush LSN.

+ /*
+ * LSN used to track the last confirmed_flush LSN where the slot's data
+ * has been flushed to disk.
+ */
+ XLogRecPtr last_saved_confirmed_flush;

I don't want to argue on such a point because it is a little bit of a
matter of personal choice but I find this comment unclear. It seems to
read that confirmed_flush LSN is some LSN position which is where we
flushed the slot's data and that is not true. I found the last comment
in the patch sent by me: "This value tracks the last confirmed_flush
LSN flushed which is used during a shutdown checkpoint to decide if
logical's slot data should be forcibly flushed or not." which I feel
we agreed upon is better.

--
With Regards,
Amit Kapila.

#63

michael@paquier.xyz

over 2 years ago

In reply to: Amit Kapila (#62)

Re: persist logical slots to disk during shutdown checkpoint

On Tue, Sep 12, 2023 at 03:15:44PM +0530, Amit Kapila wrote:

I don't think it will become more responsive in any way, not sure what
made you think like that. The key idea is that after restart we want
to ensure that all the WAL data up to the shutdown checkpoint record
is sent to downstream. As mentioned in the commit message, this will
help in ensuring that upgrades don't miss any data and then there is
another small advantage as mentioned in the commit message.

Good thing I did not use the term "responsive" in the previous patch I
posted. My apologies if you found that confusing. Let's say, "to
prevent unnecessary retreat", then ;)

- Not sure that there is any point to mention the other code paths in
the tree where ReplicationSlotSave() can be called, and a slot can be
saved in other processes than just WAL senders (like slot
manipulations in normal backends, for one). This was the last
sentence in v10.

The point was the earlier sentence is no longer true and keeping it as
it is could be wrong or at least misleading. For example, earlier it
is okay to say, "This needn't actually be part of a checkpoint, ..."
but now that is no longer true as we want to invoke this at the time
of shutdown checkpoint for correctness. If we want to be precise, we

How so? This is just a reference about the fact that using a
checkpoint path for this stuff is useful. A shutdown checkpoint is
still a checkpoint, done by the checkpointer. The background writer
is not concerned by that.

can say, "It is convenient to flush dirty replication slots at the
time of checkpoint. Additionally, .."

Okay by mr to reword the top comment of CheckPointReplicationSlots()
to use these terms, if you feel strongly about it.

+           if (s->data.invalidated == RS_INVAL_NONE &&
+               s->data.confirmed_flush != s->last_saved_confirmed_flush)
Actually this is incorrect, no? Shouldn't we make sure that the
confirmed_flush is strictly higher than the last saved LSN?
I can't see why it is incorrect. Do you see how (in what scenario) it
could go wrong? As per my understanding, confirmed_flush LSN will
always be greater than equal to last_saved_confirmed_flush but we
don't want to ensure that point here because we just want if the
latest value is not the same then we should mark the slot dirty and
flush it as that will be location we have ensured to update before
walsender shutdown. I think it is better to add an assert if you are
worried about any such case and we had thought of adding it as well
but then didn't do it because we don't have matching asserts to ensure
that we never assign prior LSN value to consfirmed_flush LSN.

Because that's just safer in the long run, and I don't see why we
cannot just do that? Imagine, for instance, that a bug doing an
incorrect manipulation of a logical slot's data does an incorrect
computation of this field, and that we finish with in-memory data
that's older than what was previously saved. The code may cause a
flush at an incorrect, past, position. That's just an assumption from
my side, of course.

I don't want to argue on such a point because it is a little bit of a
matter of personal choice but I find this comment unclear. It seems to
read that confirmed_flush LSN is some LSN position which is where we
flushed the slot's data and that is not true. I found the last comment
in the patch sent by me: "This value tracks the last confirmed_flush
LSN flushed which is used during a shutdown checkpoint to decide if
logical's slot data should be forcibly flushed or not." which I feel
we agreed upon is better.

Okay, fine by me here.
--
Michael

#64

amit.kapila16@gmail.com

over 2 years ago

In reply to: Michael Paquier (#63)

Re: persist logical slots to disk during shutdown checkpoint

On Wed, Sep 13, 2023 at 10:57 AM Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Sep 12, 2023 at 03:15:44PM +0530, Amit Kapila wrote:

- Not sure that there is any point to mention the other code paths in
the tree where ReplicationSlotSave() can be called, and a slot can be
saved in other processes than just WAL senders (like slot
manipulations in normal backends, for one). This was the last
sentence in v10.

The point was the earlier sentence is no longer true and keeping it as
it is could be wrong or at least misleading. For example, earlier it
is okay to say, "This needn't actually be part of a checkpoint, ..."
but now that is no longer true as we want to invoke this at the time
of shutdown checkpoint for correctness. If we want to be precise, we

How so?

Consider if we move this call to bgwriter (aka flushing slots is no
longer part of a checkpoint), Would that be okay? Previously, I think
it was okay but not now. I see an argument to keep that as it is as
well because we have already mentioned the special shutdown checkpoint
case. By the way, I have changed this because Ashutosh felt it is no
longer correct to keep the first sentence as it is. See his email[1]
(Relying on the first sentence, ...). It happened previously as well
that different reviewers working in this area have different views on
this sort of thing. I am trying my best to address the review
comments, especially from experienced hackers but personally, I feel
this is a minor nitpick and isn't worth too much argument, either way,
should be okay.

+           if (s->data.invalidated == RS_INVAL_NONE &&
+               s->data.confirmed_flush != s->last_saved_confirmed_flush)
Actually this is incorrect, no? Shouldn't we make sure that the
confirmed_flush is strictly higher than the last saved LSN?
I can't see why it is incorrect. Do you see how (in what scenario) it
could go wrong? As per my understanding, confirmed_flush LSN will
always be greater than equal to last_saved_confirmed_flush but we
don't want to ensure that point here because we just want if the
latest value is not the same then we should mark the slot dirty and
flush it as that will be location we have ensured to update before
walsender shutdown. I think it is better to add an assert if you are
worried about any such case and we had thought of adding it as well
but then didn't do it because we don't have matching asserts to ensure
that we never assign prior LSN value to consfirmed_flush LSN.
Because that's just safer in the long run, and I don't see why we
cannot just do that? Imagine, for instance, that a bug doing an
incorrect manipulation of a logical slot's data does an incorrect
computation of this field, and that we finish with in-memory data
that's older than what was previously saved. The code may cause a
flush at an incorrect, past, position. That's just an assumption from
my side, of course.

If you are worried about such bugs, it would be better to have an
Assert as suggested previously rather than greater than check because
we will at least catch such bugs otherwise it can go unnoticed or in
the worst case will lead to unknown consequences. I am saying this
because if there are such bugs (or got introduced later) then the slot
can be flushed with a prior confirmed_flush location even from other
code paths. Just for reference, we don't have any check ensuring that
confirmed_flush LSN can move backward in function
LogicalConfirmReceivedLocation(), see and also another place where we
update it:
else
{
SpinLockAcquire(&MyReplicationSlot->mutex);
MyReplicationSlot->data.confirmed_flush = lsn;
SpinLockRelease(&MyReplicationSlot->mutex);
}

As other places don't have an assert, I didn't add one here but we can
add one here.

I don't want to argue on such a point because it is a little bit of a
matter of personal choice but I find this comment unclear. It seems to
read that confirmed_flush LSN is some LSN position which is where we
flushed the slot's data and that is not true. I found the last comment
in the patch sent by me: "This value tracks the last confirmed_flush
LSN flushed which is used during a shutdown checkpoint to decide if
logical's slot data should be forcibly flushed or not." which I feel
we agreed upon is better.

Okay, fine by me here.

Thanks, will change once we agree on the remaining points.

--
With Regards,
Amit Kapila.

#65

michael@paquier.xyz

over 2 years ago

In reply to: Amit Kapila (#64)

Re: persist logical slots to disk during shutdown checkpoint

On Wed, Sep 13, 2023 at 12:07:12PM +0530, Amit Kapila wrote:

Consider if we move this call to bgwriter (aka flushing slots is no
longer part of a checkpoint), Would that be okay? Previously, I think
it was okay but not now. I see an argument to keep that as it is as
well because we have already mentioned the special shutdown checkpoint
case. By the way, I have changed this because Ashutosh felt it is no
longer correct to keep the first sentence as it is. See his email[1]
(Relying on the first sentence, ...).

Hmmm.. Okay..

As other places don't have an assert, I didn't add one here but we can
add one here.

I'd be OK with an assertion here at the end, though I'd still choose a
stricter run-time check if I were to apply that myself.
--
Michael

#66

amit.kapila16@gmail.com

over 2 years ago

In reply to: Michael Paquier (#65)

1 attachment(s)

Re: persist logical slots to disk during shutdown checkpoint

On Wed, Sep 13, 2023 at 12:45 PM Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Sep 13, 2023 at 12:07:12PM +0530, Amit Kapila wrote:

Consider if we move this call to bgwriter (aka flushing slots is no
longer part of a checkpoint), Would that be okay? Previously, I think
it was okay but not now. I see an argument to keep that as it is as
well because we have already mentioned the special shutdown checkpoint
case. By the way, I have changed this because Ashutosh felt it is no
longer correct to keep the first sentence as it is. See his email[1]
(Relying on the first sentence, ...).

Hmmm.. Okay..

The patch is updated as per recent discussion.

--
With Regards,
Amit Kapila.

Attachments:

v12-0001-Flush-logical-slots-to-disk-during-a-shutdown-ch.patchapplication/octet-stream; name=v12-0001-Flush-logical-slots-to-disk-during-a-shutdown-ch.patchDownload

From 8ef35462ed0b28925df9d9650bea8ea32e440b82 Mon Sep 17 00:00:00 2001
From: Julien Rouhaud <julien.rouhaud@free.fr>
Date: Fri, 14 Apr 2023 13:49:09 +0800
Subject: [PATCH v12] Flush logical slots to disk during a shutdown checkpoint
 if required.

It's entirely possible for a logical slot to have a confirmed_flush LSN
higher than the last value saved on disk while not being marked as dirty.
Currently, it is not a major problem but a later patch adding support for
the upgrade of slots relies on that value being properly flushed to disk.

It can also help avoid processing the same transactions again in some
boundary cases after the clean shutdown and restart.  Say, we process
some transactions for which we didn't send anything downstream (the
changes got filtered) but the confirm_flush LSN is updated due to
keepalives.  As we don't flush the latest value of confirm_flush LSN, it
may lead to processing the same changes again without this patch.

The approach taken by this patch has been suggested by Ashutosh Bapat.

Author: Vignesh C, Julien Rouhaud, Kuroda Hayato
Reviewed-by: Amit Kapila, Dilip Kumar, Ashutosh Bapat, Peter Smith, Hou Zhijie
Discussion: http://postgr.es/m/CAA4eK1JzJagMmb_E8D4au=GYQkxox0AfNBm1FbP7sy7t4YWXPQ@mail.gmail.com
Discussion: http://postgr.es/m/TYAPR01MB58664C81887B3AF2EB6B16E3F5939@TYAPR01MB5866.jpnprd01.prod.outlook.com
---
 src/backend/access/transam/xlog.c             |   2 +-
 src/backend/replication/slot.c                |  37 ++++++-
 src/include/replication/slot.h                |   9 +-
 src/test/recovery/meson.build                 |   1 +
 .../t/038_save_logical_slots_shutdown.pl      | 102 ++++++++++++++++++
 5 files changed, 145 insertions(+), 6 deletions(-)
 create mode 100644 src/test/recovery/t/038_save_logical_slots_shutdown.pl

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f6f8adc72a..f26c8d18a6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7039,7 +7039,7 @@ static void
 CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 {
 	CheckPointRelationMap();
-	CheckPointReplicationSlots();
+	CheckPointReplicationSlots(flags & CHECKPOINT_IS_SHUTDOWN);
 	CheckPointSnapBuild();
 	CheckPointLogicalRewriteHeap();
 	CheckPointReplicationOrigin();
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index bb09c4010f..3ded3c1473 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -321,6 +321,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->candidate_xmin_lsn = InvalidXLogRecPtr;
 	slot->candidate_restart_valid = InvalidXLogRecPtr;
 	slot->candidate_restart_lsn = InvalidXLogRecPtr;
+	slot->last_saved_confirmed_flush = InvalidXLogRecPtr;
 
 	/*
 	 * Create the slot on disk.  We haven't actually marked the slot allocated
@@ -1572,11 +1573,13 @@ restart:
 /*
  * Flush all replication slots to disk.
  *
- * This needn't actually be part of a checkpoint, but it's a convenient
- * location.
+ * It is convenient to flush dirty replication slots at the time of checkpoint.
+ * Additionally, in case of a shutdown checkpoint, we also identify the slots
+ * for which the confirmed_flush LSN has been updated since the last time it
+ * was saved and flush them.
  */
 void
-CheckPointReplicationSlots(void)
+CheckPointReplicationSlots(bool is_shutdown)
 {
 	int			i;
 
@@ -1601,6 +1604,30 @@ CheckPointReplicationSlots(void)
 
 		/* save the slot to disk, locking is handled in SaveSlotToPath() */
 		sprintf(path, "pg_replslot/%s", NameStr(s->data.name));
+
+		/*
+		 * Slot's data is not flushed each time the confirmed_flush LSN is
+		 * updated as that could lead to frequent writes.  However, we decide
+		 * to force a flush of all logical slot's data at the time of shutdown
+		 * if the confirmed_flush LSN is changed since we last flushed it to
+		 * disk.  This helps in avoiding an unnecessary retreat of the
+		 * confirmed_flush LSN after restart.
+		 */
+		if (is_shutdown && SlotIsLogical(s))
+		{
+			SpinLockAcquire(&s->mutex);
+
+			Assert(s->data.confirmed_flush >= s->last_saved_confirmed_flush);
+
+			if (s->data.invalidated == RS_INVAL_NONE &&
+				s->data.confirmed_flush != s->last_saved_confirmed_flush)
+			{
+				s->just_dirtied = true;
+				s->dirty = true;
+			}
+			SpinLockRelease(&s->mutex);
+		}
+
 		SaveSlotToPath(s, path, LOG);
 	}
 	LWLockRelease(ReplicationSlotAllocationLock);
@@ -1873,11 +1900,12 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 
 	/*
 	 * Successfully wrote, unset dirty bit, unless somebody dirtied again
-	 * already.
+	 * already and remember the confirmed_flush LSN value.
 	 */
 	SpinLockAcquire(&slot->mutex);
 	if (!slot->just_dirtied)
 		slot->dirty = false;
+	slot->last_saved_confirmed_flush = cp.slotdata.confirmed_flush;
 	SpinLockRelease(&slot->mutex);
 
 	LWLockRelease(&slot->io_in_progress_lock);
@@ -2074,6 +2102,7 @@ RestoreSlotFromDisk(const char *name)
 		/* initialize in memory state */
 		slot->effective_xmin = cp.slotdata.xmin;
 		slot->effective_catalog_xmin = cp.slotdata.catalog_xmin;
+		slot->last_saved_confirmed_flush = cp.slotdata.confirmed_flush;
 
 		slot->candidate_catalog_xmin = InvalidTransactionId;
 		slot->candidate_xmin_lsn = InvalidXLogRecPtr;
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index a8a89dc784..758ca79a81 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -178,6 +178,13 @@ typedef struct ReplicationSlot
 	XLogRecPtr	candidate_xmin_lsn;
 	XLogRecPtr	candidate_restart_valid;
 	XLogRecPtr	candidate_restart_lsn;
+
+	/*
+	 * This value tracks the last confirmed_flush LSN flushed which is used
+	 * during a shutdown checkpoint to decide if logical's slot data should be
+	 * forcibly flushed or not.
+	 */
+	XLogRecPtr	last_saved_confirmed_flush;
 } ReplicationSlot;
 
 #define SlotIsPhysical(slot) ((slot)->data.database == InvalidOid)
@@ -241,7 +248,7 @@ extern void ReplicationSlotNameForTablesync(Oid suboid, Oid relid, char *syncslo
 extern void ReplicationSlotDropAtPubNode(WalReceiverConn *wrconn, char *slotname, bool missing_ok);
 
 extern void StartupReplicationSlots(void);
-extern void CheckPointReplicationSlots(void);
+extern void CheckPointReplicationSlots(bool is_shutdown);
 
 extern void CheckSlotRequirements(void);
 extern void CheckSlotPermissions(void);
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index e7328e4894..646d6ffde4 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -43,6 +43,7 @@ tests += {
       't/035_standby_logical_decoding.pl',
       't/036_truncated_dropped.pl',
       't/037_invalid_database.pl',
+      't/038_save_logical_slots_shutdown.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/038_save_logical_slots_shutdown.pl b/src/test/recovery/t/038_save_logical_slots_shutdown.pl
new file mode 100644
index 0000000000..de19829560
--- /dev/null
+++ b/src/test/recovery/t/038_save_logical_slots_shutdown.pl
@@ -0,0 +1,102 @@
+
+# Copyright (c) 2023, PostgreSQL Global Development Group
+
+# Test logical replication slots are always flushed to disk during a shutdown
+# checkpoint.
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+sub compare_confirmed_flush
+{
+	my ($node, $confirmed_flush_from_log) = @_;
+
+	# Fetch Latest checkpoint location from the control file
+	my ($stdout, $stderr) =
+	  run_command([ 'pg_controldata', $node->data_dir ]);
+	my @control_data = split("\n", $stdout);
+	my $latest_checkpoint = undef;
+	foreach (@control_data)
+	{
+		if ($_ =~ /^Latest checkpoint location:\s*(.*)$/mg)
+		{
+			$latest_checkpoint = $1;
+			last;
+		}
+	}
+	die "Latest checkpoint location not found in control file\n"
+	  unless defined($latest_checkpoint);
+
+	# Is it same as the value read from log?
+	ok( $latest_checkpoint eq $confirmed_flush_from_log,
+		"Check that the slot's confirmed_flush LSN is the same as the latest_checkpoint location"
+	);
+
+	return;
+}
+
+# Initialize publisher node
+my $node_publisher = PostgreSQL::Test::Cluster->new('pub');
+$node_publisher->init(allows_streaming => 'logical');
+# Avoid checkpoint during the test, otherwise, the latest checkpoint location
+# will change.
+$node_publisher->append_conf(
+	'postgresql.conf', q{
+checkpoint_timeout = 1h
+autovacuum = off
+});
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = PostgreSQL::Test::Cluster->new('sub');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create tables
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tbl (id int)");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tbl (id int)");
+
+# Insert some data
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tbl VALUES (generate_series(1, 5));");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION pub FOR ALL TABLES");
+$node_subscriber->safe_psql('postgres',
+	"CREATE SUBSCRIPTION sub CONNECTION '$publisher_connstr' PUBLICATION pub"
+);
+
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'sub');
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tbl");
+
+is($result, qq(5), "check initial copy was done");
+
+my $offset = -s $node_publisher->logfile;
+
+# Restart the publisher to ensure that the slot will be flushed if required
+$node_publisher->restart();
+
+# Wait until the walsender creates decoding context
+$node_publisher->wait_for_log(
+	qr/Streaming transactions committing after ([A-F0-9]+\/[A-F0-9]+), reading WAL from ([A-F0-9]+\/[A-F0-9]+)./,
+	$offset);
+
+# Extract confirmed_flush from the logfile
+my $log_contents = slurp_file($node_publisher->logfile, $offset);
+$log_contents =~
+  qr/Streaming transactions committing after ([A-F0-9]+\/[A-F0-9]+), reading WAL from ([A-F0-9]+\/[A-F0-9]+)./
+  or die "could not get confirmed_flush_lsn";
+
+# Ensure that the slot's confirmed_flush LSN is the same as the
+# latest_checkpoint location.
+compare_confirmed_flush($node_publisher, $1);
+
+done_testing();
-- 
2.30.0.windows.2

#67

michael@paquier.xyz

over 2 years ago

In reply to: Amit Kapila (#66)

Re: persist logical slots to disk during shutdown checkpoint

On Wed, Sep 13, 2023 at 04:20:37PM +0530, Amit Kapila wrote:

The patch is updated as per recent discussion.

WFM. Thanks for the updated version.
--
Michael

#68

amit.kapila16@gmail.com

over 2 years ago

In reply to: Michael Paquier (#67)

Re: persist logical slots to disk during shutdown checkpoint

On Thu, Sep 14, 2023 at 7:20 AM Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Sep 13, 2023 at 04:20:37PM +0530, Amit Kapila wrote:

The patch is updated as per recent discussion.

WFM. Thanks for the updated version.

Pushed.

--
With Regards,
Amit Kapila.

#69

ashutosh.bapat.oss@gmail.com

over 2 years ago

In reply to: Amit Kapila (#68)

Re: persist logical slots to disk during shutdown checkpoint

On Thu, Sep 14, 2023 at 2:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Sep 14, 2023 at 7:20 AM Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Sep 13, 2023 at 04:20:37PM +0530, Amit Kapila wrote:

The patch is updated as per recent discussion.

WFM. Thanks for the updated version.

Pushed.

Commitfest entry "https://commitfest.postgresql.org/44/4536/ is in
"Ready for committer" state. Is there something remaining here? We
should probably set it as "committed".

--
Best Wishes,
Ashutosh Bapat

#70