Doc: fix the note related to the GUC "synchronized_standby_slots"

Started by Noname over 1 year ago, 14 messages
#1Noname
Masahiro.Ikeda@nttdata.com
2 attachment(s)

Hi,

When I read the following documentation related to the "synchronized_standby_slots", I misunderstood that data loss would not occur in the case of synchronous physical replication. However, this is incorrect (see reproduce.txt).

Note that in the case of asynchronous replication, there remains a risk of data loss for transactions committed on the former primary server but have yet to be replicated to the new primary server.

https://www.postgresql.org/docs/17/logical-replication-failover.html

Am I missing something? IIUC, could you change the documentation as suggested in the attached patch? I also believe it would be better to move the sentence to the next paragraph because the note is related to "synchronized_standby_slots".

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachments:

v1-0001-fix-documentation-related-to-synchronized_standby.patch (application/octet-stream)
From b48c68914b687c150447e8e2c382374d754a20b5 Mon Sep 17 00:00:00 2001
From: Masahiro Ikeda <Masahiro.Ikeda@nttdata.com>
Date: Mon, 26 Aug 2024 16:42:40 +0900
Subject: [PATCH v1] fix documentation related to synchronized_standby_slots

---
 doc/src/sgml/logical-replication.sgml | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/doc/src/sgml/logical-replication.sgml b/doc/src/sgml/logical-replication.sgml
index bee7e02983b..a355ad34275 100644
--- a/doc/src/sgml/logical-replication.sgml
+++ b/doc/src/sgml/logical-replication.sgml
@@ -701,18 +701,17 @@ ALTER SUBSCRIPTION
    <link linkend="sql-createsubscription-params-with-failover"><literal>failover</literal></link>
    parameter ensures a seamless transition of those subscriptions after the
    standby is promoted. They can continue subscribing to publications on the
-   new primary server without losing data. Note that in the case of
-   asynchronous replication, there remains a risk of data loss for transactions
-   committed on the former primary server but have yet to be replicated to the new
-   primary server.
+   new primary server without losing data.
   </para>
 
   <para>
-   Because the slot synchronization logic copies asynchronously, it is
-   necessary to confirm that replication slots have been synced to the standby
-   server before the failover happens. To ensure a successful failover, the
-   standby server must be ahead of the subscriber. This can be achieved by
-   configuring
+   Note that there remains a risk of data loss for transactions committed on the
+   former primary server but have yet to be replicated to the new primary server even
+   in the case of synchronous physical replication. Because the slot synchronization
+   logic copies asynchronously, it is necessary to confirm that replication slots
+   have been synced to the standby server before the failover happens. To ensure a
+   successful failover, the standby server must be ahead of the subscriber. This
+   can be achieved by configuring
    <link linkend="guc-synchronized-standby-slots"><varname>synchronized_standby_slots</varname></link>.
   </para>
 
-- 
2.34.1

reproduce.txt (text/plain)
#2Amit Kapila
amit.kapila16@gmail.com
In reply to: Noname (#1)
Re: Doc: fix the note related to the GUC "synchronized_standby_slots"

On Mon, Aug 26, 2024 at 1:30 PM <Masahiro.Ikeda@nttdata.com> wrote:

When I read the following documentation related to the "synchronized_standby_slots", I misunderstood that data loss would not occur in the case of synchronous physical replication. However, this is incorrect (see reproduce.txt).

Note that in the case of asynchronous replication, there remains a risk of data loss for transactions committed on the former primary server but have yet to be replicated to the new primary server.

https://www.postgresql.org/docs/17/logical-replication-failover.html

Am I missing something?

It seems part of the paragraph: "Note that in the case of asynchronous
replication, there remains a risk of data loss for transactions
committed on the former primary server but have yet to be replicated
to the new primary server." is a bit confusing. Will it make things
clear to me if we remove that part?

I am keeping a few others involved in this feature development in Cc.

--
With Regards,
Amit Kapila.

#3Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Amit Kapila (#2)
RE: Doc: fix the note related to the GUC "synchronized_standby_slots"

On Monday, August 26, 2024 5:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Aug 26, 2024 at 1:30 PM <Masahiro.Ikeda@nttdata.com> wrote:

When I read the following documentation related to the "synchronized_standby_slots", I misunderstood that data loss would not occur in the case of synchronous physical replication. However, this is incorrect (see reproduce.txt).

Note that in the case of asynchronous replication, there remains a risk of data loss for transactions committed on the former primary server but have yet to be replicated to the new primary server.

https://www.postgresql.org/docs/17/logical-replication-failover.html

Am I missing something?

It seems part of the paragraph: "Note that in the case of asynchronous
replication, there remains a risk of data loss for transactions committed on the
former primary server but have yet to be replicated to the new primary server." is
a bit confusing. Will it make things clear to me if we remove that part?

I think the intention is to address a complaint[1] that the data inserted on the
primary after the primary disconnects from the standby is still lost after
failover. But after rethinking, maybe it doesn't directly belong to the topic in
the logical failover section because it's a general fact for async replication.
If we think it matters, maybe we can remove this part and slightly modify
another part:

   parameter ensures a seamless transition of those subscriptions after the
   standby is promoted. They can continue subscribing to publications on the
-   new primary server without losing data.
+   new primary server without losing data that has already been replicated and
+   flushed on the standby server.

[1]: /messages/by-id/ZfRe2+OxMS0kvNvx@ip-10-97-1-34.eu-west-3.compute.internal

Best Regards,
Hou zj

#4Amit Kapila
amit.kapila16@gmail.com
In reply to: Noname (#1)
Re: Doc: fix the note related to the GUC "synchronized_standby_slots"

On Mon, Aug 26, 2024 at 1:30 PM <Masahiro.Ikeda@nttdata.com> wrote:

When I read the following documentation related to the "synchronized_standby_slots", I misunderstood that data loss would not occur in the case of synchronous physical replication. However, this is incorrect (see reproduce.txt).

I think you see such a behavior because you have disabled
'synchronized_standby_slots' in your script (# disable
"synchronized_standby_slots"). You need to enable that to avoid data
loss. Considering that, I don't think your proposed text is an
improvement.

--
With Regards,
Amit Kapila.

#5Amit Kapila
amit.kapila16@gmail.com
In reply to: Zhijie Hou (Fujitsu) (#3)
Re: Doc: fix the note related to the GUC "synchronized_standby_slots"

On Mon, Aug 26, 2024 at 6:38 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Monday, August 26, 2024 5:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Aug 26, 2024 at 1:30 PM <Masahiro.Ikeda@nttdata.com> wrote:

When I read the following documentation related to the "synchronized_standby_slots", I misunderstood that data loss would not occur in the case of synchronous physical replication. However, this is incorrect (see reproduce.txt).

Note that in the case of asynchronous replication, there remains a risk of data loss for transactions committed on the former primary server but have yet to be replicated to the new primary server.

https://www.postgresql.org/docs/17/logical-replication-failover.html

Am I missing something?

It seems part of the paragraph: "Note that in the case of asynchronous
replication, there remains a risk of data loss for transactions committed on the
former primary server but have yet to be replicated to the new primary server." is
a bit confusing. Will it make things clear to me if we remove that part?

I think the intention is to address a complaint[1] that the data inserted on the
primary after the primary disconnects from the standby is still lost after
failover. But after rethinking, maybe it doesn't directly belong to the topic in
the logical failover section because it's a general fact for async replication.
If we think it matters, maybe we can remove this part and slightly modify
another part:

parameter ensures a seamless transition of those subscriptions after the
standby is promoted. They can continue subscribing to publications on the
-   new primary server without losing data.
+   new primary server without losing data that has already been replicated and
+   flushed on the standby server.

Yeah, we can change that way but not sure if that satisfies the OP's
concern. I am waiting for his response.

--
With Regards,
Amit Kapila.

#6David G. Johnston
david.g.johnston@gmail.com
In reply to: Amit Kapila (#5)
Re: Doc: fix the note related to the GUC "synchronized_standby_slots"

On Monday, August 26, 2024, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Aug 26, 2024 at 6:38 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Monday, August 26, 2024 5:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Aug 26, 2024 at 1:30 PM <Masahiro.Ikeda@nttdata.com> wrote:

When I read the following documentation related to the "synchronized_standby_slots", I misunderstood that data loss would not occur in the case of synchronous physical replication. However, this is incorrect (see reproduce.txt).

Note that in the case of asynchronous replication, there remains a risk of data loss for transactions committed on the former primary server but have yet to be replicated to the new primary server.

https://www.postgresql.org/docs/17/logical-replication-failover.html

Am I missing something?

It seems part of the paragraph: "Note that in the case of asynchronous
replication, there remains a risk of data loss for transactions committed on the
former primary server but have yet to be replicated to the new primary server." is
a bit confusing. Will it make things clear to me if we remove that part?

I think the intention is to address a complaint[1] that the data inserted on the
primary after the primary disconnects from the standby is still lost after
failover. But after rethinking, maybe it doesn't directly belong to the topic in
the logical failover section because it's a general fact for async replication.
If we think it matters, maybe we can remove this part and slightly modify
another part:

parameter ensures a seamless transition of those subscriptions after the
standby is promoted. They can continue subscribing to publications on the
-   new primary server without losing data.
+   new primary server without losing data that has already been replicated and
+   flushed on the standby server.

Yeah, we can change that way but not sure if that satisfies the OP's
concern. I am waiting for his response.

I’d suggest getting rid of all mention of “without losing data” and just
emphasize the fact that the subscribers can operate in a hot-standby
publishing environment in an automated fashion by connecting using
“failover” enabled slots, assuming the publishing group prevents any
changes from propagating to any logical subscriber until all standbys in
the group have been updated. Whether or not the primary-standby group is
resilient in the face of failure during internal group synchronization is
out of the hands of logical subscribers - rather they are only guaranteed
to see a consistent linear history of activity coming out of the publishing
group. Specifically, if the group synchronizes asynchronously there is no
guarantee that every committed transaction on the primary makes its way
through to the logical subscriber if a slot failover happens. But at the
same time its view of the world will be consistent with the newly chosen
primary.

David J.

#7Noname
Masahiro.Ikeda@nttdata.com
In reply to: Amit Kapila (#4)
RE: Doc: fix the note related to the GUC "synchronized_standby_slots"

Thanks for your responses.

I think you see such a behavior because you have disabled 'synchronized_standby_slots'
in your script (# disable "synchronized_standby_slots"). You need to enable that to
avoid data loss. Considering that, I don't think your proposed text is an improvement.

Yes, I know.

As David said, "without losing data" confuses me, because there are at least three patterns in which users
may think that data was lost (there may be other cases as well).

Pattern 1. Data for which clients received a commit response from the old primary, but which the new primary doesn't have (in the case of asynchronous replication).
-> We can avoid this with synchronous replication. This is not relevant to the failover feature.

Pattern 2. Data that the new primary has, but the subscribers don't have.
-> We can avoid this with the failover feature.

Pattern 3. Data that the subscribers have, but the new primary doesn't have.
-> We can avoid this with the 'synchronized_standby_slots' parameter.
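
The three patterns can be illustrated with a toy model. This is only a sketch, not PostgreSQL code: the function name `classify` and the idea of representing each server's state as a set of transaction IDs are assumptions made purely for illustration.

```python
def classify(acked_by_old_primary, new_primary, subscriber):
    """Sort transactions into the three 'data loss' patterns.

    Each argument is a set of hypothetical transaction IDs:
    - acked_by_old_primary: commits the old primary acknowledged to clients
    - new_primary: commits present on the promoted standby
    - subscriber: commits applied on the logical subscriber
    """
    return {
        # Pattern 1: acknowledged to clients, but missing on the new primary.
        "pattern1": acked_by_old_primary - new_primary,
        # Pattern 2: on the new primary, but missing on the subscriber.
        "pattern2": new_primary - subscriber,
        # Pattern 3: on the subscriber, but missing on the new primary.
        "pattern3": subscriber - new_primary,
    }


# Made-up state at the moment of failover.
result = classify(
    acked_by_old_primary={1, 2, 3, 4},
    new_primary={1, 2},
    subscriber={1, 3},
)
```

In this made-up state, transaction 3 falls under both pattern 1 and pattern 3: it never reached the new primary, yet the subscriber already applied it.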

Currently, I understand that the following documentation says:
* with the failover feature, subscribers can continue subscribing to publications without losing pattern 2 data.
* pattern 1 data may be lost if you use asynchronous replication.
* pattern 3 is not mentioned at all, which is the point I misunderstood.

They can continue subscribing to publications on the new primary server without losing data.
Note that in the case of asynchronous replication, there remains a risk of data loss for transactions
committed on the former primary server but have yet to be replicated to the new primary server

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

#8Amit Kapila
amit.kapila16@gmail.com
In reply to: Noname (#7)
Re: Doc: fix the note related to the GUC "synchronized_standby_slots"

On Tue, Aug 27, 2024 at 10:18 AM <Masahiro.Ikeda@nttdata.com> wrote:

I think you see such a behavior because you have disabled 'synchronized_standby_slots'
in your script (# disable "synchronized_standby_slots"). You need to enable that to
avoid data loss. Considering that, I don't think your proposed text is an improvement.

Yes, I know.

As David said, "without losing data" makes me confused because there are three patterns that users
think the data was lost though there may be other cases.

So, will it be okay if we just remove ".. without losing data" from
the sentence? Will that avoid the confusion you have?

With Regards,
Amit Kapila.

#9Noname
Masahiro.Ikeda@nttdata.com
In reply to: Amit Kapila (#8)
RE: Doc: fix the note related to the GUC "synchronized_standby_slots"

So, will it be okay if we just remove ".. without losing data" from the sentence? Will that
avoid the confusion you have?

Yes. Additionally, it would be better to add notes about data consistency after failover, for example:

Note that data consistency after failover can vary depending on the configurations. If
"synchronized_standby_slots" is not configured, there may be data that only the subscribers hold,
even though the new primary does not. Additionally, in the case of asynchronous physical replication,
there remains a risk of data loss for transactions committed on the former primary server
but have yet to be replicated to the new primary server.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

#10Amit Kapila
amit.kapila16@gmail.com
In reply to: Noname (#9)
Re: Doc: fix the note related to the GUC "synchronized_standby_slots"

On Tue, Aug 27, 2024 at 3:05 PM <Masahiro.Ikeda@nttdata.com> wrote:

So, will it be okay if we just remove ".. without losing data" from the sentence? Will that
avoid the confusion you have?

Yes. Additionally, it would be better to add notes about data consistency after failover for example

Note that data consistency after failover can vary depending on the configurations. If
"synchronized_standby_slots" is not configured, there may be data that only the subscribers hold,
even though the new primary does not.

This part can be inferred from the description of
synchronized_standby_slots [1] (See: This guarantees that logical
replication failover slots do not consume changes until those changes
are received and flushed to corresponding physical standbys. If a
logical replication connection is meant to switch to a physical
standby after the standby is promoted, the physical replication slot
for the standby should be listed here.)
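
The guarantee quoted above can be pictured with a small model. This is a sketch under stated assumptions, not how the server implements it: LSNs are plain integers, and `sendable_upto` and the slot names are hypothetical names chosen for illustration. The no-gating branch for an empty list is likewise an assumption of the model.

```python
def sendable_upto(primary_flush_lsn, standby_flush_lsns):
    """Toy model of gating logical failover slots on physical standbys.

    primary_flush_lsn: int, how far WAL has been flushed on the primary.
    standby_flush_lsns: dict mapping a hypothetical physical slot name
        (as it would be listed in synchronized_standby_slots) to the LSN
        that standby has confirmed as received and flushed.

    Returns the highest LSN a failover-enabled logical slot may send.
    """
    if not standby_flush_lsns:
        # Nothing listed: in this model the logical slot is not held back.
        return primary_flush_lsn
    # Otherwise changes are held until the slowest listed standby has them.
    return min(primary_flush_lsn, min(standby_flush_lsns.values()))


ungated = sendable_upto(500, {})
gated = sendable_upto(500, {"sb1_slot": 420, "sb2_slot": 480})
```

In the model, `gated` is limited by the slowest listed standby, so a subscriber cannot receive changes that a listed standby has not yet flushed.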

Additionally, in the case of asynchronous physical replication,
there remains a risk of data loss for transactions committed on the former primary server
but have yet to be replicated to the new primary server.

This has nothing to do with failover slots. This is a known behavior
of asynchronous replication, so adding here doesn't make much sense.

In general, adding more information unrelated to failover slots can
confuse users.

[1]: https://www.postgresql.org/docs/17/runtime-config-replication.html#GUC-SYNCHRONIZED-STANDBY-SLOTS

--
With Regards,
Amit Kapila.

#11Noname
Masahiro.Ikeda@nttdata.com
In reply to: Amit Kapila (#10)
RE: Doc: fix the note related to the GUC "synchronized_standby_slots"

So, will it be okay if we just remove ".. without losing data" from
the sentence? Will that avoid the confusion you have?

Yes. Additionally, it would be better to add notes about data
consistency after failover for example

Note that data consistency after failover can vary depending on the
configurations. If "synchronized_standby_slots" is not configured,
there may be data that only the subscribers hold, even though the new primary does not.

This part can be inferred from the description of synchronized_standby_slots [1] (See:
This guarantees that logical replication failover slots do not consume changes until those
changes are received and flushed to corresponding physical standbys. If a logical
replication connection is meant to switch to a physical standby after the standby is
promoted, the physical replication slot for the standby should be listed here.)

OK, it's enough for me to just remove ".. without losing data".

Additionally, in the case of asynchronous physical replication,
there remains a risk of data loss for transactions committed on the
former primary server but have yet to be replicated to the new primary server.

This has nothing to do with failover slots. This is a known behavior of asynchronous
replication, so adding here doesn't make much sense.

In general, adding more information unrelated to failover slots can confuse users.

OK, I agree to remove the sentence.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

#12Amit Kapila
amit.kapila16@gmail.com
In reply to: Noname (#11)
1 attachment(s)
Re: Doc: fix the note related to the GUC "synchronized_standby_slots"

On Wed, Aug 28, 2024 at 6:16 AM <Masahiro.Ikeda@nttdata.com> wrote:

So, will it be okay if we just remove ".. without losing data" from
the sentence? Will that avoid the confusion you have?

Yes. Additionally, it would be better to add notes about data
consistency after failover for example

Note that data consistency after failover can vary depending on the
configurations. If "synchronized_standby_slots" is not configured,
there may be data that only the subscribers hold, even though the new primary does not.

This part can be inferred from the description of synchronized_standby_slots [1] (See:
This guarantees that logical replication failover slots do not consume changes until those
changes are received and flushed to corresponding physical standbys. If a logical
replication connection is meant to switch to a physical standby after the standby is
promoted, the physical replication slot for the standby should be listed here.)

OK, it's enough for me just remove ".. without losing data".

The next line related to asynchronous replication is also not
required. See attached.

--
With Regards,
Amit Kapila.

Attachments:

fix_doc_1.patch (application/octet-stream)
diff --git a/doc/src/sgml/logical-replication.sgml b/doc/src/sgml/logical-replication.sgml
index bee7e02983..94c3ad7376 100644
--- a/doc/src/sgml/logical-replication.sgml
+++ b/doc/src/sgml/logical-replication.sgml
@@ -701,10 +701,7 @@ ALTER SUBSCRIPTION
    <link linkend="sql-createsubscription-params-with-failover"><literal>failover</literal></link>
    parameter ensures a seamless transition of those subscriptions after the
    standby is promoted. They can continue subscribing to publications on the
-   new primary server without losing data. Note that in the case of
-   asynchronous replication, there remains a risk of data loss for transactions
-   committed on the former primary server but have yet to be replicated to the new
-   primary server.
+   new primary server.
   </para>
 
   <para>
#13Noname
Masahiro.Ikeda@nttdata.com
In reply to: Amit Kapila (#12)
1 attachment(s)
RE: Doc: fix the note related to the GUC "synchronized_standby_slots"

So, will it be okay if we just remove ".. without losing data"
from the sentence? Will that avoid the confusion you have?

Yes. Additionally, it would be better to add notes about data
consistency after failover for example

Note that data consistency after failover can vary depending on
the configurations. If "synchronized_standby_slots" is not
configured, there may be data that only the subscribers hold, even
though the new primary does not.

This part can be inferred from the description of synchronized_standby_slots [1]
(See: This guarantees that logical replication failover slots do not
consume changes until those changes are received and flushed to
corresponding physical standbys. If a logical replication connection
is meant to switch to a physical standby after the standby is
promoted, the physical replication slot for the standby should be
listed here.)

OK, it's enough for me just remove ".. without losing data".

The next line related to asynchronous replication is also not required. See attached.

Thanks, I found another ".. without losing data".

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachments:

fix_doc_2.patch (application/octet-stream)
diff --git a/doc/src/sgml/logical-replication.sgml b/doc/src/sgml/logical-replication.sgml
index bee7e02983b..bc095d01c00 100644
--- a/doc/src/sgml/logical-replication.sgml
+++ b/doc/src/sgml/logical-replication.sgml
@@ -701,10 +701,7 @@ ALTER SUBSCRIPTION
    <link linkend="sql-createsubscription-params-with-failover"><literal>failover</literal></link>
    parameter ensures a seamless transition of those subscriptions after the
    standby is promoted. They can continue subscribing to publications on the
-   new primary server without losing data. Note that in the case of
-   asynchronous replication, there remains a risk of data loss for transactions
-   committed on the former primary server but have yet to be replicated to the new
-   primary server.
+   new primary server.
   </para>
 
   <para>
@@ -791,7 +788,7 @@ test_standby=# SELECT slot_name, (synced AND NOT temporary AND NOT conflicting)
    If all the slots are present on the standby server and the result
    (<literal>failover_ready</literal>) of the above SQL query is true, then
    existing subscriptions can continue subscribing to publications now on the
-   new primary server without losing data.
+   new primary server.
   </para>
 
  </sect1>
#14Amit Kapila
amit.kapila16@gmail.com
In reply to: Noname (#13)
Re: Doc: fix the note related to the GUC "synchronized_standby_slots"

On Wed, Aug 28, 2024 at 3:02 PM <Masahiro.Ikeda@nttdata.com> wrote:

The next line related to asynchronous replication is also not required. See attached.

Thanks, I found another ".. without losing data".

I'll push this tomorrow unless there are any other suggestions on this patch.

--
With Regards,
Amit Kapila.