Documentation update of wal_retrieve_retry_interval to mention table sync worker

Started by vignesh C about 1 year ago, 7 messages

vignesh21@gmail.com

about 1 year ago

1 attachment(s)

Hi,

Currently, we restart the table synchronization worker after the
duration specified by wal_retrieve_retry_interval following the last
failure. While this behavior is documented for apply workers, it is
not mentioned for table synchronization workers. I believe this detail
should be included in the documentation for table synchronization
workers as well. Attached is a patch to address this omission.

Regards,
Vignesh

Attachments:

doc_update_wal_retrieve_retry_interval_config.patch (text/x-patch; charset=US-ASCII)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index fbdd6ce574..93ad17c529 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5094,7 +5094,8 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
        </para>
        <para>
         In logical replication, this parameter also limits how often a failing
-        replication apply worker will be respawned.
+        replication apply worker, and table synchronization worker will be
+        respawned.
        </para>
       </listitem>
      </varlistentry>

Peter Smith

smithpb2250@gmail.com

about 1 year ago

In reply to: vignesh C (#1)

Re: Documentation update of wal_retrieve_retry_interval to mention table sync worker

On Thu, Dec 26, 2024 at 1:37 AM vignesh C <vignesh21@gmail.com> wrote:

> Hi,
>
> Currently, we restart the table synchronization worker after the
> duration specified by wal_retrieve_retry_interval following the last
> failure. While this behavior is documented for apply workers, it is
> not mentioned for table synchronization workers. I believe this detail
> should be included in the documentation for table synchronization
> workers as well. Attached is a patch to address this omission.
>
> Regards,
> Vignesh

Hi Vignesh,

Here are some review comments for your v1 patch.

+1 to enhance the documentation.

======

1.
        <para>
         In logical replication, this parameter also limits how often a failing
-        replication apply worker will be respawned.
+        replication apply worker, and table synchronization worker will be
+        respawned.
        </para>

/, and/or/

SUGGESTION
In logical replication, this parameter also limits how often a failing
replication apply worker or table synchronization worker will be
respawned.

======

2.
I think the reader might never be aware of any of this (throttled
relaunch) behaviour unless they accidentally stumble across the docs
for this GUC, so IMO this information should be mentioned elsewhere --
wherever the tablesync worker errors are documented. But, TBH, I can't
find anywhere in the PostgreSQL docs where it even mentions
re-launching failed tablesync workers!

Anyway, I think it might be good to include such information in some
suitable place (maybe in the CREATE SUBSCRIPTION notes? or maybe in
Chapter 29?) to say something like...

SUGGESTION:
In practice, if a table synchronization worker fails during logical
replication, the apply worker detects the failure and attempts to
respawn the table synchronization worker to continue the
synchronization process. This behaviour ensures that transient errors
do not permanently disrupt the replication setup. See also
wal_retrieve_retry_interval.

======
Kind Regards,
Peter Smith.
Fujitsu Australia

vignesh C

vignesh21@gmail.com

about 1 year ago

In reply to: Peter Smith (#2)

1 attachment(s)

Re: Documentation update of wal_retrieve_retry_interval to mention table sync worker

On Tue, 31 Dec 2024 at 02:48, Peter Smith <smithpb2250@gmail.com> wrote:

> On Thu, Dec 26, 2024 at 1:37 AM vignesh C <vignesh21@gmail.com> wrote:
>
> > Hi,
> >
> > Currently, we restart the table synchronization worker after the
> > duration specified by wal_retrieve_retry_interval following the last
> > failure. While this behavior is documented for apply workers, it is
> > not mentioned for table synchronization workers. I believe this detail
> > should be included in the documentation for table synchronization
> > workers as well. Attached is a patch to address this omission.
> >
> > Regards,
> > Vignesh
>
> Hi Vignesh,
>
> Here are some review comments for your v1 patch.
>
> +1 to enhance the documentation.
>
> ======
> 1.
> <para>
> In logical replication, this parameter also limits how often a failing
> -        replication apply worker will be respawned.
> +        replication apply worker, and table synchronization worker will be
> +        respawned.
> </para>
> /, and/or/
>
> SUGGESTION
> In logical replication, this parameter also limits how often a failing
> replication apply worker or table synchronization worker will be
> respawned.

Modified

> ======
>
> 2.
> I think the reader might never be aware of any of this (throttled
> relaunch) behaviour unless they accidentally stumble across the docs
> for this GUC, so IMO this information should be mentioned elsewhere --
> wherever the tablesync worker errors are documented. But, TBH, I can't
> find anywhere in the PostgreSQL docs where it even mentions
> re-launching failed tablesync workers!
>
> Anyway, I think it might be good to include such information in some
> suitable place (maybe in the CREATE SUBSCRIPTION notes? or maybe in
> Chapter 29?) to say something like...
>
> SUGGESTION:
> In practice, if a table synchronization worker fails during logical
> replication, the apply worker detects the failure and attempts to
> respawn the table synchronization worker to continue the
> synchronization process. This behaviour ensures that transient errors
> do not permanently disrupt the replication setup. See also
> wal_retrieve_retry_interval.

Yes, adding it to logical replication Initial Snapshot seemed more
appropriate to me.

The attached v2 version patch has the changes for the same.

Regards,
Vignesh

Attachments:

v2_doc_update_wal_retrieve_retry_interval_config.patch (text/x-patch; charset=US-ASCII)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index fbdd6ce574..b58c7f25f7 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5094,7 +5094,8 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
        </para>
        <para>
         In logical replication, this parameter also limits how often a failing
-        replication apply worker will be respawned.
+        replication apply worker or table synchronization worker will be
+        respawned.
        </para>
       </listitem>
      </varlistentry>
diff --git a/doc/src/sgml/logical-replication.sgml b/doc/src/sgml/logical-replication.sgml
index 8290cd1a08..925e0dd101 100644
--- a/doc/src/sgml/logical-replication.sgml
+++ b/doc/src/sgml/logical-replication.sgml
@@ -1993,18 +1993,17 @@ CONTEXT:  processing remote data for replication origin "pg_16395" during "INSER
     <title>Initial Snapshot</title>
     <para>
      The initial data in existing subscribed tables are snapshotted and
-     copied in a parallel instance of a special kind of apply process.
-     This process will create its own replication slot and copy the existing
-     data.  As soon as the copy is finished the table contents will become
-     visible to other backends.  Once existing data is copied, the worker
-     enters synchronization mode, which ensures that the table is brought
-     up to a synchronized state with the main apply process by streaming
-     any changes that happened during the initial data copy using standard
-     logical replication.  During this synchronization phase, the changes
-     are applied and committed in the same order as they happened on the
-     publisher.  Once synchronization is done, control of the
-     replication of the table is given back to the main apply process where
-     replication continues as normal.
+     copied in a parallel instance of a special kind of table synchronization
+     worker process. This process will create its own replication slot and copy
+     the existing data.  As soon as the copy is finished the table contents will
+     become visible to other backends.  Once existing data is copied, the worker
+     enters synchronization mode, which ensures that the table is brought up to
+     a synchronized state with the main apply process by streaming any changes
+     that happened during the initial data copy using standard logical
+     replication.  During this synchronization phase, the changes are applied
+     and committed in the same order as they happened on the publisher.  Once
+     synchronization is done, control of the replication of the table is given
+     back to the main apply process where replication continues as normal.
     </para>
     <note>
      <para>
@@ -2015,6 +2014,15 @@ CONTEXT:  processing remote data for replication origin "pg_16395" during "INSER
       when copying the existing table data.
      </para>
     </note>
+    <note>
+     <para>
+      If a table synchronization worker fails during copy, the apply worker
+      detects the failure and respawns the table synchronization worker to
+      continue the synchronization process. This behaviour ensures that
+      transient errors do not permanently disrupt the replication setup. See
+      also <link linkend="guc-wal-retrieve-retry-interval"><varname>wal_retrieve_retry_interval</varname></link>.
+     </para>
+    </note>
   </sect2>
  </sect1>

Peter Smith

smithpb2250@gmail.com

about 1 year ago

In reply to: vignesh C (#3)

1 attachment(s)

Re: Documentation update of wal_retrieve_retry_interval to mention table sync worker

Hi Vignesh,

Some review comments for your v2 patch.

======
doc/src/sgml/logical-replication.sgml

1.
     <para>
      The initial data in existing subscribed tables are snapshotted and
-     copied in a parallel instance of a special kind of apply process.
-     This process will create its own replication slot and copy the existing
-     data.  As soon as the copy is finished the table contents will become
-     visible to other backends.  Once existing data is copied, the worker
-     enters synchronization mode, which ensures that the table is brought
-     up to a synchronized state with the main apply process by streaming
-     any changes that happened during the initial data copy using standard
-     logical replication.  During this synchronization phase, the changes
-     are applied and committed in the same order as they happened on the
-     publisher.  Once synchronization is done, control of the
-     replication of the table is given back to the main apply process where
-     replication continues as normal.
+     copied in a parallel instance of a special kind of table synchronization
+     worker process. This process will create its own replication slot and copy
+     the existing data.  As soon as the copy is finished the table contents will
+     become visible to other backends.  Once existing data is copied, the worker
+     enters synchronization mode, which ensures that the table is brought up to
+     a synchronized state with the main apply process by streaming any changes
+     that happened during the initial data copy using standard logical
+     replication.  During this synchronization phase, the changes are applied
+     and committed in the same order as they happened on the publisher.  Once
+     synchronization is done, control of the replication of the table is given
+     back to the main apply process where replication continues as normal.
     </para>

AFAICT the only difference you made is changing:
FROM "a special kind of apply process"
TO "a special kind of table synchronization worker process".

There is only ONE kind of tablesync process, so I think saying "a
special kind of table synchronization worker process" seems
misleading. I also thought maybe it is better to mention that this is
PER table.

SUGGESTION:
... a special table synchronization worker process per table.

(e.g. please see attached diff)

======
Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

PS_NITPICKS_V2.txt (text/plain; charset=US-ASCII)

diff --git a/doc/src/sgml/logical-replication.sgml b/doc/src/sgml/logical-replication.sgml
index 925e0dd..3b01694 100644
--- a/doc/src/sgml/logical-replication.sgml
+++ b/doc/src/sgml/logical-replication.sgml
@@ -1993,8 +1993,8 @@ CONTEXT:  processing remote data for replication origin "pg_16395" during "INSER
     <title>Initial Snapshot</title>
     <para>
      The initial data in existing subscribed tables are snapshotted and
-     copied in a parallel instance of a special kind of table synchronization
-     worker process. This process will create its own replication slot and copy
+     copied in a parallel instance of a special table synchronization
+     worker process per table. This process will create its own replication slot and copy
      the existing data.  As soon as the copy is finished the table contents will
      become visible to other backends.  Once existing data is copied, the worker
      enters synchronization mode, which ensures that the table is brought up to

vignesh C

vignesh21@gmail.com

about 1 year ago

In reply to: Peter Smith (#4)

1 attachment(s)

Re: Documentation update of wal_retrieve_retry_interval to mention table sync worker

On Mon, 6 Jan 2025 at 08:47, Peter Smith <smithpb2250@gmail.com> wrote:

> Hi Vignesh,
>
> Some review comments for your v2 patch.
>
> ======
> doc/src/sgml/logical-replication.sgml
>
> AFAICT the only difference you made is changing:
> FROM "a special kind of apply process"
> TO "a special kind of table synchronization worker process".
>
> There is only ONE kind of tablesync process, so I think saying "a
> special kind of table synchronization worker process" seems
> misleading. I also thought maybe it is better to mention that this is
> PER table.
>
> SUGGESTION:
> ... a special table synchronization worker process per table.

Thanks, the updated v3 version patch has the changes for the same.

Regards,
Vignesh

Attachments:

v3-0001-Improve-documentation-on-table-synchronization-wo.patch (text/x-patch; charset=US-ASCII)

From 292f8d0249b2b1772d2cff6ae1208954eb188b7f Mon Sep 17 00:00:00 2001
From: Vignesh <vignesh21@gmail.com>
Date: Mon, 13 Jan 2025 12:27:55 +0530
Subject: [PATCH v3] Improve documentation on table synchronization worker
 process

Update the documentation to explain the process of initial data
replication using the table synchronization worker. Also, clarify
how the worker process is automatically respawned in the event of
failure, based on the configuration of the wal_retrieve_retry_interval
parameter guc.
---
 doc/src/sgml/config.sgml              |  3 ++-
 doc/src/sgml/logical-replication.sgml | 33 +++++++++++++++++----------
 2 files changed, 23 insertions(+), 13 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 3f41a17b1f..e063598879 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4953,7 +4953,8 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
        </para>
        <para>
         In logical replication, this parameter also limits how often a failing
-        replication apply worker will be respawned.
+        replication apply worker or table synchronization worker will be
+        respawned.
        </para>
       </listitem>
      </varlistentry>
diff --git a/doc/src/sgml/logical-replication.sgml b/doc/src/sgml/logical-replication.sgml
index 8290cd1a08..86aea348f1 100644
--- a/doc/src/sgml/logical-replication.sgml
+++ b/doc/src/sgml/logical-replication.sgml
@@ -1993,18 +1993,18 @@ CONTEXT:  processing remote data for replication origin "pg_16395" during "INSER
     <title>Initial Snapshot</title>
     <para>
      The initial data in existing subscribed tables are snapshotted and
-     copied in a parallel instance of a special kind of apply process.
-     This process will create its own replication slot and copy the existing
-     data.  As soon as the copy is finished the table contents will become
-     visible to other backends.  Once existing data is copied, the worker
-     enters synchronization mode, which ensures that the table is brought
-     up to a synchronized state with the main apply process by streaming
-     any changes that happened during the initial data copy using standard
-     logical replication.  During this synchronization phase, the changes
-     are applied and committed in the same order as they happened on the
-     publisher.  Once synchronization is done, control of the
-     replication of the table is given back to the main apply process where
-     replication continues as normal.
+     copied in a parallel instance of a special table synchronization
+     worker process per table. This process will create its own replication
+     slot and copy the existing data.  As soon as the copy is finished the
+     table contents will become visible to other backends.  Once existing data
+     is copied, the worker enters synchronization mode, which ensures that the
+     table is brought up to a synchronized state with the main apply process by
+     streaming any changes that happened during the initial data copy using
+     standard logical replication.  During this synchronization phase, the
+     changes are applied and committed in the same order as they happened on
+     the publisher.  Once synchronization is done, control of the replication
+     of the table is given back to the main apply process where replication
+     continues as normal.
     </para>
     <note>
      <para>
@@ -2015,6 +2015,15 @@ CONTEXT:  processing remote data for replication origin "pg_16395" during "INSER
       when copying the existing table data.
      </para>
     </note>
+    <note>
+     <para>
+      If a table synchronization worker fails during copy, the apply worker
+      detects the failure and respawns the table synchronization worker to
+      continue the synchronization process. This behaviour ensures that
+      transient errors do not permanently disrupt the replication setup. See
+      also <link linkend="guc-wal-retrieve-retry-interval"><varname>wal_retrieve_retry_interval</varname></link>.
+     </para>
+    </note>
   </sect2>
  </sect1>
 
-- 
2.43.0

Peter Smith

smithpb2250@gmail.com

about 1 year ago

In reply to: vignesh C (#5)

Re: Documentation update of wal_retrieve_retry_interval to mention table sync worker

Patch v3-0001 LGTM

======
Kind Regards,
Peter Smith.
Fujitsu Australia

Shlok Kyal

shlok.kyal.oss@gmail.com

about 1 year ago

In reply to: vignesh C (#5)

Re: Documentation update of wal_retrieve_retry_interval to mention table sync worker

On Mon, 13 Jan 2025 at 12:33, vignesh C <vignesh21@gmail.com> wrote:

> On Mon, 6 Jan 2025 at 08:47, Peter Smith <smithpb2250@gmail.com> wrote:
>
> > Hi Vignesh,
> >
> > Some review comments for your v2 patch.
> >
> > ======
> > doc/src/sgml/logical-replication.sgml
> >
> > AFAICT the only difference you made is changing:
> > FROM "a special kind of apply process"
> > TO "a special kind of table synchronization worker process".
> >
> > There is only ONE kind of tablesync process, so I think saying "a
> > special kind of table synchronization worker process" seems
> > misleading. I also thought maybe it is better to mention that this is
> > PER table.
> >
> > SUGGESTION:
> > ... a special table synchronization worker process per table.
>
> Thanks, the updated v3 version patch has the changes for the same.

Hi Vignesh,

I reviewed the v3 patch. And it looks good to me.

Thanks and Regards,
Shlok Kyal
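
======
For readers following the thread, the throttled relaunch behaviour that
the v3 patch documents can be illustrated with a small sketch. This is
not PostgreSQL's implementation: the class and method names below are
hypothetical, and measuring the interval from the last launch attempt
per table is an assumption of the sketch. The 5-second value in the
usage note mirrors the default of wal_retrieve_retry_interval.

```python
import time
from typing import Callable, Dict


class SyncWorkerLauncher:
    """Hypothetical model of an apply worker that rate-limits the
    relaunch of per-table synchronization workers."""

    def __init__(self, retry_interval_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        # Stand-in for wal_retrieve_retry_interval, in seconds.
        self.retry_interval_s = retry_interval_s
        # Injectable clock so the behaviour can be checked without sleeping.
        self.clock = clock
        # Time of the most recent launch attempt, keyed by table name.
        self.last_start: Dict[str, float] = {}

    def maybe_launch(self, table: str) -> bool:
        """Return True if a sync worker for `table` may be (re)started now."""
        now = self.clock()
        last = self.last_start.get(table)
        if last is not None and now - last < self.retry_interval_s:
            # Too soon after the previous attempt: skip this round.
            return False
        self.last_start[table] = now
        return True
```

With retry_interval_s=5.0, a worker for a given table that fails right
after being launched is not relaunched until 5 seconds have passed since
that launch, while workers for other tables are unaffected.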