Improve pg_sync_replication_slots() to wait for primary to advance
Hello,
Creating this thread for a POC based on discussions in thread [1].
Hou-san had created this patch, and I just cleaned up some documents,
did some testing and now sharing the patch here.
In this patch, the pg_sync_replication_slots() API now waits
indefinitely for the remote slot to catch up. We could later add a
timeout parameter to control maximum wait time if this approach seems
acceptable. If there are more ideas on improving this patch, let me
know.
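To make the discussed behavior concrete, here is a minimal, self-contained C sketch of such a wait loop. This is illustrative only, not the patch's actual code: get_remote_restart_lsn() is a hypothetical stand-in for re-fetching the remote slot's position from the primary, and the real implementation would sleep on a latch between checks rather than spin.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* stand-in for PostgreSQL's LSN type */

/*
 * Poll until the remote slot's restart_lsn reaches the locally
 * required LSN, then return the remote value. A real implementation
 * would wait on a latch (and log periodically) instead of spinning.
 */
static XLogRecPtr
wait_for_remote_slot(XLogRecPtr required_lsn,
                     XLogRecPtr (*get_remote_restart_lsn)(void))
{
    for (;;)
    {
        XLogRecPtr remote_lsn = get_remote_restart_lsn();

        if (remote_lsn >= required_lsn)
            return remote_lsn;  /* remote has caught up; safe to sync */
    }
}
```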
regards,
Ajin Cherian
[1]: /messages/by-id/CAF1DzPWTcg+m+x+oVVB=y4q9=PYYsL_mujVp7uJr-_oUtWNGbA@mail.gmail.com
Attachments:
0001-Fix-stale-snapshot-issue.patch
Sorry, I attached the wrong file. Attaching the correct file now.
regards,
Ajin Cherian
Fujitsu Australia
Attachments:
0001-Improve-initial-slot-synchronization-in-pg_sync_repl.patch
On Tue, Jun 24, 2025 at 4:11 PM Ajin Cherian <itsajin@gmail.com> wrote:
Hello,
Creating this thread for a POC based on discussions in thread [1].
Hou-san had created this patch, and I just cleaned up some documents,
did some testing and now sharing the patch here.

In this patch, the pg_sync_replication_slots() API now waits
indefinitely for the remote slot to catch up. We could later add a
timeout parameter to control maximum wait time if this approach seems
acceptable. If there are more ideas on improving this patch, let me
know.
+1 on the idea.
I believe the timeout option may not be necessary here, since the API
can be manually canceled if needed. Otherwise, the recommended
approach is to let it complete. But I would like to know what others
think here.
Few comments:
1)
When the API is waiting for the primary to advance, standby fails to
handle promotion requests. Promotion fails:
./pg_ctl -D ../../standbydb/ promote -w
waiting for server to promote.................stopped waiting
pg_ctl: server did not promote in time
See the logs at [1].
2)
Also when the API is waiting for a long time, it just dumps the
'waiting for remote_slot..' LOG only once. Do you think it makes sense
to log it at a regular interval until the wait is over? See logs at
[1].
3)
+ /*
+ * It is possible to get null value for restart_lsn if the slot is
+ * invalidated on the primary server, so handle accordingly.
+ */
+ if (new_invalidated || XLogRecPtrIsInvalid(new_restart_lsn))
+ {
+ /*
+ * The slot won't be persisted by the caller; it will be cleaned up
+ * at the end of synchronization.
+ */
+ ereport(WARNING,
+ errmsg("aborting initial sync for slot \"%s\"",
+ remote_slot->name),
+ errdetail("This slot was invalidated on the primary server."));
Which case are we referring to here where null restart_lsn would mean
invalidation? Can you please point me to such code where it happens or
a test-case which does that. I tried a few invalidation cases, but did
not hit it.
[1]: Log file:
2025-07-02 14:38:09.851 IST [153187] LOG: waiting for remote slot
"failover_slot" LSN (0/3003F60) and catalog xmin (754) to pass local
slot LSN (0/3003F60) and catalog xmin (767)
2025-07-02 14:38:09.851 IST [153187] STATEMENT: SELECT
pg_sync_replication_slots();
2025-07-02 14:41:36.200 IST [153164] LOG: received promote request
thanks
Shveta
Please find few more comments:
1)
In pg_sync_replication_slots() doc, we have this:
"Note that this function is primarily intended for testing and
debugging purposes and should be used with caution. Additionally, this
function cannot be executed if ...."
We can get rid of this info as well and change to:
"Note that this function cannot be executed if...."
2)
We got rid of NOTE in logicaldecoding.sgml, but now the page does not
mention pg_sync_replication_slots() at all. We need to bring back the
change removed by [1] (or something on a similar line), which is this:
- <command>CREATE SUBSCRIPTION</command> during slot creation, and
then calling
- <link linkend="pg-sync-replication-slots">
- <function>pg_sync_replication_slots</function></link>
- on the standby. By setting <link linkend="guc-sync-replication-slots">
+ <command>CREATE SUBSCRIPTION</command> during slot creation.
+ Additionally, enabling <link linkend="guc-sync-replication-slots">
+ <varname>sync_replication_slots</varname></link> on the standby
+ is required. By enabling <link linkend="guc-sync-replication-slots">
3)
wait_for_primary_slot_catchup():
+ /*
+ * It is possible to get null values for confirmed_lsn and
+ * catalog_xmin if on the primary server the slot is just created with
+ * a valid restart_lsn and slot-sync worker has fetched the slot
+ * before the primary server could set valid confirmed_lsn and
+ * catalog_xmin.
+ */
Do we need this special handling? We already have one such handling in
synchronize_slots(). please see:
/*
 * If restart_lsn, confirmed_lsn or catalog_xmin is invalid but the
 * slot is valid, that means we have fetched the remote_slot in its
 * RS_EPHEMERAL state. In such a case, don't sync it; we can always
 * sync it in the next sync cycle when the remote_slot is persisted
 * and has valid lsn(s) and xmin values.
 */
if ((XLogRecPtrIsInvalid(remote_slot->restart_lsn) ||
     XLogRecPtrIsInvalid(remote_slot->confirmed_lsn) ||
     !TransactionIdIsValid(remote_slot->catalog_xmin)) &&
    remote_slot->invalidated == RS_INVAL_NONE)
    pfree(remote_slot);
Due to the above check in synchronize_slots(), we will not reach
wait_for_primary_slot_catchup() when any of confirmed_lsn or
catalog_xmin is not initialized.
[1]: /messages/by-id/CAJpy0uAD_La2vi+B+iSBbCYTMayMstvbF9ndrAJysL9t5fHtbQ@mail.gmail.com
thanks
Shveta
On Wed, Jul 2, 2025 at 7:56 PM shveta malik <shveta.malik@gmail.com> wrote:
Few comments:
1)
When the API is waiting for the primary to advance, standby fails to
handle promotion requests. Promotion fails:
./pg_ctl -D ../../standbydb/ promote -w
waiting for server to promote.................stopped waiting
pg_ctl: server did not promote in time

See the logs at [1].
I've modified this to handle promotion request and stop slot
synchronization if standby is promoted.
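As a rough, self-contained sketch of that idea (illustrative only; promote_is_triggered() here is a hypothetical stand-in for the server's actual promotion check, and the message text is not the patch's wording):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical flag standing in for the server's promotion state. */
static bool promotion_requested = false;

static bool
promote_is_triggered(void)
{
    return promotion_requested;
}

/*
 * One check inside the wait loop: once the standby is asked to
 * promote, stop waiting and abort the sync instead of blocking the
 * promotion indefinitely.
 */
static bool
continue_waiting(const char *slot_name)
{
    if (promote_is_triggered())
    {
        fprintf(stderr,
                "aborting sync for slot \"%s\": promotion triggered\n",
                slot_name);
        return false;
    }
    return true;
}
```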
2)
Also when the API is waiting for a long time, it just dumps the
'waiting for remote_slot..' LOG only once. Do you think it makes sense
to log it at a regular interval until the wait is over? See logs at
[1]. It dumped the log once in 3 minutes.
I've modified it to log once every 10 seconds.
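A minimal sketch of that kind of rate limiting (illustrative only, using plain time_t seconds rather than the server's timestamp APIs):

```c
#include <assert.h>
#include <time.h>

/*
 * Return 1 when the "waiting for remote slot ..." message should be
 * emitted: on the first check, and then at most once per
 * interval_secs thereafter.
 */
static int
should_log_wait(time_t now, time_t *last_log, int interval_secs)
{
    if (*last_log == 0 || now - *last_log >= interval_secs)
    {
        *last_log = now;
        return 1;
    }
    return 0;
}
```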
3)
+ /*
+ * It is possible to get null value for restart_lsn if the slot is
+ * invalidated on the primary server, so handle accordingly.
+ */
+ if (new_invalidated || XLogRecPtrIsInvalid(new_restart_lsn))
+ {
+ /*
+ * The slot won't be persisted by the caller; it will be cleaned up
+ * at the end of synchronization.
+ */
+ ereport(WARNING,
+ errmsg("aborting initial sync for slot \"%s\"",
+ remote_slot->name),
+ errdetail("This slot was invalidated on the primary server."));

Which case are we referring to here where null restart_lsn would mean
invalidation? Can you please point me to such code where it happens or
a test-case which does that. I tried a few invalidation cases, but did
not hit it.
I've removed all this code as the checks for null restart_lsn and
other parameters are handled in earlier functions and we won't get
here if these had null, I've added asserts to check that it is not
null.
On Wed, Jul 9, 2025 at 2:53 PM shveta malik <shveta.malik@gmail.com> wrote:
Please find few more comments:
1)
In pg_sync_replication_slots() doc, we have this:

"Note that this function is primarily intended for testing and
debugging purposes and should be used with caution. Additionally, this
function cannot be executed if ...."

We can get rid of this info as well and change to:

"Note that this function cannot be executed if...."
Modified as requested.
2)
We got rid of NOTE in logicaldecoding.sgml, but now the page does not
mention pg_sync_replication_slots() at all. We need to bring back the
change removed by [1] (or something on a similar line), which is this:

- <command>CREATE SUBSCRIPTION</command> during slot creation, and then calling
- <link linkend="pg-sync-replication-slots">
- <function>pg_sync_replication_slots</function></link>
- on the standby. By setting <link linkend="guc-sync-replication-slots">
+ <command>CREATE SUBSCRIPTION</command> during slot creation.
+ Additionally, enabling <link linkend="guc-sync-replication-slots">
+ <varname>sync_replication_slots</varname></link> on the standby
+ is required. By enabling <link linkend="guc-sync-replication-slots">
I've added that back.
3)
wait_for_primary_slot_catchup():

+ /*
+ * It is possible to get null values for confirmed_lsn and
+ * catalog_xmin if on the primary server the slot is just created with
+ * a valid restart_lsn and slot-sync worker has fetched the slot
+ * before the primary server could set valid confirmed_lsn and
+ * catalog_xmin.
+ */

Do we need this special handling? We already have one such handling in
synchronize_slots(). please see:
/*
 * If restart_lsn, confirmed_lsn or catalog_xmin is invalid but the
 * slot is valid, that means we have fetched the remote_slot in its
 * RS_EPHEMERAL state. In such a case, don't sync it; we can always
 * sync it in the next sync cycle when the remote_slot is persisted
 * and has valid lsn(s) and xmin values.
 */
if ((XLogRecPtrIsInvalid(remote_slot->restart_lsn) ||
     XLogRecPtrIsInvalid(remote_slot->confirmed_lsn) ||
     !TransactionIdIsValid(remote_slot->catalog_xmin)) &&
    remote_slot->invalidated == RS_INVAL_NONE)
    pfree(remote_slot);

Due to the above check in synchronize_slots(), we will not reach
wait_for_primary_slot_catchup() when any of confirmed_lsn or
catalog_xmin is not initialized.
Yes, you are correct. I've removed all that logic.
The modified patch (v2) is attached.
regards,
Ajin Cherian
Fujitsu Australia
Attachments:
v2-0001-Improve-initial-slot-synchronization-in-pg_sync_r.patch
On Wed, Jul 16, 2025 at 3:00 PM Ajin Cherian <itsajin@gmail.com> wrote:
[...]
The modified patch (v2) is attached.
I am not able to apply the patch to the latest head or even to a week
back version. Can you please check and rebase?
thanks
Shveta
I am not able to apply the patch to the latest head or even to a week
back version. Can you please check and rebase?
Rebased.
Regards,
Ajin Cherian
Fujitsu Australia.
Attachments:
v2-0001-Improve-initial-slot-synchronization-in-pg_sync_r.patch
On Wed, Jul 16, 2025 at 3:47 PM Ajin Cherian <itsajin@gmail.com> wrote:
I am not able to apply the patch to the latest head or even to a week
back version. Can you please check and rebase?

Rebased.
Thanks. Please find a few comments:
1)
/* Any slot with NULL in these fields should not have made it this far */
It is good to get rid of the case where we had checks for NULL
confirmed_lsn and catalog_xmin (i.e. when slot was in RS_EPHEMERAL
state), as that has already been checked by synchronize_slots() and
such a slot will not even reach wait_for_primary_slot_catchup(). But a
slot can still be invalidated on primary anytime, and thus during this
wait, we should check for primary's invalidation as we were doing in
v1.
2)
+ * If in SQL API synchronization, and we've been promoted, then no point
extra space before promoted.
3)
+ if (!AmLogicalSlotSyncWorkerProcess() && PromoteIsTriggered())
We don't need 'AmLogicalSlotSyncWorkerProcess' as that is already
checked at the beginning of this function.
4)
+ ereport(WARNING,
+ errmsg("aborting sync for slot \"%s\"",
+ remote_slot->name),
+ errdetail("Promotion occurred before this slot was fully"
+ " synchronized."));
+ pfree(cmd.data);
+
+ return false;
a) Please add an error-code.
b) Shall we change msg to
errmsg("aborting sync for slot \"%s\"",
remote_slot->name),
errhint("%s cannot be executed once promotion is
triggered.",
"pg_sync_replication_slots()")));
5)
Instead of using PromoteIsTriggered, shall we rely on
'SlotSyncCtx->stopSignaled' as we do when we start this API.
6)
In logicaldecoding.sgml, we can get rid of "Additionally, enabling
sync_replication_slots on the standby is required" to make it same as
what we had prior to the patch I pointed earlier.
Or better we can refine it to below. Thoughts?
The logical replication slots on the primary can be enabled for
synchronization to the hot standby by using the failover parameter of
pg_create_logical_replication_slot, or by using the failover option of
CREATE SUBSCRIPTION during slot creation. After that, synchronization
can be performed either manually by calling pg_sync_replication_slots
on the standby, or automatically by enabling sync_replication_slots on
the standby. When sync_replication_slots is enabled, the failover
slots are periodically synchronized by the slot sync worker. For the
synchronization to work, .....
thanks
Shveta
On Thu, Jul 17, 2025 at 9:34 AM shveta malik <shveta.malik@gmail.com> wrote:
[...]
I am wondering if we should provide an optional parameter to
pg_sync_replication_slots(), to control whether to wait for the slot
to be synced or just return with ERROR as it is doing now, default can
be wait.
--
Regards,
Dilip Kumar
Google
On Fri, Jul 18, 2025 at 10:14 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
[...]

I am wondering if we should provide an optional parameter to
pg_sync_replication_slots(), to control whether to wait for the slot
to be synced or just return with ERROR as it is doing now, default can
be wait.
Do you mean specifically in case of promotion or in general, as in do
not wait for primary to catch-up (anytime) and exit and drop the
temporary slot while exiting?
thanks
Shveta
On Fri, Jul 18, 2025 at 10:45 AM shveta malik <shveta.malik@gmail.com> wrote:
[...]

Do you mean specifically in case of promotion or in general, as in do
not wait for primary to catch-up (anytime) and exit and drop the
temporary slot while exiting?
I am specifically pointing to the exposed function
pg_sync_replication_slots(), which was earlier non-blocking and
gave an error if the primary slot did not catch up; we have
improved the functionality in this thread by making it wait for
the primary to catch up instead of just throwing an error. But my
question was: since this is a user-facing function, shall we keep the
old behavior intact with some optional parameter so that if the user
chooses not to wait they have the option to do that?
--
Regards,
Dilip Kumar
Google
On Fri, Jul 18, 2025 at 10:52 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
[...]

I am specifically pointing to the exposed function
pg_sync_replication_slots(), which was earlier non-blocking and
gave an error if the primary slot did not catch up; we have
improved the functionality in this thread by making it wait for
the primary to catch up instead of just throwing an error. But my
question was: since this is a user-facing function, shall we keep the
old behavior intact with some optional parameter so that if the user
chooses not to wait they have the option to do that?
Okay. I see your point. Yes, it was non-blocking earlier but it was
not giving ERROR, it was just dumping in the logfile that the primary is
behind and thus slot-sync could not be done.
If we continue using the non-blocking mode, there’s a risk that the
API may never successfully sync the slots. This is because it
eventually drops the temporary slot on exit, and when it tries to
create a new one later on a subsequent call, it’s likely that the new
slot will again be ahead of the primary. This may happen if we have
continuous ongoing writes on the primary and the logical slot is not
being consumed at the same pace.
My preference would be to avoid including such an option as it is
confusing. With such an option in place, users may think that
slot-sync is completed while that may not be the case. But if it's
necessary for backward compatibility, it should be okay to provide it
as a non-default option as you suggested. Would like to know what
others think of this.
thanks
Shveta
On Fri, Jul 18, 2025 at 11:25 AM shveta malik <shveta.malik@gmail.com> wrote:
Okay. I see your point. Yes, it was non-blocking earlier but it was
not giving ERROR, it was just dumping in the logfile that the primary is
behind and thus slot-sync could not be done.

If we continue using the non-blocking mode, there’s a risk that the
API may never successfully sync the slots. This is because it
eventually drops the temporary slot on exit, and when it tries to
create a new one later on a subsequent call, it’s likely that the new
slot will again be ahead of the primary. This may happen if we have
continuous ongoing writes on the primary and the logical slot is not
being consumed at the same pace.

My preference would be to avoid including such an option as it is
confusing. With such an option in place, users may think that
slot-sync is completed while that may not be the case.
Fair enough
But if it's
necessary for backward compatibility, it should be okay to provide it
as a non-default option as you suggested. Would like to know what
others think of this.
I think we don't need to maintain backward compatibility here as it
was not behaving sanely before, so I am fine with providing the
behaviour we are doing with the patch.
--
Regards,
Dilip Kumar
Google
On Fri, Jul 18, 2025 at 11:31 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Fri, Jul 18, 2025 at 11:25 AM shveta malik <shveta.malik@gmail.com> wrote:
Okay. I see your point. Yes, it was non-blocking earlier but it was
not giving ERROR, it was just logging in the logfile that the primary is
behind and thus slot-sync could not be done.

If we continue using the non-blocking mode, there’s a risk that the
API may never successfully sync the slots. This is because it
eventually drops the temporary slot on exit, and when it tries to
create a new one later on a subsequent call, it’s likely that the new
slot will again be ahead of the primary. This may happen if we have
continuous ongoing writes on the primary and the logical slot is not
being consumed at the same pace.

My preference would be to avoid including such an option as it is
confusing. With such an option in place, users may think that
slot-sync is completed while that may not be the case.

Fair enough

I think if we want, we may return a bool and return false when sync is
not complete, say due to promotion or another reason like a timeout.
However, at this stage it is not very clear whether it will be useful
to provide an additional timeout parameter. But we can consider returning
true/false depending on whether we are successful in syncing the slots
or not.
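As a rough sketch of the boolean-return idea, a caller could then check the result directly; this is only an illustration of the proposal, not the current function signature (which returns void):

```sql
-- Hypothetical: true if all failover slots were synced before returning,
-- false if the sync was interrupted (e.g. by promotion or a future timeout).
SELECT pg_sync_replication_slots() AS synced;
```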
--
With Regards,
Amit Kapila.
On Sat, Jul 19, 2025 at 5:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Jul 18, 2025 at 11:31 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Fri, Jul 18, 2025 at 11:25 AM shveta malik <shveta.malik@gmail.com> wrote:
Okay. I see your point. Yes, it was non-blocking earlier but it was
not giving ERROR, it was just logging in the logfile that the primary is
behind and thus slot-sync could not be done.

If we continue using the non-blocking mode, there’s a risk that the
API may never successfully sync the slots. This is because it
eventually drops the temporary slot on exit, and when it tries to
create a new one later on a subsequent call, it’s likely that the new
slot will again be ahead of the primary. This may happen if we have
continuous ongoing writes on the primary and the logical slot is not
being consumed at the same pace.

My preference would be to avoid including such an option as it is
confusing. With such an option in place, users may think that
slot-sync is completed while that may not be the case.

Fair enough

I think if we want, we may return a bool and return false when sync is
not complete, say due to promotion or another reason like a timeout.
However, at this stage it is not very clear whether it will be useful
to provide an additional timeout parameter. But we can consider returning
true/false depending on whether we are successful in syncing the slots
or not.
I am not very sure if, in the current scenario, such a return value
will have any value addition. Since this function will be waiting
indefinitely until all the slots are synced, it is supposed to return
true in such normal scenarios. If it is interrupted by promotion or
the user cancels it manually, then it is supposed to return false. But in
those cases, a more helpful approach would be to log a clear WARNING
or ERROR message like "sync interrupted by promotion" (or similar
reasons), rather than relying on a return value. In the future, if we plan
to add a timeout parameter, this return value makes more sense in
normal scenarios as well, as the function can easily return false if the
timeout value is short, the number of slots is huge, or slots are stuck
waiting on the primary.
Additionally, if we do return a value, there may be an expectation
that the API should also provide details on the list of slots that
couldn't be synced. That could introduce unnecessary complexity at
this stage. We can avoid it for now and consider adding such
enhancements later if we receive relevant customer feedback. Please
note that our recommended approach for syncing slots still remains the
'slot sync worker' method.
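For context, the recommended 'slot sync worker' method mentioned above is enabled through standby configuration rather than the SQL function; the settings below are the standard ones, with example connection values:

```
# postgresql.conf on the standby (connection values are examples)
sync_replication_slots = on
hot_standby_feedback = on
# slot synchronization requires a dbname in primary_conninfo
primary_conninfo = 'host=primary.example.com port=5432 user=repluser dbname=postgres'
```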
thanks
Shveta
On Mon, Jul 21, 2025 at 10:08 AM shveta malik <shveta.malik@gmail.com> wrote:
On Sat, Jul 19, 2025 at 5:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Jul 18, 2025 at 11:31 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Fri, Jul 18, 2025 at 11:25 AM shveta malik <shveta.malik@gmail.com> wrote:
Okay. I see your point. Yes, it was non-blocking earlier but it was
not giving ERROR, it was just logging in the logfile that the primary is
behind and thus slot-sync could not be done.

If we continue using the non-blocking mode, there’s a risk that the
API may never successfully sync the slots. This is because it
eventually drops the temporary slot on exit, and when it tries to
create a new one later on a subsequent call, it’s likely that the new
slot will again be ahead of the primary. This may happen if we have
continuous ongoing writes on the primary and the logical slot is not
being consumed at the same pace.

My preference would be to avoid including such an option as it is
confusing. With such an option in place, users may think that
slot-sync is completed while that may not be the case.

Fair enough

I think if we want, we may return a bool and return false when sync is
not complete, say due to promotion or another reason like a timeout.
However, at this stage it is not very clear whether it will be useful
to provide an additional timeout parameter. But we can consider returning
true/false depending on whether we are successful in syncing the slots
or not.

I am not very sure if in the current scenario, such a return-value
will have any value addition. Since this function will be waiting
indefinitely until all the slots are synced, it is supposed to return
true in such normal scenarios. If it is interrupted by promotion or
user cancels it manually, then it is supposed to return false. But in
those cases, a more helpful approach would be to log a clear WARNING
or ERROR message like "sync interrupted by promotion" (or similar
reasons), rather than relying on a return value. In future, if we plan
to add a timeout-parameter, then this return value makes more sense as
in normal scenarios as well, as it can easily return false if the
timeout value is short or the number of slots are huge or are stuck
waiting on primary.

Additionally, if we do return a value, there may be an expectation
that the API should also provide details on the list of slots that
couldn't be synced. That could introduce unnecessary complexity at
this stage. We can avoid it for now and consider adding such
enhancements later if we receive relevant customer feedback.
makes sense.
Please
note that our recommended approach for syncing slots still remains the
'slot sync worker' method.
Right.
--
With Regards,
Amit Kapila.
On Mon, Jul 21, 2025 at 10:08 AM shveta malik <shveta.malik@gmail.com> wrote:
On Sat, Jul 19, 2025 at 5:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Jul 18, 2025 at 11:31 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Fri, Jul 18, 2025 at 11:25 AM shveta malik <shveta.malik@gmail.com> wrote:
Okay. I see your point. Yes, it was non-blocking earlier but it was
not giving ERROR, it was just logging in the logfile that the primary is
behind and thus slot-sync could not be done.

If we continue using the non-blocking mode, there’s a risk that the
API may never successfully sync the slots. This is because it
eventually drops the temporary slot on exit, and when it tries to
create a new one later on a subsequent call, it’s likely that the new
slot will again be ahead of the primary. This may happen if we have
continuous ongoing writes on the primary and the logical slot is not
being consumed at the same pace.

My preference would be to avoid including such an option as it is
confusing. With such an option in place, users may think that
slot-sync is completed while that may not be the case.

Fair enough

I think if we want, we may return a bool and return false when sync is
not complete, say due to promotion or another reason like a timeout.
However, at this stage it is not very clear whether it will be useful
to provide an additional timeout parameter. But we can consider returning
true/false depending on whether we are successful in syncing the slots
or not.

I am not very sure if in the current scenario, such a return-value
will have any value addition. Since this function will be waiting
indefinitely until all the slots are synced, it is supposed to return
true in such normal scenarios. If it is interrupted by promotion or
user cancels it manually, then it is supposed to return false. But in
those cases, a more helpful approach would be to log a clear WARNING
or ERROR message like "sync interrupted by promotion" (or similar
reasons), rather than relying on a return value. In future, if we plan
to add a timeout-parameter, then this return value makes more sense as
in normal scenarios as well, as it can easily return false if the
timeout value is short or the number of slots are huge or are stuck
waiting on primary.
Additionally, if we do return a value, there may be an expectation
that the API should also provide details on the list of slots that
couldn't be synced. That could introduce unnecessary complexity at
this stage. We can avoid it for now and consider adding such
enhancements later if we receive relevant customer feedback. Please
note that our recommended approach for syncing slots still remains the
'slot sync worker' method.
+1
--
Regards,
Dilip Kumar
Google
On Thu, Jul 17, 2025 at 2:04 PM shveta malik <shveta.malik@gmail.com> wrote:
On Wed, Jul 16, 2025 at 3:47 PM Ajin Cherian <itsajin@gmail.com> wrote:
I am not able to apply the patch to the latest head or even to a week
back version. Can you please check and rebase?

thanks
Shveta

Rebased.
Thanks. Please find a few comments:
1)
/* Any slot with NULL in these fields should not have made it this far */

It is good to get rid of the case where we had checks for NULL
confirmed_lsn and catalog_xmin (i.e. when the slot was in RS_EPHEMERAL
state), as that has already been checked by synchronize_slots() and
such a slot will not even reach wait_for_primary_slot_catchup(). But a
slot can still be invalidated on the primary at any time, and thus during
this wait, we should check for the primary's invalidation as we were
doing in v1.
I've added back the check for invalidated slots.
2)
+ * If in SQL API synchronization, and we've been promoted, then no point

extra space before promoted.
Fixed.
3)
+ if (!AmLogicalSlotSyncWorkerProcess() && PromoteIsTriggered())
We don't need 'AmLogicalSlotSyncWorkerProcess' as that is already
checked at the beginning of this function.
Fixed.
4)
+ ereport(WARNING,
+         errmsg("aborting sync for slot \"%s\"",
+                remote_slot->name),
+         errdetail("Promotion occurred before this slot was fully"
+                   " synchronized."));
+ pfree(cmd.data);
+
+ return false;

a) Please add an error-code.
b) Shall we change the msg to:

errmsg("aborting sync for slot \"%s\"",
       remote_slot->name),
errhint("%s cannot be executed once promotion is triggered.",
        "pg_sync_replication_slots()")));
Since there is already an error return at the start if promotion is
triggered, I've kept the same error code and message here as well for
consistency.
5)
Instead of using PromoteIsTriggered, shall we rely on
'SlotSyncCtx->stopSignaled' as we do when we start this API.
Fixed.
6)
In logicaldecoding.sgml, we can get rid of "Additionally, enabling
sync_replication_slots on the standby is required" to make it the same as
what we had prior to the patch I pointed to earlier.

Or better, we can refine it to the below. Thoughts?
The logical replication slots on the primary can be enabled for
synchronization to the hot standby by using the failover parameter of
pg_create_logical_replication_slot, or by using the failover option of
CREATE SUBSCRIPTION during slot creation. After that, synchronization
can be performed either manually by calling pg_sync_replication_slots
on the standby, or automatically by enabling sync_replication_slots on
the standby. When sync_replication_slots is enabled, the failover
slots are periodically synchronized by the slot sync worker. For the
synchronization to work, .....
Updated as above.
Patch v3 attached.
Regards,
Ajin Cherian
Fujitsu Australia
Attachments:
v3-0001-Improve-initial-slot-synchronization-in-pg_sync_r.patch
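For reference, the slot-creation paths described in the proposed documentation text above can be exercised as follows; the slot, subscription, publication, and connection names are made up, while the failover argument/option and pg_sync_replication_slots() are existing interfaces:

```sql
-- On the primary: create a failover-enabled logical slot directly
-- (arguments: slot name, plugin, temporary, twophase, failover) ...
SELECT pg_create_logical_replication_slot('my_failover_slot', 'pgoutput',
                                          false, false, true);

-- ... or have CREATE SUBSCRIPTION (run on a subscriber) create one:
CREATE SUBSCRIPTION my_sub
    CONNECTION 'host=primary.example.com dbname=postgres'
    PUBLICATION my_pub
    WITH (failover = true);

-- On the standby: manual one-time sync (the function improved by this patch).
SELECT pg_sync_replication_slots();
```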
On Thu, Jul 31, 2025 at 3:11 PM Ajin Cherian <itsajin@gmail.com> wrote:
Patch v3 attached.
Thanks for the patch. I tested it, please find a few comments:
1)
It hits an assert
(slotsync_reread_config()-->Assert(sync_replication_slots)) when the API
is trying to sync and is in the wait loop, while in another session I
enable sync_replication_slots using:
ALTER SYSTEM SET sync_replication_slots = 'on';
SELECT pg_reload_conf();
Assert:
2025-08-01 10:55:43.637 IST [118576] STATEMENT: SELECT
pg_sync_replication_slots();
2025-08-01 10:55:51.730 IST [118563] LOG: received SIGHUP, reloading
configuration files
2025-08-01 10:55:51.731 IST [118563] LOG: parameter
"sync_replication_slots" changed to "on"
TRAP: failed Assert("sync_replication_slots"), File: "slotsync.c",
Line: 1334, PID: 118576
postgres: shveta postgres [local]
SELECT(ExceptionalCondition+0xbb)[0x61df0160e090]
postgres: shveta postgres [local] SELECT(+0x6520dc)[0x61df0133a0dc]
2025-08-01 10:55:51.739 IST [118666] ERROR: cannot synchronize
replication slots concurrently
postgres: shveta postgres [local] SELECT(+0x6522b2)[0x61df0133a2b2]
postgres: shveta postgres [local] SELECT(+0x650664)[0x61df01338664]
postgres: shveta postgres [local] SELECT(+0x650cf8)[0x61df01338cf8]
postgres: shveta postgres [local] SELECT(+0x6513ea)[0x61df013393ea]
postgres: shveta postgres [local] SELECT(+0x6519df)[0x61df013399df]
postgres: shveta postgres [local]
SELECT(SyncReplicationSlots+0xbb)[0x61df0133af60]
postgres: shveta postgres [local]
SELECT(pg_sync_replication_slots+0x1b1)[0x61df01357e52]
2)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot synchronize replication slots when"
+ " standby promotion is ongoing")));
I think a better error message will be:
"exiting from slot synchronization as promotion is triggered"

This will be better suited in the log file as well, after the below wait statements:
LOG: continuing to wait for remote slot "failover_slot" LSN
(0/3000060) and catalog xmin (755) to pass local slot LSN (0/3000060)
and catalog xmin (757)
STATEMENT: SELECT pg_sync_replication_slots();
3)
API dumps this when it is waiting for primary:
----
LOG: could not synchronize replication slot "failover_slot2"
DETAIL: Synchronization could lead to data loss, because the remote
slot needs WAL at LSN 0/03066E70 and catalog xmin 755, but the standby
has LSN 0/03066E70 and catalog xmin 770.
STATEMENT: SELECT pg_sync_replication_slots();
LOG: waiting for remote slot "failover_slot2" LSN (0/3066E70) and
catalog xmin (755) to pass local slot LSN (0/3066E70) and catalog xmin
(770)
STATEMENT: SELECT pg_sync_replication_slots();
LOG: continuing to wait for remote slot "failover_slot2" LSN
(0/3066E70) and catalog xmin (755) to pass local slot LSN (0/3066E70)
and catalog xmin (770)
STATEMENT: SELECT pg_sync_replication_slots();
----
Unsure if we shall still dump 'could not synchronize..' when it is
going to retry until it succeeds? The concerned log gives a feeling
that we are done trying and could not synchronize it. What do you
think?
thanks
Shveta
On Fri, Aug 1, 2025 at 12:02 PM shveta malik <shveta.malik@gmail.com> wrote:
On Thu, Jul 31, 2025 at 3:11 PM Ajin Cherian <itsajin@gmail.com> wrote:
Patch v3 attached.
Thanks for the patch. I tested it, please find a few comments:
1)
it hits an assert
(slotsync_reread_config()-->Assert(sync_replication_slots)) when API
is trying to sync and is in wait loop while in another session, I
enable sync_replication_slots using:

ALTER SYSTEM SET sync_replication_slots = 'on';
SELECT pg_reload_conf();

Assert:
2025-08-01 10:55:43.637 IST [118576] STATEMENT: SELECT
pg_sync_replication_slots();
2025-08-01 10:55:51.730 IST [118563] LOG: received SIGHUP, reloading
configuration files
2025-08-01 10:55:51.731 IST [118563] LOG: parameter
"sync_replication_slots" changed to "on"
TRAP: failed Assert("sync_replication_slots"), File: "slotsync.c",
Line: 1334, PID: 118576
postgres: shveta postgres [local]
SELECT(ExceptionalCondition+0xbb)[0x61df0160e090]
postgres: shveta postgres [local] SELECT(+0x6520dc)[0x61df0133a0dc]
2025-08-01 10:55:51.739 IST [118666] ERROR: cannot synchronize
replication slots concurrently
postgres: shveta postgres [local] SELECT(+0x6522b2)[0x61df0133a2b2]
postgres: shveta postgres [local] SELECT(+0x650664)[0x61df01338664]
postgres: shveta postgres [local] SELECT(+0x650cf8)[0x61df01338cf8]
postgres: shveta postgres [local] SELECT(+0x6513ea)[0x61df013393ea]
postgres: shveta postgres [local] SELECT(+0x6519df)[0x61df013399df]
postgres: shveta postgres [local]
SELECT(SyncReplicationSlots+0xbb)[0x61df0133af60]
postgres: shveta postgres [local]
SELECT(pg_sync_replication_slots+0x1b1)[0x61df01357e52]

2)
+ ereport(ERROR,
+         (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+          errmsg("cannot synchronize replication slots when"
+                 " standby promotion is ongoing")));

I think a better error message will be:
"exiting from slot synchronization as promotion is triggered"

This will be better suited in the log file as well, after the below wait statements:

LOG: continuing to wait for remote slot "failover_slot" LSN
(0/3000060) and catalog xmin (755) to pass local slot LSN (0/3000060)
and catalog xmin (757)
STATEMENT: SELECT pg_sync_replication_slots();

3)
API dumps this when it is waiting for the primary:

----
LOG: could not synchronize replication slot "failover_slot2"
DETAIL: Synchronization could lead to data loss, because the remote
slot needs WAL at LSN 0/03066E70 and catalog xmin 755, but the standby
has LSN 0/03066E70 and catalog xmin 770.
STATEMENT: SELECT pg_sync_replication_slots();
LOG: waiting for remote slot "failover_slot2" LSN (0/3066E70) and
catalog xmin (755) to pass local slot LSN (0/3066E70) and catalog xmin
(770)
STATEMENT: SELECT pg_sync_replication_slots();
LOG: continuing to wait for remote slot "failover_slot2" LSN
(0/3066E70) and catalog xmin (755) to pass local slot LSN (0/3066E70)
and catalog xmin (770)
STATEMENT: SELECT pg_sync_replication_slots();
----

Unsure if we shall still dump 'could not synchronize..' when it is
going to retry until it succeeds? The concerned log gives a feeling
that we are done trying and could not synchronize it. What do you
think?
A few more comments:
4)
+ if (!tuplestore_gettupleslot(res->tuplestore, true, false, tupslot))
+ {
+ ereport(WARNING,
+ errmsg("aborting initial sync for slot \"%s\"",
+ remote_slot->name),
+ errdetail("This slot was not found on the primary server."));
+
+ pfree(cmd.data);
+ walrcv_clear_result(res);
+
+ return false;
+ }
We may have 'aborting sync for slot', can remove 'initial'.
5)
I tried a test where there were 4 slots on the publisher, where one
was getting used while the others were not. Initiated
pg_sync_replication_slots on standby. Forced unused slots to be
invalidated by setting idle_replication_slot_timeout=60 on primary,
due to which API finished but gave a warning:
postgres=# SELECT pg_sync_replication_slots();
WARNING: aborting initial sync for slot "failover_slot"
DETAIL: This slot was invalidated on the primary server.
WARNING: aborting initial sync for slot "failover_slot2"
DETAIL: This slot was invalidated on the primary server.
WARNING: aborting initial sync for slot "failover_slot3"
DETAIL: This slot was invalidated on the primary server.
pg_sync_replication_slots
---------------------------
(1 row)
Do we need these warnings here? I think we can have it as a LOG rather
than having it on the console. Thoughts?

If we are inclined towards a WARNING here, will it be better to have it
as a single line:
WARNING: aborting sync for slot "failover_slot" as the slot was
invalidated on primary
WARNING: aborting sync for slot "failover_slot1" as the slot was
invalidated on primary
WARNING: aborting sync for slot "failover_slot2" as the slot was
invalidated on primary
6)
- * We do not drop the slot because the restart_lsn can be ahead of the
- * current location when recreating the slot in the next cycle. It may
- * take more time to create such a slot. Therefore, we keep this slot
- * and attempt the synchronization in the next cycle.
+ * If we're in the slotsync worker, we retain the slot and retry in the
+ * next cycle. The restart_lsn might advance by then, allowing the slot
+ * to be created successfully later.
*/
I like the previous comment better as it was conveying the side-effect
of dropping the slot here. Can we try to incorporate the previous
comment here and make it specific to slotsync workers?
thanks
Shveta