promotion related handling in pg_sync_replication_slots()
Hello hackers,
Currently, promotion related handling is missing in the slot sync SQL
function pg_sync_replication_slots(). Here is the background on how
it is done in slot sync worker:
During promotion, the startup process in order to shut down the
slot-sync worker, sets the 'stopSignaled' flag, sends the shut-down
signal, and waits for slot sync worker to exit. Meanwhile if the
postmaster has not noticed the promotion yet, it may end up restarting
slot sync worker. In such a case, the worker exits if 'stopSignaled'
is set.
Since there is a chance that the user (or any of his scripts/tools)
may execute SQL function pg_sync_replication_slots() in parallel to
promotion, such handling is needed in this SQL function as well, The
attached patch attempts to implement the same. Changes are:
1) If pg_sync_replication_slots() is already running when the
promotion is triggered, ShutDownSlotSync() checks the
'SlotSyncCtx->syncing' flag as well and waits for it to become false
i.e. waits till parallel running SQL function is finished.
2) If pg_sync_replication_slots() is invoked when promotion is
already in progress, pg_sync_replication_slots() respects the
'stopSignaled' flag set by the startup process and becomes a no-op.
thanks
Shveta
Attachments:
v1-0001-Handle-stopSignaled-during-sync-function-call.patchapplication/octet-stream; name=v1-0001-Handle-stopSignaled-during-sync-function-call.patchDownload+32-6
On Thu, Apr 4, 2024 at 5:05 PM shveta malik <shveta.malik@gmail.com> wrote:
Hello hackers,
Currently, promotion related handling is missing in the slot sync SQL
function pg_sync_replication_slots(). Here is the background on how
it is done in slot sync worker:
During promotion, the startup process in order to shut down the
slot-sync worker, sets the 'stopSignaled' flag, sends the shut-down
signal, and waits for slot sync worker to exit. Meanwhile if the
postmaster has not noticed the promotion yet, it may end up restarting
slot sync worker. In such a case, the worker exits if 'stopSignaled'
is set.Since there is a chance that the user (or any of his scripts/tools)
may execute SQL function pg_sync_replication_slots() in parallel to
promotion, such handling is needed in this SQL function as well, The
attached patch attempts to implement the same.
Thanks for the report and patch. I'll look into it.
--
With Regards,
Amit Kapila.
On Thu, Apr 4, 2024 at 5:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Apr 4, 2024 at 5:05 PM shveta malik <shveta.malik@gmail.com> wrote:
Hello hackers,
Currently, promotion related handling is missing in the slot sync SQL
function pg_sync_replication_slots(). Here is the background on how
it is done in slot sync worker:
During promotion, the startup process in order to shut down the
slot-sync worker, sets the 'stopSignaled' flag, sends the shut-down
signal, and waits for slot sync worker to exit. Meanwhile if the
postmaster has not noticed the promotion yet, it may end up restarting
slot sync worker. In such a case, the worker exits if 'stopSignaled'
is set.Since there is a chance that the user (or any of his scripts/tools)
may execute SQL function pg_sync_replication_slots() in parallel to
promotion, such handling is needed in this SQL function as well, The
attached patch attempts to implement the same.Thanks for the report and patch. I'll look into it.
Please find v2. Changes are:
1) Rebased the patch as there was conflict due to recent commit 6f132ed.
2) Added an Assert in update_synced_slots_inactive_since() to ensure
that the slot does not have active_pid.
3) Improved commit msg and comments.
thanks
Shveta
Attachments:
v2-0001-Handle-stopSignaled-during-sync-function-call.patchapplication/octet-stream; name=v2-0001-Handle-stopSignaled-during-sync-function-call.patchDownload+33-6
On Fri, Apr 5, 2024 at 10:31 AM shveta malik <shveta.malik@gmail.com> wrote:
Please find v2. Changes are:
1) Rebased the patch as there was conflict due to recent commit 6f132ed.
2) Added an Assert in update_synced_slots_inactive_since() to ensure
that the slot does not have active_pid.
3) Improved commit msg and comments.
Few comments:
==============
1.
void
SyncReplicationSlots(WalReceiverConn *wrconn)
{
+ /*
+ * Startup process signaled the slot sync to stop, so if meanwhile user
+ * has invoked slot sync SQL function, simply return.
+ */
+ SpinLockAcquire(&SlotSyncCtx->mutex);
+ if (SlotSyncCtx->stopSignaled)
+ {
+ ereport(LOG,
+ errmsg("skipping slot synchronization as slot sync shutdown is
signaled during promotion"));
+
+ SpinLockRelease(&SlotSyncCtx->mutex);
+ return;
+ }
+ SpinLockRelease(&SlotSyncCtx->mutex);
There is a race condition with this code. Say during promotion
ShutDownSlotSync() is just before setting this flag and the user has
invoked pg_sync_replication_slots() and passed this check but still
didn't set the SlotSyncCtx->syncing flag. So, now, the promotion would
recognize that there is slot sync going on in parallel, and slot sync
wouldn't know that the promotion is in progress.
The current coding for slot sync worker ensures there is no such race
by checking the stopSignaled and setting the pid together under
spinlock. I think we need to move the setting of the syncing flag
similarly. Once we do that probably checking SlotSyncCtx->syncing
should be sufficient in ShutDownSlotSync(). If we change the location
of setting the 'syncing' flag then please ensure its cleanup as we
currently do in slotsync_failure_callback().
2.
@@ -1395,6 +1395,7 @@ update_synced_slots_inactive_since(void)
if (s->in_use && s->data.synced)
{
Assert(SlotIsLogical(s));
+ Assert(s->active_pid == 0);
We can add a comment like: "The slot must not be acquired by any process"
--
With Regards,
Amit Kapila.
On Fri, Apr 5, 2024 at 10:31 AM shveta malik <shveta.malik@gmail.com> wrote:
Please find v2. Changes are:
Thanks for the patch. Here are some comments.
1. Can we have a clear saying in the shmem variable who's syncing at
the moment? Is it a slot sync worker or a backend via SQL function?
Perhaps turn "bool syncing;" to "SlotSyncSource sync_source;"
typedef enum SlotSyncSource
{
SLOT_SYNC_NONE,
SLOT_SYNC_WORKER,
SLOT_SYNC_BACKEND,
} SlotSyncSource;
Then, the check in ShutDownSlotSync can be:
+ /*
+ * Return if neither the slot sync worker is running nor the function
+ * pg_sync_replication_slots() is executing.
+ */
+ if ((SlotSyncCtx->pid == InvalidPid) &&
SlotSyncCtx->sync_source != SLOT_SYNC_BACKEND)
{
2.
SyncReplicationSlots(WalReceiverConn *wrconn)
{
+ /*
+ * Startup process signaled the slot sync to stop, so if meanwhile user
+ * has invoked slot sync SQL function, simply return.
+ */
+ SpinLockAcquire(&SlotSyncCtx->mutex);
+ if (SlotSyncCtx->stopSignaled)
+ {
+ ereport(LOG,
+ errmsg("skipping slot synchronization as slot sync
shutdown is signaled during promotion"));
+
Unless I'm missing something, I think this can't detect if the backend
via SQL function is already half-way through syncing in
synchronize_one_slot. So, better move this check to (or also have it
there) slot sync loop that calls synchronize_one_slot. To avoid
spinlock acquisitions, we can perhaps do this check in when we acquire
the spinlock for synced flag.
/* Search for the named slot */
if ((slot = SearchNamedReplicationSlot(remote_slot->name, true)))
{
bool synced;
SpinLockAcquire(&slot->mutex);
synced = slot->data.synced;
<< get SlotSyncCtx->stopSignaled here >>
SpinLockRelease(&slot->mutex);
<< do the slot sync skip check here if (stopSignaled) >>
3. Can we have a test or steps at least to check the consequences
manually one can get if slot syncing via SQL function is happening
during the promotion?
IIUC, we need to ensure there is no backend acquiring it and
performing sync while the slot sync worker is shutting down/standby
promotion is occuring. Otherwise, some of the slots can get resynced
and some are not while we are shutting down the slot sync worker as
part of the standby promotion which might leave the slots in an
inconsistent state.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Sat, Apr 6, 2024 at 11:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Apr 5, 2024 at 10:31 AM shveta malik <shveta.malik@gmail.com> wrote:
Please find v2. Changes are:
1) Rebased the patch as there was conflict due to recent commit 6f132ed.
2) Added an Assert in update_synced_slots_inactive_since() to ensure
that the slot does not have active_pid.
3) Improved commit msg and comments.Few comments: ============== 1. void SyncReplicationSlots(WalReceiverConn *wrconn) { + /* + * Startup process signaled the slot sync to stop, so if meanwhile user + * has invoked slot sync SQL function, simply return. + */ + SpinLockAcquire(&SlotSyncCtx->mutex); + if (SlotSyncCtx->stopSignaled) + { + ereport(LOG, + errmsg("skipping slot synchronization as slot sync shutdown is signaled during promotion")); + + SpinLockRelease(&SlotSyncCtx->mutex); + return; + } + SpinLockRelease(&SlotSyncCtx->mutex);There is a race condition with this code. Say during promotion
ShutDownSlotSync() is just before setting this flag and the user has
invoked pg_sync_replication_slots() and passed this check but still
didn't set the SlotSyncCtx->syncing flag. So, now, the promotion would
recognize that there is slot sync going on in parallel, and slot sync
wouldn't know that the promotion is in progress.
Did you mean that now, the promotion *would not* recognize...
I see, I will fix this.
The current coding for slot sync worker ensures there is no such race
by checking the stopSignaled and setting the pid together under
spinlock. I think we need to move the setting of the syncing flag
similarly. Once we do that probably checking SlotSyncCtx->syncing
should be sufficient in ShutDownSlotSync(). If we change the location
of setting the 'syncing' flag then please ensure its cleanup as we
currently do in slotsync_failure_callback().
Sure, let me review.
Show quoted text
2.
@@ -1395,6 +1395,7 @@ update_synced_slots_inactive_since(void)
if (s->in_use && s->data.synced)
{
Assert(SlotIsLogical(s));
+ Assert(s->active_pid == 0);We can add a comment like: "The slot must not be acquired by any process"
--
With Regards,
Amit Kapila.
On Sat, Apr 6, 2024 at 12:25 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
On Fri, Apr 5, 2024 at 10:31 AM shveta malik <shveta.malik@gmail.com> wrote:
Please find v2. Changes are:
Thanks for the patch. Here are some comments.
Thanks for reviewing.
1. Can we have a clear saying in the shmem variable who's syncing at
the moment? Is it a slot sync worker or a backend via SQL function?
Perhaps turn "bool syncing;" to "SlotSyncSource sync_source;"typedef enum SlotSyncSource
{
SLOT_SYNC_NONE,
SLOT_SYNC_WORKER,
SLOT_SYNC_BACKEND,
} SlotSyncSource;Then, the check in ShutDownSlotSync can be:
+ /* + * Return if neither the slot sync worker is running nor the function + * pg_sync_replication_slots() is executing. + */ + if ((SlotSyncCtx->pid == InvalidPid) && SlotSyncCtx->sync_source != SLOT_SYNC_BACKEND) {2. SyncReplicationSlots(WalReceiverConn *wrconn) { + /* + * Startup process signaled the slot sync to stop, so if meanwhile user + * has invoked slot sync SQL function, simply return. + */ + SpinLockAcquire(&SlotSyncCtx->mutex); + if (SlotSyncCtx->stopSignaled) + { + ereport(LOG, + errmsg("skipping slot synchronization as slot sync shutdown is signaled during promotion")); +Unless I'm missing something, I think this can't detect if the backend
via SQL function is already half-way through syncing in
synchronize_one_slot. So, better move this check to (or also have it
there) slot sync loop that calls synchronize_one_slot. To avoid
spinlock acquisitions, we can perhaps do this check in when we acquire
the spinlock for synced flag.
If the sync via SQL function is already half-way, then promotion
should wait for it to finish. I don't think it is a good idea to move
the check to synchronize_one_slot(). The sync-call should either not
start (if it noticed the promotion) or finish the sync and then let
promotion proceed. But I would like to know others' opinion on this.
/* Search for the named slot */
if ((slot = SearchNamedReplicationSlot(remote_slot->name, true)))
{
bool synced;SpinLockAcquire(&slot->mutex);
synced = slot->data.synced;
<< get SlotSyncCtx->stopSignaled here >>
SpinLockRelease(&slot->mutex);<< do the slot sync skip check here if (stopSignaled) >>
3. Can we have a test or steps at least to check the consequences
manually one can get if slot syncing via SQL function is happening
during the promotion?IIUC, we need to ensure there is no backend acquiring it and
performing sync while the slot sync worker is shutting down/standby
promotion is occuring. Otherwise, some of the slots can get resynced
and some are not while we are shutting down the slot sync worker as
part of the standby promotion which might leave the slots in an
inconsistent state.
I do not think that we can reach a state (exception is some error
scenario) where some of the slots are synced while the rest are not
during a *particular* sync-cycle only because promotion is going in
parallel. (And yes we need to fix the race-condition stated by Amit
up-thread for this statement to be true.)
thanks
Shveta
On Fri, Apr 12, 2024 at 7:47 AM shveta malik <shveta.malik@gmail.com> wrote:
On Sat, Apr 6, 2024 at 11:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
Few comments: ============== 1. void SyncReplicationSlots(WalReceiverConn *wrconn) { + /* + * Startup process signaled the slot sync to stop, so if meanwhile user + * has invoked slot sync SQL function, simply return. + */ + SpinLockAcquire(&SlotSyncCtx->mutex); + if (SlotSyncCtx->stopSignaled) + { + ereport(LOG, + errmsg("skipping slot synchronization as slot sync shutdown is signaled during promotion")); + + SpinLockRelease(&SlotSyncCtx->mutex); + return; + } + SpinLockRelease(&SlotSyncCtx->mutex);There is a race condition with this code. Say during promotion
ShutDownSlotSync() is just before setting this flag and the user has
invoked pg_sync_replication_slots() and passed this check but still
didn't set the SlotSyncCtx->syncing flag. So, now, the promotion would
recognize that there is slot sync going on in parallel, and slot sync
wouldn't know that the promotion is in progress.Did you mean that now, the promotion *would not* recognize...
Right.
I see, I will fix this.
Thanks.
--
With Regards,
Amit Kapila.
On Fri, Apr 12, 2024 at 7:57 AM shveta malik <shveta.malik@gmail.com> wrote:
On Sat, Apr 6, 2024 at 12:25 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:On Fri, Apr 5, 2024 at 10:31 AM shveta malik <shveta.malik@gmail.com> wrote:
Please find v2. Changes are:
Thanks for the patch. Here are some comments.
Thanks for reviewing.
1. Can we have a clear saying in the shmem variable who's syncing at
the moment? Is it a slot sync worker or a backend via SQL function?
Perhaps turn "bool syncing;" to "SlotSyncSource sync_source;"typedef enum SlotSyncSource
{
SLOT_SYNC_NONE,
SLOT_SYNC_WORKER,
SLOT_SYNC_BACKEND,
} SlotSyncSource;Then, the check in ShutDownSlotSync can be:
+ /* + * Return if neither the slot sync worker is running nor the function + * pg_sync_replication_slots() is executing. + */ + if ((SlotSyncCtx->pid == InvalidPid) && SlotSyncCtx->sync_source != SLOT_SYNC_BACKEND) {
I don't know if this will be help, especially after fixing the race
condition I mentioned. But otherwise, also, at this stage it doesn't
seem helpful to add the source of sync explicitly.
2. SyncReplicationSlots(WalReceiverConn *wrconn) { + /* + * Startup process signaled the slot sync to stop, so if meanwhile user + * has invoked slot sync SQL function, simply return. + */ + SpinLockAcquire(&SlotSyncCtx->mutex); + if (SlotSyncCtx->stopSignaled) + { + ereport(LOG, + errmsg("skipping slot synchronization as slot sync shutdown is signaled during promotion")); +Unless I'm missing something, I think this can't detect if the backend
via SQL function is already half-way through syncing in
synchronize_one_slot. So, better move this check to (or also have it
there) slot sync loop that calls synchronize_one_slot. To avoid
spinlock acquisitions, we can perhaps do this check in when we acquire
the spinlock for synced flag.If the sync via SQL function is already half-way, then promotion
should wait for it to finish. I don't think it is a good idea to move
the check to synchronize_one_slot(). The sync-call should either not
start (if it noticed the promotion) or finish the sync and then let
promotion proceed. But I would like to know others' opinion on this.
Agreed.
--
With Regards,
Amit Kapila.
On Fri, Apr 12, 2024 at 4:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Apr 12, 2024 at 7:57 AM shveta malik <shveta.malik@gmail.com> wrote:
On Sat, Apr 6, 2024 at 12:25 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:On Fri, Apr 5, 2024 at 10:31 AM shveta malik <shveta.malik@gmail.com> wrote:
Please find v2. Changes are:
Thanks for the patch. Here are some comments.
Thanks for reviewing.
1. Can we have a clear saying in the shmem variable who's syncing at
the moment? Is it a slot sync worker or a backend via SQL function?
Perhaps turn "bool syncing;" to "SlotSyncSource sync_source;"typedef enum SlotSyncSource
{
SLOT_SYNC_NONE,
SLOT_SYNC_WORKER,
SLOT_SYNC_BACKEND,
} SlotSyncSource;Then, the check in ShutDownSlotSync can be:
+ /* + * Return if neither the slot sync worker is running nor the function + * pg_sync_replication_slots() is executing. + */ + if ((SlotSyncCtx->pid == InvalidPid) && SlotSyncCtx->sync_source != SLOT_SYNC_BACKEND) {I don't know if this will be help, especially after fixing the race
condition I mentioned. But otherwise, also, at this stage it doesn't
seem helpful to add the source of sync explicitly.
Agreed.
Please find v3 addressing race-condition and one other comment.
Up-thread it was suggested that, probably, checking
SlotSyncCtx->syncing should be sufficient in ShutDownSlotSync(). On
re-thinking, it might not be. Slot sync worker sets and resets
'syncing' with each sync-cycle, and thus we need to rely on worker's
pid in ShutDownSlotSync(), as there could be a window where promotion
is triggered and 'syncing' is not set for worker, while the worker is
still running. This implementation of setting and resetting syncing
with each sync-cycle looks better as compared to setting syncing
during the entire life-cycle of the worker. So, I did not change it.
To fix the race condition, I moved the setting of the 'syncing' flag
together with the 'stopSignaled' check under the same spinLock for the
SQL function. OTOH, for worker, I feel it is good to check
'stopSignaled' at the beginning itself, while retaining the
setting/resetting of 'syncing' at a later stage during the actual sync
cycle. This makes handling for SQL function and worker slightly
different. And thus to achieve this, I had to take the 'syncing' flag
handling out of synchronize_slots() and move it to both worker and SQL
function by introducing 2 new functions check_and_set_syncing_flag()
and reset_syncing_flag().
I am analyzing if there are better ways to achieve this, any
suggestions are welcome.
thanks
Shveta
Attachments:
v3-0001-Handle-stopSignaled-during-sync-function-call.patchapplication/octet-stream; name=v3-0001-Handle-stopSignaled-during-sync-function-call.patchDownload+90-31
On Fri, Apr 12, 2024 at 5:25 PM shveta malik <shveta.malik@gmail.com> wrote:
Please find v3 addressing race-condition and one other comment.
Up-thread it was suggested that, probably, checking
SlotSyncCtx->syncing should be sufficient in ShutDownSlotSync(). On
re-thinking, it might not be. Slot sync worker sets and resets
'syncing' with each sync-cycle, and thus we need to rely on worker's
pid in ShutDownSlotSync(), as there could be a window where promotion
is triggered and 'syncing' is not set for worker, while the worker is
still running. This implementation of setting and resetting syncing
with each sync-cycle looks better as compared to setting syncing
during the entire life-cycle of the worker. So, I did not change it.
To retain this we need to have different handling for 'syncing' for
workers and function which seems like more maintenance burden than the
value it provides. Moreover, in SyncReplicationSlots(), we are calling
a function after acquiring spinlock which is not our usual coding
practice.
One minor comment:
* All the fields except 'syncing' are used only by slotsync worker.
* 'syncing' is used both by worker and SQL function pg_sync_replication_slots.
*/
typedef struct SlotSyncCtxStruct
{
pid_t pid;
bool stopSignaled;
bool syncing;
time_t last_start_time;
slock_t mutex;
} SlotSyncCtxStruct;
I feel the above comment is no longer valid after this patch. We can
probably remove this altogether.
--
With Regards,
Amit Kapila.
On Mon, Apr 15, 2024 at 2:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Apr 12, 2024 at 5:25 PM shveta malik <shveta.malik@gmail.com> wrote:
Please find v3 addressing race-condition and one other comment.
Up-thread it was suggested that, probably, checking
SlotSyncCtx->syncing should be sufficient in ShutDownSlotSync(). On
re-thinking, it might not be. Slot sync worker sets and resets
'syncing' with each sync-cycle, and thus we need to rely on worker's
pid in ShutDownSlotSync(), as there could be a window where promotion
is triggered and 'syncing' is not set for worker, while the worker is
still running. This implementation of setting and resetting syncing
with each sync-cycle looks better as compared to setting syncing
during the entire life-cycle of the worker. So, I did not change it.To retain this we need to have different handling for 'syncing' for
workers and function which seems like more maintenance burden than the
value it provides. Moreover, in SyncReplicationSlots(), we are calling
a function after acquiring spinlock which is not our usual coding
practice.
Okay. Changed it to consistent handling. Now both worker and SQL
function set 'syncing' when they start and reset it when they exit.
One minor comment:
* All the fields except 'syncing' are used only by slotsync worker.
* 'syncing' is used both by worker and SQL function pg_sync_replication_slots.
*/
typedef struct SlotSyncCtxStruct
{
pid_t pid;
bool stopSignaled;
bool syncing;
time_t last_start_time;
slock_t mutex;
} SlotSyncCtxStruct;I feel the above comment is no longer valid after this patch. We can
probably remove this altogether.
Yes, changed.
Please find v4 addressing the above comments.
thanks
Shveta
Attachments:
v4-0001-Handle-stopSignaled-during-sync-function-call.patchapplication/octet-stream; name=v4-0001-Handle-stopSignaled-during-sync-function-call.patchDownload+122-75
Hi,
On Mon, Apr 15, 2024 at 03:38:53PM +0530, shveta malik wrote:
On Mon, Apr 15, 2024 at 2:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Apr 12, 2024 at 5:25 PM shveta malik <shveta.malik@gmail.com> wrote:
Please find v3 addressing race-condition and one other comment.
Up-thread it was suggested that, probably, checking
SlotSyncCtx->syncing should be sufficient in ShutDownSlotSync(). On
re-thinking, it might not be. Slot sync worker sets and resets
'syncing' with each sync-cycle, and thus we need to rely on worker's
pid in ShutDownSlotSync(), as there could be a window where promotion
is triggered and 'syncing' is not set for worker, while the worker is
still running. This implementation of setting and resetting syncing
with each sync-cycle looks better as compared to setting syncing
during the entire life-cycle of the worker. So, I did not change it.To retain this we need to have different handling for 'syncing' for
workers and function which seems like more maintenance burden than the
value it provides. Moreover, in SyncReplicationSlots(), we are calling
a function after acquiring spinlock which is not our usual coding
practice.Okay. Changed it to consistent handling.
Thanks for the patch!
Now both worker and SQL
function set 'syncing' when they start and reset it when they exit.
It means that it's not possible anymore to trigger a manual sync if
sync_replication_slots is on. Indeed that would trigger:
postgres=# select pg_sync_replication_slots();
ERROR: cannot synchronize replication slots concurrently
That looks like an issue to me, thoughts?
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Mon, Apr 15, 2024 at 6:21 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
On Mon, Apr 15, 2024 at 03:38:53PM +0530, shveta malik wrote:
On Mon, Apr 15, 2024 at 2:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Apr 12, 2024 at 5:25 PM shveta malik <shveta.malik@gmail.com> wrote:
Please find v3 addressing race-condition and one other comment.
Up-thread it was suggested that, probably, checking
SlotSyncCtx->syncing should be sufficient in ShutDownSlotSync(). On
re-thinking, it might not be. Slot sync worker sets and resets
'syncing' with each sync-cycle, and thus we need to rely on worker's
pid in ShutDownSlotSync(), as there could be a window where promotion
is triggered and 'syncing' is not set for worker, while the worker is
still running. This implementation of setting and resetting syncing
with each sync-cycle looks better as compared to setting syncing
during the entire life-cycle of the worker. So, I did not change it.To retain this we need to have different handling for 'syncing' for
workers and function which seems like more maintenance burden than the
value it provides. Moreover, in SyncReplicationSlots(), we are calling
a function after acquiring spinlock which is not our usual coding
practice.Okay. Changed it to consistent handling.
Thanks for the patch!
Now both worker and SQL
function set 'syncing' when they start and reset it when they exit.It means that it's not possible anymore to trigger a manual sync if
sync_replication_slots is on. Indeed that would trigger:postgres=# select pg_sync_replication_slots();
ERROR: cannot synchronize replication slots concurrentlyThat looks like an issue to me, thoughts?
This is intentional as of now for the sake of keeping
implementation/code simple. It is not difficult to allow them but I am
not sure whether we want to add another set of conditions allowing
them in parallel. And that too in an unpredictable way as the API will
work only for the time slot sync worker is not performing the sync.
--
With Regards,
Amit Kapila.
Hi,
On Mon, Apr 15, 2024 at 06:29:49PM +0530, Amit Kapila wrote:
On Mon, Apr 15, 2024 at 6:21 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:On Mon, Apr 15, 2024 at 03:38:53PM +0530, shveta malik wrote:
Now both worker and SQL
function set 'syncing' when they start and reset it when they exit.It means that it's not possible anymore to trigger a manual sync if
sync_replication_slots is on. Indeed that would trigger:postgres=# select pg_sync_replication_slots();
ERROR: cannot synchronize replication slots concurrentlyThat looks like an issue to me, thoughts?
This is intentional as of now for the sake of keeping
implementation/code simple. It is not difficult to allow them but I am
not sure whether we want to add another set of conditions allowing
them in parallel.
I think that the ability to launch a manual sync before a switchover would be
missed. Except for this case I don't think that's an issue to prevent them to
run in parallel.
And that too in an unpredictable way as the API will
work only for the time slot sync worker is not performing the sync.
Yeah but then at least you would know that there is "really" a sync in progress
(which is not the case currently with v4, as the sync worker being started is
enough to prevent a manual sync even if a sync is not in progress).
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Mon, Apr 15, 2024 at 7:47 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
On Mon, Apr 15, 2024 at 06:29:49PM +0530, Amit Kapila wrote:
On Mon, Apr 15, 2024 at 6:21 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:On Mon, Apr 15, 2024 at 03:38:53PM +0530, shveta malik wrote:
Now both worker and SQL
function set 'syncing' when they start and reset it when they exit.It means that it's not possible anymore to trigger a manual sync if
sync_replication_slots is on. Indeed that would trigger:postgres=# select pg_sync_replication_slots();
ERROR: cannot synchronize replication slots concurrentlyThat looks like an issue to me, thoughts?
This is intentional as of now for the sake of keeping
implementation/code simple. It is not difficult to allow them but I am
not sure whether we want to add another set of conditions allowing
them in parallel.I think that the ability to launch a manual sync before a switchover would be
missed. Except for this case I don't think that's an issue to prevent them to
run in parallel.
I think if the slotsync worker is available, it can do that as well.
There is no clear use case for allowing them in parallel and I feel it
would add more confusion when it can work sometimes but not other
times. However, if we receive some report from the field where there
is a real demand for such a thing, it should be easy to achieve. For
example, I can imagine that we can have sync_state that has values
'started', 'in_progress' , and 'finished'. This should allow us to
achieve what the current proposed patch is doing along with allowing
the API to work in parallel when the sync_state is not 'in_progress'.
I think for now let's restrict their usage in parallel and make the
promotion behavior consistent both for worker and API.
--
With Regards,
Amit Kapila.
On Monday, April 15, 2024 6:09 PM shveta malik <shveta.malik@gmail.com> wrote:
Please find v4 addressing the above comments.
Thanks for the patch.
Here are few comments:
1.
+ ereport(ERROR,
+ errmsg("promotion in progress, can not synchronize replication slots"));
+ }
I think an errcode is needed.
The style of the error message seems a bit unnatural to me. I suggest:
"cannot synchronize replication slots when standby promotion is ongoing"
2.
+ if (worker_pid != InvalidPid)
+ Assert(SlotSyncCtx->pid == InvalidPid);
We could merge the checks into one Assert().
Assert(SlotSyncCtx->pid == InvalidPid || worker_pid == InvalidPid);
3.
- pqsignal(SIGINT, SignalHandlerForShutdownRequest);
I realized that we should register this before setting SlotSyncCtx->pid,
otherwise if the standby is promoted after setting pid and before registering
signal handle function, the slotsync worker could miss to handle SIGINT sent by
startup process(ShutDownSlotSync). This is an existing issue for slotsync
worker, but maybe we could fix it together with the patch.
Best Regards,
Hou zj
On Tue, Apr 16, 2024 at 9:27 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
On Monday, April 15, 2024 6:09 PM shveta malik <shveta.malik@gmail.com> wrote:
Please find v4 addressing the above comments.
Thanks for the patch.
Here are few comments:
Thanks for reviewing the patch.
1.
+ ereport(ERROR, + errmsg("promotion in progress, can not synchronize replication slots")); + }I think an errcode is needed.
The style of the error message seems a bit unnatural to me. I suggest:
"cannot synchronize replication slots when standby promotion is ongoing"
Modified.
2.
+ if (worker_pid != InvalidPid) + Assert(SlotSyncCtx->pid == InvalidPid);We could merge the checks into one Assert().
Assert(SlotSyncCtx->pid == InvalidPid || worker_pid == InvalidPid);
Modified.
3.
- pqsignal(SIGINT, SignalHandlerForShutdownRequest);
I realized that we should register this before setting SlotSyncCtx->pid,
otherwise if the standby is promoted after setting pid and before registering
signal handle function, the slotsync worker could miss to handle SIGINT sent by
startup process(ShutDownSlotSync). This is an existing issue for slotsync
worker, but maybe we could fix it together with the patch.
Yes, it seems like a problem. Fixed it. Also to be consistent, moved
other signal handlers' registration as well before we set pid.
Please find v5 addressing above comments.
thanks
Shveta
Attachments:
v5-0001-Handle-stopSignaled-during-sync-function-call.patchapplication/octet-stream; name=v5-0001-Handle-stopSignaled-during-sync-function-call.patchDownload+124-77
Hi,
On Tue, Apr 16, 2024 at 08:21:04AM +0530, Amit Kapila wrote:
On Mon, Apr 15, 2024 at 7:47 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:On Mon, Apr 15, 2024 at 06:29:49PM +0530, Amit Kapila wrote:
On Mon, Apr 15, 2024 at 6:21 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:On Mon, Apr 15, 2024 at 03:38:53PM +0530, shveta malik wrote:
Now both worker and SQL
function set 'syncing' when they start and reset it when they exit.It means that it's not possible anymore to trigger a manual sync if
sync_replication_slots is on. Indeed that would trigger:postgres=# select pg_sync_replication_slots();
ERROR: cannot synchronize replication slots concurrentlyThat looks like an issue to me, thoughts?
This is intentional as of now for the sake of keeping
implementation/code simple. It is not difficult to allow them but I am
not sure whether we want to add another set of conditions allowing
them in parallel.I think that the ability to launch a manual sync before a switchover would be
missed. Except for this case I don't think that's an issue to prevent them to
run in parallel.I think if the slotsync worker is available, it can do that as well.
Right, but one has no control as to when the sync is triggered.
There is no clear use case for allowing them in parallel and I feel it
would add more confusion when it can work sometimes but not other
times. However, if we receive some report from the field where there
is a real demand for such a thing, it should be easy to achieve. For
example, I can imagine that we can have sync_state that has values
'started', 'in_progress' , and 'finished'. This should allow us to
achieve what the current proposed patch is doing along with allowing
the API to work in parallel when the sync_state is not 'in_progress'.I think for now let's restrict their usage in parallel and make the
promotion behavior consistent both for worker and API.
Okay, let's do it that way. Is it worth to add a few words in the doc related to
pg_sync_replication_slots() though? (to mention it can not be used if the sync
slot worker is running).
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Tue, Apr 16, 2024 at 12:03 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
On Tue, Apr 16, 2024 at 08:21:04AM +0530, Amit Kapila wrote:
There is no clear use case for allowing them in parallel and I feel it
would add more confusion when it can work sometimes but not other
times. However, if we receive some report from the field where there
is a real demand for such a thing, it should be easy to achieve. For
example, I can imagine that we can have sync_state that has values
'started', 'in_progress' , and 'finished'. This should allow us to
achieve what the current proposed patch is doing along with allowing
the API to work in parallel when the sync_state is not 'in_progress'.I think for now let's restrict their usage in parallel and make the
promotion behavior consistent both for worker and API.Okay, let's do it that way. Is it worth to add a few words in the doc related to
pg_sync_replication_slots() though? (to mention it can not be used if the sync
slot worker is running).
Yes, this makes sense to me.
--
With Regards,
Amit Kapila.