How can end users know the cause of LR slot sync delays?

Started by Ashutosh Sharma6 months ago75 messages
Jump to latest
#1Ashutosh Sharma
ashu.coek88@gmail.com

Hi,

We have seen cases where slot synchronization gets delayed, for example
when the slot is behind the failover standby or vice versa, and the slot
sync worker has to wait for one to catch up with the other. During this
waiting period, users querying pg_replication_slots can only see whether
the slot has been synchronized or not. If it has already synchronized,
that’s fine, but if synchronization is taking longer, users would naturally
want to understand the reason for the delay.

Is there a way for end users to know the cause of slot synchronization
delays, so they can take appropriate actions to speed it up?

I understand that server logs are emitted in such cases, but logs are not
something end users would want to check regularly. Moreover, since logging
is configuration-based, relevant messages may sometimes be skipped or
suppressed.

Thanks & Regards,
Ashutosh Sharma.

#2Amit Kapila
amit.kapila16@gmail.com
In reply to: Ashutosh Sharma (#1)
Re: How can end users know the cause of LR slot sync delays?

On Thu, Aug 28, 2025 at 11:07 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

We have seen cases where slot synchronization gets delayed, for example when the slot is behind the failover standby or vice versa, and the slot sync worker has to wait for one to catch up with the other. During this waiting period, users querying pg_replication_slots can only see whether the slot has been synchronized or not. If it has already synchronized, that’s fine, but if synchronization is taking longer, users would naturally want to understand the reason for the delay.

Is there a way for end users to know the cause of slot synchronization delays, so they can take appropriate actions to speed it up?

I understand that server logs are emitted in such cases, but logs are not something end users would want to check regularly. Moreover, since logging is configuration-based, relevant messages may sometimes be skipped or suppressed.

Currently, the way to see the reason for sync skip is LOGs but I think
it is better to add a new column like sync_skip_reason in
pg_replication_slots. This can show the reasons like
standby_LSN_ahead_remote_LSN. I think ideally users can compare
standby's slot LSN/XMIN with remote_slot being synced. Do you have any
better ideas?

--
With Regards,
Amit Kapila.

#3Kirill Reshke
reshkekirill@gmail.com
In reply to: Amit Kapila (#2)
Re: How can end users know the cause of LR slot sync delays?

On Thu, 28 Aug 2025 at 14:56, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Aug 28, 2025 at 11:07 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

We have seen cases where slot synchronization gets delayed, for example when the slot is behind the failover standby or vice versa, and the slot sync worker has to wait for one to catch up with the other. During this waiting period, users querying pg_replication_slots can only see whether the slot has been synchronized or not. If it has already synchronized, that’s fine, but if synchronization is taking longer, users would naturally want to understand the reason for the delay.

Is there a way for end users to know the cause of slot synchronization delays, so they can take appropriate actions to speed it up?

I understand that server logs are emitted in such cases, but logs are not something end users would want to check regularly. Moreover, since logging is configuration-based, relevant messages may sometimes be skipped or suppressed.

Currently, the way to see the reason for sync skip is LOGs but I think
it is better to add a new column like sync_skip_reason in
pg_replication_slots. This can show the reasons like
standby_LSN_ahead_remote_LSN. I think ideally users can compare
standby's slot LSN/XMIN with remote_slot being synced. Do you have any
better ideas?

How about something like pg_stat_progress_replication_slot with remote
LSN/standby LSN/catalog XID etc?
Wouldn't this be in sync with all other debug pg_stat_progress* views
and thus more Postgres-y?

--
Best regards,
Kirill Reshke

#4Amit Kapila
amit.kapila16@gmail.com
In reply to: Kirill Reshke (#3)
Re: How can end users know the cause of LR slot sync delays?

On Thu, Aug 28, 2025 at 3:29 PM Kirill Reshke <reshkekirill@gmail.com> wrote:

On Thu, 28 Aug 2025 at 14:56, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Aug 28, 2025 at 11:07 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

We have seen cases where slot synchronization gets delayed, for example when the slot is behind the failover standby or vice versa, and the slot sync worker has to wait for one to catch up with the other. During this waiting period, users querying pg_replication_slots can only see whether the slot has been synchronized or not. If it has already synchronized, that’s fine, but if synchronization is taking longer, users would naturally want to understand the reason for the delay.

Is there a way for end users to know the cause of slot synchronization delays, so they can take appropriate actions to speed it up?

I understand that server logs are emitted in such cases, but logs are not something end users would want to check regularly. Moreover, since logging is configuration-based, relevant messages may sometimes be skipped or suppressed.

Currently, the way to see the reason for sync skip is LOGs but I think
it is better to add a new column like sync_skip_reason in
pg_replication_slots. This can show the reasons like
standby_LSN_ahead_remote_LSN. I think ideally users can compare
standby's slot LSN/XMIN with remote_slot being synced. Do you have any
better ideas?

How about something like pg_stat_progress_replication_slot with remote
LSN/standby LSN/catalog XID etc?
Wouldn't this be in sync with all other debug pg_stat_progress* views
and thus more Postgres-y?

Yes, that is another option. I am a little worried that it is not
always the sync lags behind, so having a separate view just for sync
progress may be too much. Yet another option is existing view
pg_stat_replication_slots but it seems sync progress doesn't directly
match there. For example, we can add a counter sync_skipped, time of
last sync_skip, and last_sync_skip_reason that could be sufficient to
dig the problem further.

--
With Regards,
Amit Kapila.

#5Ashutosh Sharma
ashu.coek88@gmail.com
In reply to: Amit Kapila (#2)
Re: How can end users know the cause of LR slot sync delays?

Hi Amit,

On Thu, Aug 28, 2025 at 3:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Aug 28, 2025 at 11:07 AM Ashutosh Sharma <ashu.coek88@gmail.com>
wrote:

We have seen cases where slot synchronization gets delayed, for example

when the slot is behind the failover standby or vice versa, and the slot
sync worker has to wait for one to catch up with the other. During this
waiting period, users querying pg_replication_slots can only see whether
the slot has been synchronized or not. If it has already synchronized,
that’s fine, but if synchronization is taking longer, users would naturally
want to understand the reason for the delay.

Is there a way for end users to know the cause of slot synchronization

delays, so they can take appropriate actions to speed it up?

I understand that server logs are emitted in such cases, but logs are

not something end users would want to check regularly. Moreover, since
logging is configuration-based, relevant messages may sometimes be skipped
or suppressed.

Currently, the way to see the reason for sync skip is LOGs but I think
it is better to add a new column like sync_skip_reason in
pg_replication_slots. This can show the reasons like
standby_LSN_ahead_remote_LSN. I think ideally users can compare
standby's slot LSN/XMIN with remote_slot being synced. Do you have any
better ideas?

I have similar thoughts, but for clarity, I’d like to outline some of the
key steps I plan to take:

Step 1: Define an enum for all possible reasons a slot persistence was
skipped.

/*
* Reasons why a replication slot sync was skipped.
*/
typedef enum ReplicationSlotSyncSkipReason
{
RS_SYNC_SKIP_NONE = 0, /* No skip */

RS_SYNC_SKIP_REMOTE_BEHIND = (1 << 0), /* Remote slot is behind local
reserved LSN */

RS_SYNC_SKIP_DATA_LOSS = (1 << 1), /* Local slot ahead of remote,
risk of data loss */

RS_SYNC_SKIP_NO_SNAPSHOT = (1 << 2) /* Standby could not build a
consistent snapshot */
} ReplicationSlotSyncSkipReason;

--

Step 2: Introduce new column to pg_replication_slots to store the skip
reason

/* Inside pg_replication_slots table */
ReplicationSlotSyncSkipReason slot_sync_skip_reason;

--

Step 3: Function to convert enum to human-readable string that can be
stored in pg_replication_slots.

/*
* Convert ReplicationSlotSyncSkipReason bitmask to human-readable string.
*
* Returns a palloc'd string; caller is responsible for freeing it.
*/
static char *
replication_slot_sync_skip_reason_str(ReplicationSlotSyncSkipReason reason)
{
StringInfoData buf;
initStringInfo(&buf);

if (reason == RS_SYNC_SKIP_NONE)
{
appendStringInfoString(&buf, "none");
return buf.data;
}

if (reason & RS_SYNC_SKIP_REMOTE_BEHIND)
appendStringInfoString(&buf, "remote_behind|");
if (reason & RS_SYNC_SKIP_DATA_LOSS)
appendStringInfoString(&buf, "data_loss|");
if (reason & RS_SYNC_SKIP_NO_SNAPSHOT)
appendStringInfoString(&buf, "no_snapshot|");

/* Remove trailing '|' */
if (buf.len > 0 && buf.data[buf.len - 1] == '|')
buf.data[buf.len - 1] = '\0';

return buf.data;
}

--

Step 4: Capture slot_sync_skip_reason whenever the relevant LOG messages
are generated, primarily inside update_local_synced_slot or
update_and_persist_local_synced_slot. This value will can later be
persisted in the pg_replication_slots catalog.

--

Please let me know if you have any objections. I’ll share the wip patch in
a few days.

--
With Regards,
Ashutosh Sharma.

#6Shlok Kyal
shlok.kyal.oss@gmail.com
In reply to: Ashutosh Sharma (#5)
Re: How can end users know the cause of LR slot sync delays?

Hi Ashutosh,

I am also interested in this thread. And was working on a patch for it.

On Wed, 3 Sept 2025 at 17:52, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

Hi Amit,

On Thu, Aug 28, 2025 at 3:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Aug 28, 2025 at 11:07 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

We have seen cases where slot synchronization gets delayed, for example when the slot is behind the failover standby or vice versa, and the slot sync worker has to wait for one to catch up with the other. During this waiting period, users querying pg_replication_slots can only see whether the slot has been synchronized or not. If it has already synchronized, that’s fine, but if synchronization is taking longer, users would naturally want to understand the reason for the delay.

Is there a way for end users to know the cause of slot synchronization delays, so they can take appropriate actions to speed it up?

I understand that server logs are emitted in such cases, but logs are not something end users would want to check regularly. Moreover, since logging is configuration-based, relevant messages may sometimes be skipped or suppressed.

Currently, the way to see the reason for sync skip is LOGs but I think
it is better to add a new column like sync_skip_reason in
pg_replication_slots. This can show the reasons like
standby_LSN_ahead_remote_LSN. I think ideally users can compare
standby's slot LSN/XMIN with remote_slot being synced. Do you have any
better ideas?

I have similar thoughts, but for clarity, I’d like to outline some of the key steps I plan to take:

Step 1: Define an enum for all possible reasons a slot persistence was skipped.

/*
* Reasons why a replication slot sync was skipped.
*/
typedef enum ReplicationSlotSyncSkipReason
{
RS_SYNC_SKIP_NONE = 0, /* No skip */

RS_SYNC_SKIP_REMOTE_BEHIND = (1 << 0), /* Remote slot is behind local reserved LSN */

RS_SYNC_SKIP_DATA_LOSS = (1 << 1), /* Local slot ahead of remote, risk of data loss */

RS_SYNC_SKIP_NO_SNAPSHOT = (1 << 2) /* Standby could not build a consistent snapshot */
} ReplicationSlotSyncSkipReason;

--

I think we should also add the case when "remote_slot->confirmed_lsn >
latestFlushPtr" (WAL corresponding to the confirmed lsn on remote slot
is still not flushed on the Standby). In this case as well we are
skipping the slot sync.

Step 2: Introduce new column to pg_replication_slots to store the skip reason

/* Inside pg_replication_slots table */
ReplicationSlotSyncSkipReason slot_sync_skip_reason;

--

As per the discussion [1], I think it is more of stat related data and
we should add it in the pg_stat_replication_slots view. Also we can
add columns for 'slot sync skip count' and 'last slot sync skip'.
Thoughts?

Step 3: Function to convert enum to human-readable string that can be stored in pg_replication_slots.

/*
* Convert ReplicationSlotSyncSkipReason bitmask to human-readable string.
*
* Returns a palloc'd string; caller is responsible for freeing it.
*/
static char *
replication_slot_sync_skip_reason_str(ReplicationSlotSyncSkipReason reason)
{
StringInfoData buf;
initStringInfo(&buf);

if (reason == RS_SYNC_SKIP_NONE)
{
appendStringInfoString(&buf, "none");
return buf.data;
}

if (reason & RS_SYNC_SKIP_REMOTE_BEHIND)
appendStringInfoString(&buf, "remote_behind|");
if (reason & RS_SYNC_SKIP_DATA_LOSS)
appendStringInfoString(&buf, "data_loss|");
if (reason & RS_SYNC_SKIP_NO_SNAPSHOT)
appendStringInfoString(&buf, "no_snapshot|");

/* Remove trailing '|' */
if (buf.len > 0 && buf.data[buf.len - 1] == '|')
buf.data[buf.len - 1] = '\0';

return buf.data;
}

--

Why are we showing the cause of the slot sync delay as an aggregate of
all causes occuring? I thought we should show the reason for the last
slot sync delay?

Step 4: Capture slot_sync_skip_reason whenever the relevant LOG messages are generated, primarily inside update_local_synced_slot or update_and_persist_local_synced_slot. This value will can later be persisted in the pg_replication_slots catalog.

--

Please let me know if you have any objections. I’ll share the wip patch in a few days.

--

I have attached a patch which I have worked on.

Thanks,
Shlok Kyal

Attachments:

v1-0001-Add-stats-related-to-slot-sync-skip.patchapplication/octet-stream; name=v1-0001-Add-stats-related-to-slot-sync-skip.patchDownload+144-13
#7Ashutosh Sharma
ashu.coek88@gmail.com
In reply to: Shlok Kyal (#6)
Re: How can end users know the cause of LR slot sync delays?

Hi Shlok,

Good to hear that you’re also interested in working on this task.

On Thu, Sep 4, 2025 at 8:26 PM Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:

Hi Ashutosh,

I am also interested in this thread. And was working on a patch for it.

On Wed, 3 Sept 2025 at 17:52, Ashutosh Sharma <ashu.coek88@gmail.com>
wrote:

Hi Amit,

On Thu, Aug 28, 2025 at 3:26 PM Amit Kapila <amit.kapila16@gmail.com>

wrote:

On Thu, Aug 28, 2025 at 11:07 AM Ashutosh Sharma <ashu.coek88@gmail.com>

wrote:

We have seen cases where slot synchronization gets delayed, for

example when the slot is behind the failover standby or vice versa, and the
slot sync worker has to wait for one to catch up with the other. During
this waiting period, users querying pg_replication_slots can only see
whether the slot has been synchronized or not. If it has already
synchronized, that’s fine, but if synchronization is taking longer, users
would naturally want to understand the reason for the delay.

Is there a way for end users to know the cause of slot

synchronization delays, so they can take appropriate actions to speed it up?

I understand that server logs are emitted in such cases, but logs are

not something end users would want to check regularly. Moreover, since
logging is configuration-based, relevant messages may sometimes be skipped
or suppressed.

Currently, the way to see the reason for sync skip is LOGs but I think
it is better to add a new column like sync_skip_reason in
pg_replication_slots. This can show the reasons like
standby_LSN_ahead_remote_LSN. I think ideally users can compare
standby's slot LSN/XMIN with remote_slot being synced. Do you have any
better ideas?

I have similar thoughts, but for clarity, I’d like to outline some of

the key steps I plan to take:

Step 1: Define an enum for all possible reasons a slot persistence was

skipped.

/*
* Reasons why a replication slot sync was skipped.
*/
typedef enum ReplicationSlotSyncSkipReason
{
RS_SYNC_SKIP_NONE = 0, /* No skip */

RS_SYNC_SKIP_REMOTE_BEHIND = (1 << 0), /* Remote slot is behind

local reserved LSN */

RS_SYNC_SKIP_DATA_LOSS = (1 << 1), /* Local slot ahead of

remote, risk of data loss */

RS_SYNC_SKIP_NO_SNAPSHOT = (1 << 2) /* Standby could not build a

consistent snapshot */

} ReplicationSlotSyncSkipReason;

--

I think we should also add the case when "remote_slot->confirmed_lsn >
latestFlushPtr" (WAL corresponding to the confirmed lsn on remote slot
is still not flushed on the Standby). In this case as well we are
skipping the slot sync.

Yes, we can include this case as well.

Step 2: Introduce new column to pg_replication_slots to store the skip

reason

/* Inside pg_replication_slots table */
ReplicationSlotSyncSkipReason slot_sync_skip_reason;

--

As per the discussion [1], I think it is more of stat related data and
we should add it in the pg_stat_replication_slots view. Also we can
add columns for 'slot sync skip count' and 'last slot sync skip'.
Thoughts?

It’s not a bad choice, but what makes it a bit confusing for me is that
some of the slot sync information is stored in pg_replication_slots, while
some is in pg_stat_replication_slots.

Is there a possibility that when an end user queries pg_replication_slots,
it shows a particular slot as synced, but querying
pg_stat_replication_slots instead reveals a sync skip reason, or the other
way around?

Moreover, these views are primary data sources for end users, and the
information is critical for their operations. Splitting related information
across multiple views could increase the complexity of their queries.

Step 3: Function to convert enum to human-readable string that can be

stored in pg_replication_slots.

/*
* Convert ReplicationSlotSyncSkipReason bitmask to human-readable

string.

*
* Returns a palloc'd string; caller is responsible for freeing it.
*/
static char *
replication_slot_sync_skip_reason_str(ReplicationSlotSyncSkipReason

reason)

{
StringInfoData buf;
initStringInfo(&buf);

if (reason == RS_SYNC_SKIP_NONE)
{
appendStringInfoString(&buf, "none");
return buf.data;
}

if (reason & RS_SYNC_SKIP_REMOTE_BEHIND)
appendStringInfoString(&buf, "remote_behind|");
if (reason & RS_SYNC_SKIP_DATA_LOSS)
appendStringInfoString(&buf, "data_loss|");
if (reason & RS_SYNC_SKIP_NO_SNAPSHOT)
appendStringInfoString(&buf, "no_snapshot|");

/* Remove trailing '|' */
if (buf.len > 0 && buf.data[buf.len - 1] == '|')
buf.data[buf.len - 1] = '\0';

return buf.data;
}

--

Why are we showing the cause of the slot sync delay as an aggregate of
all causes occuring? I thought we should show the reason for the last
slot sync delay?

Yes we should just be showing the reason for the last sync skip, no
aggregation is needed here.

Step 4: Capture slot_sync_skip_reason whenever the relevant LOG messages

are generated, primarily inside update_local_synced_slot or
update_and_persist_local_synced_slot. This value will can later be
persisted in the pg_replication_slots catalog.

--

Please let me know if you have any objections. I’ll share the wip patch

in a few days.

--

I have attached a patch which I have worked on.

Thanks, I will look into it, in fact I have already looked into it, but
before I make any comments, I think maybe we should try to finalize the
approach first.

--
With Regards,
Ashutosh Sharma.

#8Amit Kapila
amit.kapila16@gmail.com
In reply to: Ashutosh Sharma (#7)
Re: How can end users know the cause of LR slot sync delays?

On Fri, Sep 5, 2025 at 12:50 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

Good to hear that you’re also interested in working on this task.

On Thu, Sep 4, 2025 at 8:26 PM Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:

Hi Ashutosh,

I am also interested in this thread. And was working on a patch for it.

On Wed, 3 Sept 2025 at 17:52, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

Hi Amit,

On Thu, Aug 28, 2025 at 3:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Aug 28, 2025 at 11:07 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

We have seen cases where slot synchronization gets delayed, for example when the slot is behind the failover standby or vice versa, and the slot sync worker has to wait for one to catch up with the other. During this waiting period, users querying pg_replication_slots can only see whether the slot has been synchronized or not. If it has already synchronized, that’s fine, but if synchronization is taking longer, users would naturally want to understand the reason for the delay.

Is there a way for end users to know the cause of slot synchronization delays, so they can take appropriate actions to speed it up?

I understand that server logs are emitted in such cases, but logs are not something end users would want to check regularly. Moreover, since logging is configuration-based, relevant messages may sometimes be skipped or suppressed.

Currently, the way to see the reason for sync skip is LOGs but I think
it is better to add a new column like sync_skip_reason in
pg_replication_slots. This can show the reasons like
standby_LSN_ahead_remote_LSN. I think ideally users can compare
standby's slot LSN/XMIN with remote_slot being synced. Do you have any
better ideas?

I have similar thoughts, but for clarity, I’d like to outline some of the key steps I plan to take:

Step 1: Define an enum for all possible reasons a slot persistence was skipped.

/*
* Reasons why a replication slot sync was skipped.
*/
typedef enum ReplicationSlotSyncSkipReason
{
RS_SYNC_SKIP_NONE = 0, /* No skip */

RS_SYNC_SKIP_REMOTE_BEHIND = (1 << 0), /* Remote slot is behind local reserved LSN */

RS_SYNC_SKIP_DATA_LOSS = (1 << 1), /* Local slot ahead of remote, risk of data loss */

RS_SYNC_SKIP_NO_SNAPSHOT = (1 << 2) /* Standby could not build a consistent snapshot */
} ReplicationSlotSyncSkipReason;

--

I think we should also add the case when "remote_slot->confirmed_lsn >
latestFlushPtr" (WAL corresponding to the confirmed lsn on remote slot
is still not flushed on the Standby). In this case as well we are
skipping the slot sync.

Yes, we can include this case as well.

Step 2: Introduce new column to pg_replication_slots to store the skip reason

/* Inside pg_replication_slots table */
ReplicationSlotSyncSkipReason slot_sync_skip_reason;

--

As per the discussion [1], I think it is more of stat related data and
we should add it in the pg_stat_replication_slots view. Also we can
add columns for 'slot sync skip count' and 'last slot sync skip'.
Thoughts?

It’s not a bad choice, but what makes it a bit confusing for me is that some of the slot sync information is stored in pg_replication_slots, while some is in pg_stat_replication_slots.

How about keeping sync_skip_reason in pg_replication_slots and
sync_skip_count in pg_stat_replication_slots?

--
With Regards,
Amit Kapila.

#9Ashutosh Sharma
ashu.coek88@gmail.com
In reply to: Amit Kapila (#8)
Re: How can end users know the cause of LR slot sync delays?

Hi Amit,

On Sat, Sep 6, 2025 at 10:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Sep 5, 2025 at 12:50 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

Good to hear that you’re also interested in working on this task.

On Thu, Sep 4, 2025 at 8:26 PM Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:

Hi Ashutosh,

I am also interested in this thread. And was working on a patch for it.

On Wed, 3 Sept 2025 at 17:52, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

Hi Amit,

On Thu, Aug 28, 2025 at 3:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Aug 28, 2025 at 11:07 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

We have seen cases where slot synchronization gets delayed, for example when the slot is behind the failover standby or vice versa, and the slot sync worker has to wait for one to catch up with the other. During this waiting period, users querying pg_replication_slots can only see whether the slot has been synchronized or not. If it has already synchronized, that’s fine, but if synchronization is taking longer, users would naturally want to understand the reason for the delay.

Is there a way for end users to know the cause of slot synchronization delays, so they can take appropriate actions to speed it up?

I understand that server logs are emitted in such cases, but logs are not something end users would want to check regularly. Moreover, since logging is configuration-based, relevant messages may sometimes be skipped or suppressed.

Currently, the way to see the reason for sync skip is LOGs but I think
it is better to add a new column like sync_skip_reason in
pg_replication_slots. This can show the reasons like
standby_LSN_ahead_remote_LSN. I think ideally users can compare
standby's slot LSN/XMIN with remote_slot being synced. Do you have any
better ideas?

I have similar thoughts, but for clarity, I’d like to outline some of the key steps I plan to take:

Step 1: Define an enum for all possible reasons a slot persistence was skipped.

/*
* Reasons why a replication slot sync was skipped.
*/
typedef enum ReplicationSlotSyncSkipReason
{
RS_SYNC_SKIP_NONE = 0, /* No skip */

RS_SYNC_SKIP_REMOTE_BEHIND = (1 << 0), /* Remote slot is behind local reserved LSN */

RS_SYNC_SKIP_DATA_LOSS = (1 << 1), /* Local slot ahead of remote, risk of data loss */

RS_SYNC_SKIP_NO_SNAPSHOT = (1 << 2) /* Standby could not build a consistent snapshot */
} ReplicationSlotSyncSkipReason;

--

I think we should also add the case when "remote_slot->confirmed_lsn >
latestFlushPtr" (WAL corresponding to the confirmed lsn on remote slot
is still not flushed on the Standby). In this case as well we are
skipping the slot sync.

Yes, we can include this case as well.

Step 2: Introduce new column to pg_replication_slots to store the skip reason

/* Inside pg_replication_slots table */
ReplicationSlotSyncSkipReason slot_sync_skip_reason;

--

As per the discussion [1], I think it is more of stat related data and
we should add it in the pg_stat_replication_slots view. Also we can
add columns for 'slot sync skip count' and 'last slot sync skip'.
Thoughts?

It’s not a bad choice, but what makes it a bit confusing for me is that some of the slot sync information is stored in pg_replication_slots, while some is in pg_stat_replication_slots.

How about keeping sync_skip_reason in pg_replication_slots and
sync_skip_count in pg_stat_replication_slots?

I think we can do that, since sync_skip_reason appears to be a
descriptive metadata rather than statistical data like
slot_sync_skip_count and last_slot_sync_skip. However, it's also true
that all three pieces of data are transient by nature - they will just
be present in the runtime.

--
With Regards,
Ashutosh Sharma.

#10Ashutosh Sharma
ashu.coek88@gmail.com
In reply to: Ashutosh Sharma (#9)
Re: How can end users know the cause of LR slot sync delays?

Hi Amit,

On Mon, Sep 8, 2025 at 9:52 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

Hi Amit,

On Sat, Sep 6, 2025 at 10:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Sep 5, 2025 at 12:50 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

Good to hear that you’re also interested in working on this task.

On Thu, Sep 4, 2025 at 8:26 PM Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:

Hi Ashutosh,

I am also interested in this thread. And was working on a patch for it.

On Wed, 3 Sept 2025 at 17:52, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

Hi Amit,

On Thu, Aug 28, 2025 at 3:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Aug 28, 2025 at 11:07 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

We have seen cases where slot synchronization gets delayed, for example when the slot is behind the failover standby or vice versa, and the slot sync worker has to wait for one to catch up with the other. During this waiting period, users querying pg_replication_slots can only see whether the slot has been synchronized or not. If it has already synchronized, that’s fine, but if synchronization is taking longer, users would naturally want to understand the reason for the delay.

Is there a way for end users to know the cause of slot synchronization delays, so they can take appropriate actions to speed it up?

I understand that server logs are emitted in such cases, but logs are not something end users would want to check regularly. Moreover, since logging is configuration-based, relevant messages may sometimes be skipped or suppressed.

Currently, the way to see the reason for sync skip is LOGs but I think
it is better to add a new column like sync_skip_reason in
pg_replication_slots. This can show the reasons like
standby_LSN_ahead_remote_LSN. I think ideally users can compare
standby's slot LSN/XMIN with remote_slot being synced. Do you have any
better ideas?

I have similar thoughts, but for clarity, I’d like to outline some of the key steps I plan to take:

Step 1: Define an enum for all possible reasons a slot persistence was skipped.

/*
* Reasons why a replication slot sync was skipped.
*/
typedef enum ReplicationSlotSyncSkipReason
{
RS_SYNC_SKIP_NONE = 0, /* No skip */

RS_SYNC_SKIP_REMOTE_BEHIND = (1 << 0), /* Remote slot is behind local reserved LSN */

RS_SYNC_SKIP_DATA_LOSS = (1 << 1), /* Local slot ahead of remote, risk of data loss */

RS_SYNC_SKIP_NO_SNAPSHOT = (1 << 2) /* Standby could not build a consistent snapshot */
} ReplicationSlotSyncSkipReason;

--

I think we should also add the case when "remote_slot->confirmed_lsn >
latestFlushPtr" (WAL corresponding to the confirmed lsn on remote slot
is still not flushed on the Standby). In this case as well we are
skipping the slot sync.

Yes, we can include this case as well.

Step 2: Introduce new column to pg_replication_slots to store the skip reason

/* Inside pg_replication_slots table */
ReplicationSlotSyncSkipReason slot_sync_skip_reason;

--

As per the discussion [1], I think it is more of stat related data and
we should add it in the pg_stat_replication_slots view. Also we can
add columns for 'slot sync skip count' and 'last slot sync skip'.
Thoughts?

It’s not a bad choice, but what makes it a bit confusing for me is that some of the slot sync information is stored in pg_replication_slots, while some is in pg_stat_replication_slots.

How about keeping sync_skip_reason in pg_replication_slots and
sync_skip_count in pg_stat_replication_slots?

I think we can do that, since sync_skip_reason appears to be a
descriptive metadata rather than statistical data like
slot_sync_skip_count and last_slot_sync_skip. However, it's also true
that all three pieces of data are transient by nature - they will just
be present in the runtime.

After spending some more time on this, I found that maintaining
sync_skip_reason in pg_replication_slots would make the code changes a
bit messy and harder to maintain. I think storing all 3 pieces of
information - sync_skip_reason, sync_skip_count, and last_sync_skip in
pg_stat_replication_slots would be a cleaner solution. This way, all
the sync-related info stays together and the code remains
straightforward.

@Sholk, do let me know if you agree with this?

--
With Regards,
Ashutosh Sharma.

#11Amit Kapila
amit.kapila16@gmail.com
In reply to: Ashutosh Sharma (#10)
Re: How can end users know the cause of LR slot sync delays?

On Fri, Sep 12, 2025 at 1:07 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

On Mon, Sep 8, 2025 at 9:52 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

I think we can do that, since sync_skip_reason appears to be a
descriptive metadata rather than statistical data like
slot_sync_skip_count and last_slot_sync_skip. However, it's also true
that all three pieces of data are transient by nature - they will just
be present in the runtime.

After spending some more time on this, I found that maintaining
sync_skip_reason in pg_replication_slots would make the code changes a
bit messy and harder to maintain.

What exactly is your worry? It seems more logical to have
sync_skip_reason in pg_replication_slots and other two in
pg_stat_replication_slots as the latter is purely a stats view and the
sync_skip_count/last_sync_skip suits there better.

I think storing all 3 pieces of

information - sync_skip_reason, sync_skip_count, and last_sync_skip in
pg_stat_replication_slots would be a cleaner solution. This way, all
the sync-related info stays together and the code remains
straightforward.

Having all the sync information together has merit but don't you think
in this case the sync_skip_reason doesn't seem to be matching with the
existing columns in pg_stat_replication_slots?

--
With Regards,
Amit Kapila.

#12Ashutosh Sharma
ashu.coek88@gmail.com
In reply to: Amit Kapila (#11)
Re: How can end users know the cause of LR slot sync delays?

Hi Amit,

On Fri, Sep 12, 2025 at 4:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Sep 12, 2025 at 1:07 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

On Mon, Sep 8, 2025 at 9:52 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

I think we can do that, since sync_skip_reason appears to be a
descriptive metadata rather than statistical data like
slot_sync_skip_count and last_slot_sync_skip. However, it's also true
that all three pieces of data are transient by nature - they will just
be present in the runtime.

After spending some more time on this, I found that maintaining
sync_skip_reason in pg_replication_slots would make the code changes a
bit messy and harder to maintain.

What exactly is your worry? It seems more logical to have
sync_skip_reason in pg_replication_slots and other two in
pg_stat_replication_slots as the latter is purely a stats view and the
sync_skip_count/last_sync_skip suits there better.

The code changes for adding the skip reason to pg_replication_slots
feel a bit hacky compared to the approach for incorporating all three
pieces of information in pg_stat_replication_slots. I thought many
might prefer simplicity over hackiness, which is why having everything
in pg_stat_replication_slots could be more acceptable. That said, we
could maybe prepare a POC patch with this approach as well, compare
the two, and then decide which path to take.

--
With Regards,
Ashutosh Sharma.

#13Shlok Kyal
shlok.kyal.oss@gmail.com
In reply to: Ashutosh Sharma (#12)
Re: How can end users know the cause of LR slot sync delays?

Hi Amit, Ashutosh,

On Fri, 12 Sept 2025 at 17:28, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

Hi Amit,

On Fri, Sep 12, 2025 at 4:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Sep 12, 2025 at 1:07 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

On Mon, Sep 8, 2025 at 9:52 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

I think we can do that, since sync_skip_reason appears to be a
descriptive metadata rather than statistical data like
slot_sync_skip_count and last_slot_sync_skip. However, it's also true
that all three pieces of data are transient by nature - they will just
be present in the runtime.

After spending some more time on this, I found that maintaining
sync_skip_reason in pg_replication_slots would make the code changes a
bit messy and harder to maintain.

What exactly is your worry? It seems more logical to have
sync_skip_reason in pg_replication_slots and other two in
pg_stat_replication_slots as the latter is purely a stats view and the
sync_skip_count/last_sync_skip suits there better.

The code changes for adding the skip reason to pg_replication_slots
feel a bit hacky compared to the approach for incorporating all three
pieces of information in pg_stat_replication_slots. I thought many
might prefer simplicity over hackiness, which is why having everything
in pg_stat_replication_slots could be more acceptable. That said, we
could maybe prepare a POC patch with this approach as well, compare
the two, and then decide which path to take.

Here are my thought on this:

I believe that if we decide to keep skip_reason in
pg_stat_replication_slot, it should mean "reason of last slot sync
skip" as I think it would make more sense in the sense of statistics.
And if we decide to keep skip_reason in pg_replication_slot, it would
be more appropriate to keep the latest slot data ( It should display
skip_reason only if the current slot sync cycle is skipped).
This is my observation based on the behaviour of current columns in
these views. Thoughts?

I have also attached POC patches for both approaches as per discussion above.
v2_approach1 : It adds all columns 'slot_sync_skip_reason',
'slot_sync_skip_count' and 'last_slot_sync_skip' to
pg_stat_replication_slots
v2_approach2 : It adds column 'slot_sync_skip_reason' to
pg_replication_slots and columns 'slot_sync_skip_count' and
'last_slot_sync_skip' to pg_stat_replication_slots

Thanks,
Shlok Kyal

Attachments:

v2_approach1-0001-Add-stats-related-to-slot-sync-skip.patchapplication/octet-stream; name=v2_approach1-0001-Add-stats-related-to-slot-sync-skip.patchDownload+143-16
v2_approach2-0001-Add-stats-related-to-slot-sync-skip.patchapplication/octet-stream; name=v2_approach2-0001-Add-stats-related-to-slot-sync-skip.patchDownload+188-27
#14Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Shlok Kyal (#13)
RE: How can end users know the cause of LR slot sync delays?

Dear Shlok,

Thanks for creating the patch. Personally I prefer approach2; approach1 cannot
indicate the current status of synchronization, it just shows the history.
I feel approach2 has more information than approach1.

Below are my comments for your patch.

01.
```
+ Number of times the slot sync is skipped.
```

"slot sync" should be "slot synchronization". Same thing can be saied for other
attributes.

02.
```
+ Reason of the last slot sync skip.
```

Possible values must be clarified.

03.
```
+            s.slot_sync_skip_count,
+            s.last_slot_sync_skip,
+                       s.slot_sync_skip_reason,
```

Last line has tab-blank but others have space-blank.

04.
```
+typedef enum SlotSyncSkipReason
+{
+       SLOT_SYNC_SKIP_NONE,            /* No skip */
+       SLOT_SYNC_SKIP_STANDBY_BEHIND,  /* Standby is behind the remote slot */
+       SLOT_SYNC_SKIP_REMOTE_BEHIND,   /* Remote slot is behind the local slot */
+       SLOT_SYNC_SKIP_NO_CONSISTENT_SNAPSHOT   /* Standby could not reach a
+                                                                                        * consistent snapshot */
+}                      SlotSyncSkipReason
```

a.
Can we add comment atop the enum?

b.
SLOT_SYNC_SKIP_STANDBY_BEHIND is misleading; it indicates the standby server has
not received WALs required by slots, but enum does not clarify it.
How about SLOT_SYNC_SKIP_MISSING_WAL_RECORDS or something? Better naming are very
welcome.

c
s/reach/build/.

05.
```
@@ -646,11 +652,13 @@ synchronize_one_slot(RemoteSlot *remote_slot, Oid remote_dbid)
remote_slot->name,
LSN_FORMAT_ARGS(latestFlushPtr)));

-               return false;
+               /* If the slot is not present on the local */
+               if (!(slot = SearchNamedReplicationSlot(remote_slot->name, true)))
+                       return false;
        }
```

Looks like if users try to sync slots via SQL interfaces, the statistics cannot
be updated.

06. update_slot_sync_skip_stats
```
+       /* Update the slot sync reason */
+       SpinLockAcquire(&slot->mutex);
+       slot->slot_sync_skip_reason = skip_reason;
+       SpinLockRelease(&slot->mutex);
```

I feel no need to update the reason if the slot->slot_sync_skip_reason
and skip_reason are the same.

07. synchronize_one_slot
```
+               /*
+                * If standby is behind remote slot and the synced slot is present on
+                * local.
+                */
+               if (remote_slot->confirmed_lsn > latestFlushPtr)
+               {
+                       if (synced)
+                               update_slot_sync_skip_stats(slot, skip_reason);
+                       return false;
+               }
```

This condition exist in the same function; can we combine?

08. GetSlotSyncSkipReason()

Do we have to do pstrdup() here? I found a similar function get_snapbuild_state_desc(),
and it does not use.

09.

Can you consider a test for added codes?

Best regards,
Hayato Kuroda
FUJITSU LIMITED

#15Amit Kapila
amit.kapila16@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#14)
Re: How can end users know the cause of LR slot sync delays?

On Wed, Sep 17, 2025 at 4:24 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Dear Shlok,

Thanks for creating the patch. Personally I prefer approach2; approach1 cannot
indicate the current status of synchronization, it just shows the history.
I feel approach2 has more information than approach1.

I also think so but Ashutosh thought that it would be hacky. Ashutosh,
did you have an opinion on this matter after seeing the patches?

--
With Regards,
Amit Kapila.

#16Ashutosh Sharma
ashu.coek88@gmail.com
In reply to: Amit Kapila (#15)
Re: How can end users know the cause of LR slot sync delays?

Hi Amit,

On Wed, Sep 17, 2025 at 5:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Sep 17, 2025 at 4:24 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Dear Shlok,

Thanks for creating the patch. Personally I prefer approach2; approach1 cannot
indicate the current status of synchronization, it just shows the history.
I feel approach2 has more information than approach1.

I also think so but Ashutosh thought that it would be hacky. Ashutosh,
did you have an opinion on this matter after seeing the patches?

Yes, I’ve looked into both the patches. Approach 1 seems quite
straightforward. In approach 2, we need to pass some additional
arguments to update_local_sync_slot and
update_and_persist_local_synced_slot, which makes it feel a little
less clean compared to approach 1, where we simply add a new function
and call it directly. That said, this is just my view on code
cleanliness, I’m fine with proceeding with approach 2 if that’s
considered the better option.

--
With Regards,
Ashutosh Sharma,

#17Amit Kapila
amit.kapila16@gmail.com
In reply to: Ashutosh Sharma (#16)
Re: How can end users know the cause of LR slot sync delays?

On Wed, Sep 17, 2025 at 8:19 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

On Wed, Sep 17, 2025 at 5:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Sep 17, 2025 at 4:24 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Dear Shlok,

Thanks for creating the patch. Personally I prefer approach2; approach1 cannot
indicate the current status of synchronization, it just shows the history.
I feel approach2 has more information than approach1.

I also think so but Ashutosh thought that it would be hacky. Ashutosh,
did you have an opinion on this matter after seeing the patches?

Yes, I’ve looked into both the patches. Approach 1 seems quite
straightforward. In approach 2, we need to pass some additional
arguments to update_local_sync_slot and
update_and_persist_local_synced_slot, which makes it feel a little
less clean compared to approach 1, where we simply add a new function
and call it directly.

This is because the approach-1 doesn't show the latest value of
sync_status. I mean in the latest cycle if the sync is successful, it
won't update the stats which I am not sure is correct because users
may want to know the recent status of sync cycle. Otherwise, the patch
should be almost the same. I think we can even try to write a patch
for approach-2 without an additional out parameter in some of the
functions.

--
With Regards,
Amit Kapila.

#18Ashutosh Sharma
ashu.coek88@gmail.com
In reply to: Amit Kapila (#17)
Re: How can end users know the cause of LR slot sync delays?

Hi Amit,

On Thu, Sep 18, 2025 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Sep 17, 2025 at 8:19 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

On Wed, Sep 17, 2025 at 5:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Sep 17, 2025 at 4:24 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Dear Shlok,

Thanks for creating the patch. Personally I prefer approach2; approach1 cannot
indicate the current status of synchronization, it just shows the history.
I feel approach2 has more information than approach1.

I also think so but Ashutosh thought that it would be hacky. Ashutosh,
did you have an opinion on this matter after seeing the patches?

Yes, I’ve looked into both the patches. Approach 1 seems quite
straightforward. In approach 2, we need to pass some additional
arguments to update_local_sync_slot and
update_and_persist_local_synced_slot, which makes it feel a little
less clean compared to approach 1, where we simply add a new function
and call it directly.

This is because the approach-1 doesn't show the latest value of
sync_status. I mean in the latest cycle if the sync is successful, it
won't update the stats which I am not sure is correct because users
may want to know the recent status of sync cycle. Otherwise, the patch
should be almost the same.

This should be manageable, no? If we add an additional call to the
stats report function immediately after ReplicationSlotPersist(),
wouldn’t that address the issue? Please correct me if I’m overlooking
something.

@@ -600,6 +600,8 @@ update_and_persist_local_synced_slot(RemoteSlot
*remote_slot, Oid remote_dbid)

ReplicationSlotPersist();

+       pgstat_report_replslot_sync_skip(slot, SLOT_SYNC_SKIP_NONE);
+
        ereport(LOG,
                        errmsg("newly created replication slot \"%s\"
is sync-ready now",
                                   remote_slot->name));

In addition to this, should anyone really need to query the skip
reason if pg_replication_slots already shows that the slot is synced
and not temporary? Ideally, users should check the slot status in
pg_replication_slots, and if it indicates the slot is persisted, there
seems little value in enquiring pg_stat_replication_slots for the skip
reason. That said, it’s important to ensure the information in both
views remains consistent.

I think we can even try to write a patch
for approach-2 without an additional out parameter in some of the
functions.

We can aim for this, if possible.

--
With Regards,
Ashutosh Sharma.

#19Ashutosh Sharma
ashu.coek88@gmail.com
In reply to: Shlok Kyal (#13)
Re: How can end users know the cause of LR slot sync delays?

Hi,

@@ -646,7 +670,11 @@ synchronize_one_slot(RemoteSlot *remote_slot, Oid
remote_dbid)
remote_slot->name,
LSN_FORMAT_ARGS(latestFlushPtr)));

- return false;
+ /* If slot is present on the local, update the slot sync skip stats */
+ if ((slot = SearchNamedReplicationSlot(remote_slot->name, true)))
+ skip_reason = SLOT_SYNC_SKIP_STANDBY_BEHIND;
+ else
+ return false;

With this change, you’re likely enforcing sync slot creation, whereas
earlier that might not have been the case. This introduces a
behavioral change, which may not be well received.

--

I think we can avoid passing skip_reason as a new argument to
update_local_synced_slot(). It only needs to be passed to
update_and_persist_local_synced_slot(). When
update_local_synced_slot() is invoked from within
update_and_persist_local_synced_slot(), we can simply rely on the two
flags, remote_slot_precedes and found_consistent_snapshot and set the
skip_reason accordingly, thoughts?

If update_local_synced_slot is being called from any other place that
means the slot is already persisted.

--

+typedef enum SlotSyncSkipReason
+{
+ SLOT_SYNC_SKIP_NONE, /* No skip */
+ SLOT_SYNC_SKIP_STANDBY_BEHIND, /* Standby is behind the remote slot */
+ SLOT_SYNC_SKIP_REMOTE_BEHIND, /* Remote slot is behind the local slot */
+ SLOT_SYNC_SKIP_NO_CONSISTENT_SNAPSHOT /* Standby could not reach a
+ * consistent snapshot */
+} SlotSyncSkipReason;
+

I would suggest shortening the enum names like maybe SS_SKIP_NONE
instead of SLOT_SYNC_SKIP_NONE.

--
With Regards,
Ashutosh Sharma.

#20Shlok Kyal
shlok.kyal.oss@gmail.com
In reply to: Ashutosh Sharma (#18)
Re: How can end users know the cause of LR slot sync delays?

On Thu, 18 Sept 2025 at 13:17, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

Hi Amit,

On Thu, Sep 18, 2025 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Sep 17, 2025 at 8:19 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

On Wed, Sep 17, 2025 at 5:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Sep 17, 2025 at 4:24 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Dear Shlok,

Thanks for creating the patch. Personally I prefer approach2; approach1 cannot
indicate the current status of synchronization, it just shows the history.
I feel approach2 has more information than approach1.

I also think so but Ashutosh thought that it would be hacky. Ashutosh,
did you have an opinion on this matter after seeing the patches?

Yes, I’ve looked into both the patches. Approach 1 seems quite
straightforward. In approach 2, we need to pass some additional
arguments to update_local_sync_slot and
update_and_persist_local_synced_slot, which makes it feel a little
less clean compared to approach 1, where we simply add a new function
and call it directly.

This is because the approach-1 doesn't show the latest value of
sync_status. I mean in the latest cycle if the sync is successful, it
won't update the stats which I am not sure is correct because users
may want to know the recent status of sync cycle. Otherwise, the patch
should be almost the same.

This should be manageable, no? If we add an additional call to the
stats report function immediately after ReplicationSlotPersist(),
wouldn’t that address the issue? Please correct me if I’m overlooking
something.

@@ -600,6 +600,8 @@ update_and_persist_local_synced_slot(RemoteSlot
*remote_slot, Oid remote_dbid)

ReplicationSlotPersist();

+       pgstat_report_replslot_sync_skip(slot, SLOT_SYNC_SKIP_NONE);
+
ereport(LOG,
errmsg("newly created replication slot \"%s\"
is sync-ready now",
remote_slot->name));

Currently, in this code change, the skip_reason is updated to 'none'
only when the slot state changes from temporary to persistent.
However, this logic does not handle cases where the slot is already a
persistent sync slot.
I believe we should also sync skip can happen for persistent slots.
That means, on a successful slot sync, we should update the
skip_reason to 'none' even for slots that are already persistent.

In addition to this, should anyone really need to query the skip
reason if pg_replication_slots already shows that the slot is synced
and not temporary? Ideally, users should check the slot status in
pg_replication_slots, and if it indicates the slot is persisted, there
seems little value in enquiring pg_stat_replication_slots for the skip
reason. That said, it’s important to ensure the information in both
views remains consistent.

I have a doubt. Why don't we want to report the sync skip reason once
the slots are persisted?
for the case:

latestFlushPtr = GetStandbyFlushRecPtr(NULL);
if (remote_slot->confirmed_lsn > latestFlushPtr)
{
/*
* Can get here only if GUC 'synchronized_standby_slots' on the
* primary server was not configured correctly.
*/
ereport(AmLogicalSlotSyncWorkerProcess() ? LOG : ERROR,
errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("skipping slot synchronization because the
received slot sync"
" LSN %X/%08X for slot \"%s\" is ahead of the
standby position %X/%08X",
LSN_FORMAT_ARGS(remote_slot->confirmed_lsn),
remote_slot->name,
LSN_FORMAT_ARGS(latestFlushPtr)));

return false;
}

Slot sync skip can happen even for persistent slots. So why should we
avoid displaying the skip reason in such cases?
I checked that if the synchronized_standby_slots GUC is not set
properly, we can hit this condition even for persistent slots.
I think we should still display the skip reason if the user does not
configure this GUC as expected.

I think we can even try to write a patch
for approach-2 without an additional out parameter in some of the
functions.

We can aim for this, if possible.

Thanks,
Shlok Kyal

#21Amit Kapila
amit.kapila16@gmail.com
In reply to: Shlok Kyal (#20)
#22Shlok Kyal
shlok.kyal.oss@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#14)
#23Shlok Kyal
shlok.kyal.oss@gmail.com
In reply to: Amit Kapila (#17)
#24Shlok Kyal
shlok.kyal.oss@gmail.com
In reply to: Ashutosh Sharma (#19)
#25Shlok Kyal
shlok.kyal.oss@gmail.com
In reply to: Shlok Kyal (#22)
#26Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Shlok Kyal (#25)
#27Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Hayato Kuroda (Fujitsu) (#26)
#28Shlok Kyal
shlok.kyal.oss@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#26)
#29Shlok Kyal
shlok.kyal.oss@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#27)
#30Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Shlok Kyal (#28)
#31Shlok Kyal
shlok.kyal.oss@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#30)
#32Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Shlok Kyal (#31)
#33shveta malik
shveta.malik@gmail.com
In reply to: Shlok Kyal (#31)
#34shveta malik
shveta.malik@gmail.com
In reply to: shveta malik (#33)
#35shveta malik
shveta.malik@gmail.com
In reply to: shveta malik (#34)
#36Shlok Kyal
shlok.kyal.oss@gmail.com
In reply to: shveta malik (#35)
#37Shlok Kyal
shlok.kyal.oss@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#32)
#38Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Shlok Kyal (#36)
#39Shlok Kyal
shlok.kyal.oss@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#38)
#40shveta malik
shveta.malik@gmail.com
In reply to: Shlok Kyal (#39)
#41Amit Kapila
amit.kapila16@gmail.com
In reply to: shveta malik (#40)
#42shveta malik
shveta.malik@gmail.com
In reply to: Amit Kapila (#41)
#43Amit Kapila
amit.kapila16@gmail.com
In reply to: shveta malik (#42)
#44Shlok Kyal
shlok.kyal.oss@gmail.com
In reply to: shveta malik (#42)
#45Shlok Kyal
shlok.kyal.oss@gmail.com
In reply to: Amit Kapila (#41)
#46Shlok Kyal
shlok.kyal.oss@gmail.com
In reply to: Amit Kapila (#43)
#47Shlok Kyal
shlok.kyal.oss@gmail.com
In reply to: Shlok Kyal (#44)
#48Amit Kapila
amit.kapila16@gmail.com
In reply to: Shlok Kyal (#47)
#49shveta malik
shveta.malik@gmail.com
In reply to: Shlok Kyal (#47)
#50Amit Kapila
amit.kapila16@gmail.com
In reply to: shveta malik (#49)
#51Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Amit Kapila (#50)
#52Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Zhijie Hou (Fujitsu) (#51)
#53Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Hayato Kuroda (Fujitsu) (#52)
#54Amit Kapila
amit.kapila16@gmail.com
In reply to: Zhijie Hou (Fujitsu) (#53)
#55Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Amit Kapila (#54)
#56Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Zhijie Hou (Fujitsu) (#55)
#57Shlok Kyal
shlok.kyal.oss@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#56)
#58shveta malik
shveta.malik@gmail.com
In reply to: Shlok Kyal (#57)
#59Shlok Kyal
shlok.kyal.oss@gmail.com
In reply to: shveta malik (#58)
#60shveta malik
shveta.malik@gmail.com
In reply to: Shlok Kyal (#59)
#61Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Shlok Kyal (#59)
#62Shlok Kyal
shlok.kyal.oss@gmail.com
In reply to: Zhijie Hou (Fujitsu) (#61)
#63Shlok Kyal
shlok.kyal.oss@gmail.com
In reply to: Shlok Kyal (#62)
#64shveta malik
shveta.malik@gmail.com
In reply to: Shlok Kyal (#63)
#65Shlok Kyal
shlok.kyal.oss@gmail.com
In reply to: shveta malik (#64)
#66Michael Banck
michael.banck@credativ.de
In reply to: shveta malik (#42)
#67Michael Banck
michael.banck@credativ.de
In reply to: Shlok Kyal (#65)
#68Amit Kapila
amit.kapila16@gmail.com
In reply to: Shlok Kyal (#31)
#69Michael Banck
michael.banck@credativ.de
In reply to: Amit Kapila (#68)
#70Amit Kapila
amit.kapila16@gmail.com
In reply to: shveta malik (#34)
#71Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#70)
#72Shlok Kyal
shlok.kyal.oss@gmail.com
In reply to: Dilip Kumar (#71)
#73Michael Banck
michael.banck@credativ.de
In reply to: Shlok Kyal (#72)
#74Amit Kapila
amit.kapila16@gmail.com
In reply to: Shlok Kyal (#39)
#75Amit Kapila
amit.kapila16@gmail.com
In reply to: Zhijie Hou (Fujitsu) (#55)