Fix lag columns in pg_stat_replication not advancing when replay LSN stalls

Started by Fujii Masao · 3 months ago · 16 messages
#1 Fujii Masao
masao.fujii@gmail.com
1 attachment(s)

Hi,

While testing, I noticed that write_lag and flush_lag in pg_stat_replication
initially advanced but eventually stopped updating. This happened when
I started pg_receivewal, ran pgbench, and periodically monitored
pg_stat_replication.

My analysis shows that this issue occurs when any of the write, flush,
or replay LSNs in the standby’s feedback message stop updating for some time.
In the case of pg_receivewal, the replay LSN is always invalid (never updated),
which triggers the problem. Similarly, in regular streaming replication,
if the replay LSN remains unchanged for a long time—such as during
a recovery conflict—the lag values for both write and flush can stop advancing.

The root cause seems to be that when any of the LSNs stop updating,
the lag tracker's cyclic buffer becomes full (the write head reaches
the slowest read head). In this situation, LagTrackerWrite() and
LagTrackerRead() didn't handle the full-buffer condition properly.
For instance, if the replay LSN stalls, the buffer fills up and the read heads
for "write" and "flush" end up at the same position as the write head.
This causes LagTrackerRead() to return -1 for both, preventing write_lag
and flush_lag from advancing.
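
To make the collision concrete, here is a stripped-down model of the
pre-patch behavior (toy names and sizes of my own, not the actual
walsender.c code): once the permanently stalled reader pins the ring,
the write head stops advancing, the faster readers drain up to it,
and their reads start returning -1:

```
#include <stdio.h>

#define RING_SIZE 8
#define NUM_READERS 3           /* write, flush, replay */

static int write_head;
static int read_heads[NUM_READERS];

static void
toy_write(void)
{
    int new_write_head = (write_head + 1) % RING_SIZE;

    /*
     * Pre-patch behavior, simplified: if advancing would collide with
     * any read head, the last sample is just rewritten in place, so
     * the write head never moves past a stalled reader.
     */
    for (int i = 0; i < NUM_READERS; i++)
        if (new_write_head == read_heads[i])
            return;
    write_head = new_write_head;
}

static int
toy_read(int head)
{
    /* No unread sample between this read head and the write head. */
    if (read_heads[head] == write_head)
        return -1;
    read_heads[head] = (read_heads[head] + 1) % RING_SIZE;
    return 0;
}

int
main(void)
{
    /* Reader 2 (replay) never advances, as with pg_receivewal. */
    for (int t = 0; t < 100; t++)
    {
        toy_write();
        toy_read(0);            /* "write" reader keeps up */
        toy_read(1);            /* "flush" reader keeps up */
    }
    printf("write lag sample: %d\n", toy_read(0));  /* prints -1 */
    return 0;
}
```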

The attached patch fixes the problem by treating the slowest read entry
(the one causing the buffer to fill up) as a separate overflow entry,
allowing the lag tracker to continue operating correctly.

Thoughts?

Regards,

--
Fujii Masao

Attachments:

v1-0001-Fix-lag-columns-in-pg_stat_replication-not-advanc.patch (application/octet-stream)
From 47995afe3af628c5f33eba2dbdbabd50f9f7a4d7 Mon Sep 17 00:00:00 2001
From: Fujii Masao <fujii@postgresql.org>
Date: Fri, 17 Oct 2025 12:52:18 +0900
Subject: [PATCH v1] Fix lag columns in pg_stat_replication not advancing when
 replay LSN stalls.

---
 src/backend/replication/walsender.c | 45 ++++++++++++++++++-----------
 1 file changed, 28 insertions(+), 17 deletions(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 59822f22b8d..d3ef136129f 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -233,6 +233,7 @@ typedef struct
 	int			write_head;
 	int			read_heads[NUM_SYNC_REP_WAIT_MODE];
 	WalTimeSample last_read[NUM_SYNC_REP_WAIT_MODE];
+	WalTimeSample overflowed[NUM_SYNC_REP_WAIT_MODE];
 } LagTracker;
 
 static LagTracker *lag_tracker;
@@ -4207,7 +4208,6 @@ WalSndKeepaliveIfNecessary(void)
 static void
 LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time)
 {
-	bool		buffer_full;
 	int			new_write_head;
 	int			i;
 
@@ -4229,25 +4229,19 @@ LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time)
 	 * of space.
 	 */
 	new_write_head = (lag_tracker->write_head + 1) % LAG_TRACKER_BUFFER_SIZE;
-	buffer_full = false;
 	for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; ++i)
 	{
+		/*
+		 * If the buffer is full, move the slowest reader to a separate
+		 * overflow entry and free its space in the buffer so the write head
+		 * can advance.
+		 */
 		if (new_write_head == lag_tracker->read_heads[i])
-			buffer_full = true;
-	}
-
-	/*
-	 * If the buffer is full, for now we just rewind by one slot and overwrite
-	 * the last sample, as a simple (if somewhat uneven) way to lower the
-	 * sampling rate.  There may be better adaptive compaction algorithms.
-	 */
-	if (buffer_full)
-	{
-		new_write_head = lag_tracker->write_head;
-		if (lag_tracker->write_head > 0)
-			lag_tracker->write_head--;
-		else
-			lag_tracker->write_head = LAG_TRACKER_BUFFER_SIZE - 1;
+		{
+			lag_tracker->overflowed[i] =
+				lag_tracker->buffer[lag_tracker->read_heads[i]];
+			lag_tracker->read_heads[i] = -1;
+		}
 	}
 
 	/* Store a sample at the current write head position. */
@@ -4274,6 +4268,23 @@ LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now)
 {
 	TimestampTz time = 0;
 
+	/*
+	 * If 'lsn' has not passed the WAL position stored in the overflow entry,
+	 * return the elapsed time (in microseconds) since the saved local flush
+	 * time. Otherwise, resume using the buffer to control the read head and
+	 * compute the elapsed time.
+	 */
+	if (lag_tracker->read_heads[head] == -1)
+	{
+		if (lag_tracker->overflowed[head].lsn > lsn)
+			return now - lag_tracker->overflowed[head].time;
+
+		time = lag_tracker->overflowed[head].time;
+		lag_tracker->last_read[head] = lag_tracker->overflowed[head];
+		lag_tracker->read_heads[head] =
+			(lag_tracker->write_head + 1) % LAG_TRACKER_BUFFER_SIZE;
+	}
+
 	/* Read all unread samples up to this LSN or end of buffer. */
 	while (lag_tracker->read_heads[head] != lag_tracker->write_head &&
 		   lag_tracker->buffer[lag_tracker->read_heads[head]].lsn <= lsn)
-- 
2.50.1

#2 Chao Li
li.evan.chao@gmail.com
In reply to: Fujii Masao (#1)
Re: Fix lag columns in pg_stat_replication not advancing when replay LSN stalls

On Oct 17, 2025, at 11:56, Fujii Masao <masao.fujii@gmail.com> wrote:

Hi,

While testing, I noticed that write_lag and flush_lag in pg_stat_replication
initially advanced but eventually stopped updating. This happened when
I started pg_receivewal, ran pgbench, and periodically monitored
pg_stat_replication.

My analysis shows that this issue occurs when any of the write, flush,
or replay LSNs in the standby’s feedback message stop updating for some time.
In the case of pg_receivewal, the replay LSN is always invalid (never updated),
which triggers the problem. Similarly, in regular streaming replication,
if the replay LSN remains unchanged for a long time—such as during
a recovery conflict—the lag values for both write and flush can stop advancing.

The root cause seems to be that when any of the LSNs stop updating,
the lag tracker's cyclic buffer becomes full (the write head reaches
the slowest read head). In this situation, LagTrackerWrite() and
LagTrackerRead() didn't handle the full-buffer condition properly.
For instance, if the replay LSN stalls, the buffer fills up and the read heads
for "write" and "flush" end up at the same position as the write head.
This causes LagTrackerRead() to return -1 for both, preventing write_lag
and flush_lag from advancing.

The attached patch fixes the problem by treating the slowest read entry
(the one causing the buffer to fill up) as a separate overflow entry,
allowing the lag tracker to continue operating correctly.

--
Fujii Masao
<v1-0001-Fix-lag-columns-in-pg_stat_replication-not-advanc.patch>

It took me some time to understand this fix. What confused me most was how a read head catches up again once it overflows. Finally I figured it out:

```
+		lag_tracker->read_heads[head] =
+			(lag_tracker->write_head + 1) % LAG_TRACKER_BUFFER_SIZE;
```

"(lag_tracker->write_head + 1) % LAG_TRACKER_BUFFER_SIZE” points to the oldest LSN in the ring, from where an overflowed reader head starts to catch up.

I have no comment on the code change. Nice patch!

The only thing I wonder is whether we can add a TAP test for this fix.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

#3 Fujii Masao
masao.fujii@gmail.com
In reply to: Chao Li (#2)
1 attachment(s)
Re: Fix lag columns in pg_stat_replication not advancing when replay LSN stalls

On Fri, Oct 17, 2025 at 5:11 PM Chao Li <li.evan.chao@gmail.com> wrote:

It took me some time to understand this fix. What confused me most was how a read head catches up again once it overflows. Finally I figured it out:

```
+               lag_tracker->read_heads[head] =
+                       (lag_tracker->write_head + 1) % LAG_TRACKER_BUFFER_SIZE;
```

"(lag_tracker->write_head + 1) % LAG_TRACKER_BUFFER_SIZE” points to the oldest LSN in the ring, from where an overflowed reader head starts to catch up.

I have no comment on the code change. Nice patch!

Thanks for the review!

I've updated the source comment to make the code easier to understand.
The updated patch is attached.

The only thing I wonder is whether we can add a TAP test for this fix.

I think it would be good to add a test for this fix, but reproducing
the condition where the buffer fills up and the slowest read entry
overflows takes time. Because of that, I'm not sure adding such a
potentially slow test is a good idea.

Regards,

--
Fujii Masao

Attachments:

v2-0001-Fix-stalled-lag-columns-in-pg_stat_replication-wh.patch (application/octet-stream)
From ccbd179e58faf5fd816d85f134a783d51dcde3b8 Mon Sep 17 00:00:00 2001
From: Fujii Masao <fujii@postgresql.org>
Date: Fri, 17 Oct 2025 20:22:40 +0900
Subject: [PATCH v2] Fix stalled lag columns in pg_stat_replication when replay
 LSN stops advancing.

Previously, when the replay LSN reported in feedback messages from a standby
stopped advancing, for example, due to a recovery conflict, the write_lag and
flush_lag columns in pg_stat_replication would initially update but then stop
progressing. This prevented users from correctly monitoring replication lag.

The problem occurred because when any LSN stopped updating, the lag tracker's
cyclic buffer became full (the write head reached the slowest read head).
In that state, the lag tracker could no longer compute round-trip lag values
correctly.

This commit fixes the issue by handling the slowest read entry (the one
causing the buffer to fill) as a separate overflow entry and freeing space
so the write and other read heads can continue advancing in the buffer.
As a result, write_lag and flush_lag now continue updating even if the reported
replay LSN remains stalled.

Backpatch to all supported versions.

Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Chao Li <lic@highgo.com>
Discussion: https://postgr.es/m/CAHGQGwGdGQ=1-X-71Caee-LREBUXSzyohkoQJd4yZZCMt24C0g@mail.gmail.com
Backpatch-through: 13
---
 src/backend/replication/walsender.c | 46 ++++++++++++++++++-----------
 1 file changed, 29 insertions(+), 17 deletions(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 59822f22b8d..a75c69b24c3 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -233,6 +233,7 @@ typedef struct
 	int			write_head;
 	int			read_heads[NUM_SYNC_REP_WAIT_MODE];
 	WalTimeSample last_read[NUM_SYNC_REP_WAIT_MODE];
+	WalTimeSample overflowed[NUM_SYNC_REP_WAIT_MODE];
 } LagTracker;
 
 static LagTracker *lag_tracker;
@@ -4207,7 +4208,6 @@ WalSndKeepaliveIfNecessary(void)
 static void
 LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time)
 {
-	bool		buffer_full;
 	int			new_write_head;
 	int			i;
 
@@ -4229,25 +4229,19 @@ LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time)
 	 * of space.
 	 */
 	new_write_head = (lag_tracker->write_head + 1) % LAG_TRACKER_BUFFER_SIZE;
-	buffer_full = false;
 	for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; ++i)
 	{
+		/*
+		 * If the buffer is full, move the slowest reader to a separate
+		 * overflow entry and free its space in the buffer so the write head
+		 * can advance.
+		 */
 		if (new_write_head == lag_tracker->read_heads[i])
-			buffer_full = true;
-	}
-
-	/*
-	 * If the buffer is full, for now we just rewind by one slot and overwrite
-	 * the last sample, as a simple (if somewhat uneven) way to lower the
-	 * sampling rate.  There may be better adaptive compaction algorithms.
-	 */
-	if (buffer_full)
-	{
-		new_write_head = lag_tracker->write_head;
-		if (lag_tracker->write_head > 0)
-			lag_tracker->write_head--;
-		else
-			lag_tracker->write_head = LAG_TRACKER_BUFFER_SIZE - 1;
+		{
+			lag_tracker->overflowed[i] =
+				lag_tracker->buffer[lag_tracker->read_heads[i]];
+			lag_tracker->read_heads[i] = -1;
+		}
 	}
 
 	/* Store a sample at the current write head position. */
@@ -4274,6 +4268,24 @@ LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now)
 {
 	TimestampTz time = 0;
 
+	/*
+	 * If 'lsn' has not passed the WAL position stored in the overflow entry,
+	 * return the elapsed time (in microseconds) since the saved local flush
+	 * time. Otherwise, switch back to using the buffer to control the read
+	 * head and compute the elapsed time.  The read head is then reset to
+	 * point to the oldest entry in the buffer.
+	 */
+	if (lag_tracker->read_heads[head] == -1)
+	{
+		if (lag_tracker->overflowed[head].lsn > lsn)
+			return now - lag_tracker->overflowed[head].time;
+
+		time = lag_tracker->overflowed[head].time;
+		lag_tracker->last_read[head] = lag_tracker->overflowed[head];
+		lag_tracker->read_heads[head] =
+			(lag_tracker->write_head + 1) % LAG_TRACKER_BUFFER_SIZE;
+	}
+
 	/* Read all unread samples up to this LSN or end of buffer. */
 	while (lag_tracker->read_heads[head] != lag_tracker->write_head &&
 		   lag_tracker->buffer[lag_tracker->read_heads[head]].lsn <= lsn)
-- 
2.50.1

#4 Chao Li
li.evan.chao@gmail.com
In reply to: Fujii Masao (#3)
Re: Fix lag columns in pg_stat_replication not advancing when replay LSN stalls

On Oct 17, 2025, at 22:28, Fujii Masao <masao.fujii@gmail.com> wrote:

On Fri, Oct 17, 2025 at 5:11 PM Chao Li <li.evan.chao@gmail.com> wrote:

It took me some time to understand this fix. What confused me most was how a read head catches up again once it overflows. Finally I figured it out:

```
+               lag_tracker->read_heads[head] =
+                       (lag_tracker->write_head + 1) % LAG_TRACKER_BUFFER_SIZE;
```

"(lag_tracker->write_head + 1) % LAG_TRACKER_BUFFER_SIZE” points to the oldest LSN in the ring, from where an overflowed reader head starts to catch up.

I have no comment on the code change. Nice patch!

Thanks for the review!

I've updated the source comment to make the code easier to understand.
The updated patch is attached.

<v2-0001-Fix-stalled-lag-columns-in-pg_stat_replication-wh.patch>

Thanks for adding the comment.

I think I put all my concentration on the big picture yesterday, so I overlooked this implementation detail:

```
+		if (lag_tracker->overflowed[head].lsn > lsn)
+			return now - lag_tracker->overflowed[head].time;
```

Should this ">" be ">="?

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

#5 Shinya Kato
shinya11.kato@gmail.com
In reply to: Fujii Masao (#1)
Re: Fix lag columns in pg_stat_replication not advancing when replay LSN stalls

Hi,

On Fri, Oct 17, 2025 at 12:57 PM Fujii Masao <masao.fujii@gmail.com> wrote:

Hi,

While testing, I noticed that write_lag and flush_lag in pg_stat_replication
initially advanced but eventually stopped updating. This happened when
I started pg_receivewal, ran pgbench, and periodically monitored
pg_stat_replication.

Nice catch! I reproduced the same issue.

My analysis shows that this issue occurs when any of the write, flush,
or replay LSNs in the standby’s feedback message stop updating for some time.
In the case of pg_receivewal, the replay LSN is always invalid (never updated),
which triggers the problem. Similarly, in regular streaming replication,
if the replay LSN remains unchanged for a long time—such as during
a recovery conflict—the lag values for both write and flush can stop advancing.

The root cause seems to be that when any of the LSNs stop updating,
the lag tracker's cyclic buffer becomes full (the write head reaches
the slowest read head). In this situation, LagTrackerWrite() and
LagTrackerRead() didn't handle the full-buffer condition properly.
For instance, if the replay LSN stalls, the buffer fills up and the read heads
for "write" and "flush" end up at the same position as the write head.
This causes LagTrackerRead() to return -1 for both, preventing write_lag
and flush_lag from advancing.

The attached patch fixes the problem by treating the slowest read entry
(the one causing the buffer to fill up) as a separate overflow entry,
allowing the lag tracker to continue operating correctly.

Thank you for the patch. I have one comment.

+       if (lag_tracker->overflowed[head].lsn > lsn)
+           return now - lag_tracker->overflowed[head].time;

Could this return a negative value if the clock somehow went
backwards? The original code returns -1 in this case, so I'm curious
about this.

--
Best regards,
Shinya Kato
NTT OSS Center

#6 Fujii Masao
masao.fujii@gmail.com
In reply to: Chao Li (#4)
Re: Fix lag columns in pg_stat_replication not advancing when replay LSN stalls

On Sat, Oct 18, 2025 at 9:16 AM Chao Li <li.evan.chao@gmail.com> wrote:

I think I put all my concentration on the big picture yesterday, so I overlooked this implementation detail:

```
+               if (lag_tracker->overflowed[head].lsn > lsn)
+                       return now - lag_tracker->overflowed[head].time;
```

Should this ">" be ">="?

I think either ">" or ">=" would work. However, the following loop
in LagTrackerRead() advances the read head when the LSN
in the current buffer entry is equal to the given LSN, so I followed
the same logic for the overflow entry and used ">" there.

Regards,

--
Fujii Masao

#7 Fujii Masao
masao.fujii@gmail.com
In reply to: Shinya Kato (#5)
1 attachment(s)
Re: Fix lag columns in pg_stat_replication not advancing when replay LSN stalls

On Sun, Oct 19, 2025 at 2:04 AM Shinya Kato <shinya11.kato@gmail.com> wrote:

Thank you for the patch. I have one comment.

+       if (lag_tracker->overflowed[head].lsn > lsn)
+           return now - lag_tracker->overflowed[head].time;

Could this return a negative value if the clock somehow went
backwards? The original code returns -1 in this case, so I'm curious
about this.

Thanks for the review!

Yes, you're right. So I've updated the patch so that -1 is returned
when the current time is earlier than the time in the overflow entry,
treating it as "no new sample found".

Regards,

--
Fujii Masao

Attachments:

v3-0001-Fix-stalled-lag-columns-in-pg_stat_replication-wh.patch (application/octet-stream)
From 9f8868bce0077de0a200074184a887cdcfcca346 Mon Sep 17 00:00:00 2001
From: Fujii Masao <fujii@postgresql.org>
Date: Mon, 20 Oct 2025 09:26:48 +0900
Subject: [PATCH v3] Fix stalled lag columns in pg_stat_replication when replay
 LSN stops advancing.

Previously, when the replay LSN reported in feedback messages from a standby
stopped advancing, for example, due to a recovery conflict, the write_lag and
flush_lag columns in pg_stat_replication would initially update but then stop
progressing. This prevented users from correctly monitoring replication lag.

The problem occurred because when any LSN stopped updating, the lag tracker's
cyclic buffer became full (the write head reached the slowest read head).
In that state, the lag tracker could no longer compute round-trip lag values
correctly.

This commit fixes the issue by handling the slowest read entry (the one
causing the buffer to fill) as a separate overflow entry and freeing space
so the write and other read heads can continue advancing in the buffer.
As a result, write_lag and flush_lag now continue updating even if the reported
replay LSN remains stalled.

Backpatch to all supported versions.

Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Chao Li <lic@highgo.com>
Reviewed-by: Shinya Kato <shinya11.kato@gmail.com>
Discussion: https://postgr.es/m/CAHGQGwGdGQ=1-X-71Caee-LREBUXSzyohkoQJd4yZZCMt24C0g@mail.gmail.com
Backpatch-through: 13
---
 src/backend/replication/walsender.c | 50 +++++++++++++++++++----------
 1 file changed, 33 insertions(+), 17 deletions(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 59822f22b8d..1ce21a2ad98 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -233,6 +233,7 @@ typedef struct
 	int			write_head;
 	int			read_heads[NUM_SYNC_REP_WAIT_MODE];
 	WalTimeSample last_read[NUM_SYNC_REP_WAIT_MODE];
+	WalTimeSample overflowed[NUM_SYNC_REP_WAIT_MODE];
 } LagTracker;
 
 static LagTracker *lag_tracker;
@@ -4207,7 +4208,6 @@ WalSndKeepaliveIfNecessary(void)
 static void
 LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time)
 {
-	bool		buffer_full;
 	int			new_write_head;
 	int			i;
 
@@ -4229,25 +4229,19 @@ LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time)
 	 * of space.
 	 */
 	new_write_head = (lag_tracker->write_head + 1) % LAG_TRACKER_BUFFER_SIZE;
-	buffer_full = false;
 	for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; ++i)
 	{
+		/*
+		 * If the buffer is full, move the slowest reader to a separate
+		 * overflow entry and free its space in the buffer so the write head
+		 * can advance.
+		 */
 		if (new_write_head == lag_tracker->read_heads[i])
-			buffer_full = true;
-	}
-
-	/*
-	 * If the buffer is full, for now we just rewind by one slot and overwrite
-	 * the last sample, as a simple (if somewhat uneven) way to lower the
-	 * sampling rate.  There may be better adaptive compaction algorithms.
-	 */
-	if (buffer_full)
-	{
-		new_write_head = lag_tracker->write_head;
-		if (lag_tracker->write_head > 0)
-			lag_tracker->write_head--;
-		else
-			lag_tracker->write_head = LAG_TRACKER_BUFFER_SIZE - 1;
+		{
+			lag_tracker->overflowed[i] =
+				lag_tracker->buffer[lag_tracker->read_heads[i]];
+			lag_tracker->read_heads[i] = -1;
+		}
 	}
 
 	/* Store a sample at the current write head position. */
@@ -4274,6 +4268,28 @@ LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now)
 {
 	TimestampTz time = 0;
 
+	/*
+	 * If 'lsn' has not passed the WAL position stored in the overflow entry,
+	 * return the elapsed time (in microseconds) since the saved local flush
+	 * time. If the flush time is in the future (due to clock drift), return
+	 * -1 to treat as no valid sample.
+	 *
+	 * Otherwise, switch back to using the buffer to control the read head and
+	 * compute the elapsed time.  The read head is then reset to point to the
+	 * oldest entry in the buffer.
+	 */
+	if (lag_tracker->read_heads[head] == -1)
+	{
+		if (lag_tracker->overflowed[head].lsn > lsn)
+			return (now >= lag_tracker->overflowed[head].time) ?
+				now - lag_tracker->overflowed[head].time : -1;
+
+		time = lag_tracker->overflowed[head].time;
+		lag_tracker->last_read[head] = lag_tracker->overflowed[head];
+		lag_tracker->read_heads[head] =
+			(lag_tracker->write_head + 1) % LAG_TRACKER_BUFFER_SIZE;
+	}
+
 	/* Read all unread samples up to this LSN or end of buffer. */
 	while (lag_tracker->read_heads[head] != lag_tracker->write_head &&
 		   lag_tracker->buffer[lag_tracker->read_heads[head]].lsn <= lsn)
-- 
2.50.1

#8 Fujii Masao
masao.fujii@gmail.com
In reply to: Fujii Masao (#6)
Re: Fix lag columns in pg_stat_replication not advancing when replay LSN stalls

On Mon, Oct 20, 2025 at 10:12 AM Fujii Masao <masao.fujii@gmail.com> wrote:

On Sat, Oct 18, 2025 at 9:16 AM Chao Li <li.evan.chao@gmail.com> wrote:

I think I put all my concentration on the big picture yesterday, so I overlooked this implementation detail:

```
+               if (lag_tracker->overflowed[head].lsn > lsn)
+                       return now - lag_tracker->overflowed[head].time;
```

Should this ">" be ">="?

I think either ">" or ">=" would work. However, the following loop
in LagTrackerRead() advances the read head when the LSN
in the current buffer entry is equal to the given LSN

I forgot to mention which loop actually advances the read head in that case.
It's the following code in LagTrackerRead().

------------------------
    /* Read all unread samples up to this LSN or end of buffer. */
    while (lag_tracker->read_heads[head] != lag_tracker->write_head &&
           lag_tracker->buffer[lag_tracker->read_heads[head]].lsn <= lsn)
    {
        time = lag_tracker->buffer[lag_tracker->read_heads[head]].time;
        lag_tracker->last_read[head] =
            lag_tracker->buffer[lag_tracker->read_heads[head]];
        lag_tracker->read_heads[head] =
            (lag_tracker->read_heads[head] + 1) % LAG_TRACKER_BUFFER_SIZE;
    }
------------------------

Regards,

--
Fujii Masao

#9 Shinya Kato
shinya11.kato@gmail.com
In reply to: Fujii Masao (#7)
Re: Fix lag columns in pg_stat_replication not advancing when replay LSN stalls

On Mon, Oct 20, 2025 at 10:17 AM Fujii Masao <masao.fujii@gmail.com> wrote:

On Sun, Oct 19, 2025 at 2:04 AM Shinya Kato <shinya11.kato@gmail.com> wrote:

Thank you for the patch. I have one comment.

+       if (lag_tracker->overflowed[head].lsn > lsn)
+           return now - lag_tracker->overflowed[head].time;

Could this return a negative value if the clock somehow went
backwards? The original code returns -1 in this case, so I'm curious
about this.

Thanks for the review!

Yes, you're right. So I've updated the patch so that -1 is returned
when the current time is earlier than the time in the overflow entry,
treating it as "no new sample found".

Thank you for updating the patch. It looks nice to me.

--
Best regards,
Shinya Kato
NTT OSS Center

#10 Fujii Masao
masao.fujii@gmail.com
In reply to: Shinya Kato (#9)
Re: Fix lag columns in pg_stat_replication not advancing when replay LSN stalls

On Mon, Oct 20, 2025 at 1:45 PM Shinya Kato <shinya11.kato@gmail.com> wrote:

On Mon, Oct 20, 2025 at 10:17 AM Fujii Masao <masao.fujii@gmail.com> wrote:

On Sun, Oct 19, 2025 at 2:04 AM Shinya Kato <shinya11.kato@gmail.com> wrote:

Thank you for the patch. I have one comment.

+       if (lag_tracker->overflowed[head].lsn > lsn)
+           return now - lag_tracker->overflowed[head].time;

Could this return a negative value if the clock somehow went
backwards? The original code returns -1 in this case, so I'm curious
about this.

Thanks for the review!

Yes, you're right. So I've updated the patch so that -1 is returned
when the current time is earlier than the time in the overflow entry,
treating it as "no new sample found".

Thank you for updating the patch. It looks nice to me.

Thanks for the review!
Unless there are any objections, I'll commit the patch and
backpatch it to all supported branches.

Regards,

--
Fujii Masao

#11 Xuneng Zhou
xunengzhou@gmail.com
In reply to: Fujii Masao (#10)
Re: Fix lag columns in pg_stat_replication not advancing when replay LSN stalls

Hi Fujii-san,

On Tue, Oct 21, 2025 at 7:45 AM Fujii Masao <masao.fujii@gmail.com> wrote:

On Mon, Oct 20, 2025 at 1:45 PM Shinya Kato <shinya11.kato@gmail.com> wrote:

On Mon, Oct 20, 2025 at 10:17 AM Fujii Masao <masao.fujii@gmail.com> wrote:

On Sun, Oct 19, 2025 at 2:04 AM Shinya Kato <shinya11.kato@gmail.com> wrote:

Thank you for the patch. I have one comment.

+       if (lag_tracker->overflowed[head].lsn > lsn)
+           return now - lag_tracker->overflowed[head].time;

Could this return a negative value if the clock somehow went
backwards? The original code returns -1 in this case, so I'm curious
about this.

Thanks for the review!

Yes, you're right. So I've updated the patch so that -1 is returned
when the current time is earlier than the time in the overflow entry,
treating it as "no new sample found".

Thank you for updating the patch. It looks nice to me.

Thanks for the review!
Unless there are any objections, I'll commit the patch and
backpatch it to all supported branches.

Thanks for working on this.

The patch LGTM. I am wondering whether it would be helpful to add
a comment for this overflowed array:

WalTimeSample overflowed[NUM_SYNC_REP_WAIT_MODE];

and to replace literal zeros with the constant InvalidXLogRecPtr for
better readability:

/* InvalidXLogRecPtr means no overflow yet */
if (lag_tracker->overflowed[i].lsn == InvalidXLogRecPtr)

Best,
Xuneng

#12 Fujii Masao
masao.fujii@gmail.com
In reply to: Xuneng Zhou (#11)
Re: Fix lag columns in pg_stat_replication not advancing when replay LSN stalls

On Tue, Oct 21, 2025 at 11:52 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

The patch LGTM.

Thanks for the review!
I've pushed the patch and backpatched it to all supported versions.

I am wondering whether it would be helpful to add
a comment for this overflowed array

Yes, do you have any specific suggestions?

and to replace literal zeros with the constant InvalidXLogRecPtr for
better readability.

/* InvalidXLogRecPtr means no overflow yet */
if (lag_tracker->overflowed[i].lsn == InvalidXLogRecPtr)

I couldn't find any code like "lag_tracker->overflowed[i].lsn == 0",
so I'm not sure which part should be replaced with InvalidXLogRecPtr.
Could you point me to the exact location?

Regards,

--
Fujii Masao

#13 Xuneng Zhou
xunengzhou@gmail.com
In reply to: Fujii Masao (#12)
Re: Fix lag columns in pg_stat_replication not advancing when replay LSN stalls

Hi,

On Wed, Oct 22, 2025 at 10:34 AM Fujii Masao <masao.fujii@gmail.com> wrote:

On Tue, Oct 21, 2025 at 11:52 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

The patch LGTM.

Thanks for the review!
I've pushed the patch and backpatched it to all supported versions.

I am wondering whether it would be helpful to add
a comment for this overflowed array

Yes, do you have any specific suggestions?

How about something like:

/*
* Overflow entries for read heads that collide with the write head.
*
* When the cyclic buffer fills (write head is about to collide with a read
* head), we save that read head's current sample here and mark it as using
* overflow (read_heads[i] = -1). This allows the write head to continue
* advancing while the overflowed mode continues lag computation using the
* saved sample.
*
* Once the standby's reported LSN advances past the overflow entry's LSN,
* we transition back to normal buffer-based tracking.
*/

and to replace literal zeros with the constant InvalidXLogRecPtr for
better readability.

/* InvalidXLogRecPtr means no overflow yet */
if (lag_tracker->overflowed[i].lsn == InvalidXLogRecPtr)

I couldn't find any code like "lag_tracker->overflowed[i].lsn == 0",
so I'm not sure which part should be replaced with InvalidXLogRecPtr.
Could you point me to the exact location?

Sorry for the noise here; I meant
if (lag_tracker->read_heads[head] == InvalidXLogRecPtr) earlier.
But with that change, position 0 would be treated as overflow mode,
which is clearly wrong...

Best,
Xuneng

#14 Fujii Masao
masao.fujii@gmail.com
In reply to: Xuneng Zhou (#13)
1 attachment(s)
Re: Fix lag columns in pg_stat_replication not advancing when replay LSN stalls

On Wed, Oct 22, 2025 at 4:49 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

How about something like:

/*
* Overflow entries for read heads that collide with the write head.
*
* When the cyclic buffer fills (write head is about to collide with a read
* head), we save that read head's current sample here and mark it as using
* overflow (read_heads[i] = -1). This allows the write head to continue
* advancing while the overflowed mode continues lag computation using the
* saved sample.
*
* Once the standby's reported LSN advances past the overflow entry's LSN,
* we transition back to normal buffer-based tracking.
*/

LGTM. Thanks!

I've created a patch adding your suggested comments (attached).
Since this is a follow-up to commit 883a95646a8, which was recently applied
and backpatched to all supported branches, I think we should backpatch
this one as well. Thought?

Regards,

--
Fujii Masao

Attachments:

v1-0001-Add-comments-explaining-overflow-entries-in-the-r.patch (application/octet-stream)
From 8860671a4df50497163ae0fac2671117b4c1600c Mon Sep 17 00:00:00 2001
From: Fujii Masao <fujii@postgresql.org>
Date: Wed, 22 Oct 2025 23:22:34 +0900
Subject: [PATCH v1] Add comments explaining overflow entries in the
 replication lag tracker.

Commit 883a95646a8 introduced overflow entries in the replication lag tracker
to fix an issue where lag columns in pg_stat_replication could stall when
the replay LSN stopped advancing.

This commit adds comments clarifying the purpose and behavior of overflow
entries to improve code readability and understanding.

Since commit 883a95646a8 was recently applied and backpatched to all
supported branches, this follow-up commit is also backpatched accordingly.

Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CABPTF7VxqQA_DePxyZ7Y8V+ErYyXkmwJ1P6NC+YC+cvxMipWKw@mail.gmail.com
Backpatch-through: 13
---
 src/backend/replication/walsender.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 1ce21a2ad98..548eafa7a73 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -233,6 +233,19 @@ typedef struct
 	int			write_head;
 	int			read_heads[NUM_SYNC_REP_WAIT_MODE];
 	WalTimeSample last_read[NUM_SYNC_REP_WAIT_MODE];
+
+	/*
+	 * Overflow entries for read heads that collide with the write head.
+	 *
+	 * When the cyclic buffer fills (write head is about to collide with a
+	 * read head), we save that read head's current sample here and mark it as
+	 * using overflow (read_heads[i] = -1). This allows the write head to
+	 * continue advancing while the overflowed mode continues lag computation
+	 * using the saved sample.
+	 *
+	 * Once the standby's reported LSN advances past the overflow entry's LSN,
+	 * we transition back to normal buffer-based tracking.
+	 */
 	WalTimeSample overflowed[NUM_SYNC_REP_WAIT_MODE];
 } LagTracker;
 
-- 
2.50.1

#15 Xuneng Zhou
xunengzhou@gmail.com
In reply to: Fujii Masao (#14)
Re: Fix lag columns in pg_stat_replication not advancing when replay LSN stalls

Hi,

On Wed, Oct 22, 2025 at 10:45 PM Fujii Masao <masao.fujii@gmail.com> wrote:

On Wed, Oct 22, 2025 at 4:49 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

How about something like:

/*
* Overflow entries for read heads that collide with the write head.
*
* When the cyclic buffer fills (write head is about to collide with a read
* head), we save that read head's current sample here and mark it as using
* overflow (read_heads[i] = -1). This allows the write head to continue
* advancing while the overflowed mode continues lag computation using the
* saved sample.
*
* Once the standby's reported LSN advances past the overflow entry's LSN,
* we transition back to normal buffer-based tracking.
*/

LGTM. Thanks!

I've created a patch adding your suggested comments (attached).
Since this is a follow-up to commit 883a95646a8, which was recently applied
and backpatched to all supported branches, I think we should backpatch
this one as well. Thoughts?

LGTM. Thanks for the patch!

Best,
Xuneng

#16 Fujii Masao
masao.fujii@gmail.com
In reply to: Xuneng Zhou (#15)
Re: Fix lag columns in pg_stat_replication not advancing when replay LSN stalls

On Thu, Oct 23, 2025 at 12:26 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

I've created a patch adding your suggested comments (attached).
Since this is a follow-up to commit 883a95646a8, which was recently applied
and backpatched to all supported branches, I think we should backpatch
this one as well. Thoughts?

LGTM. Thanks for the patch!

I've pushed the patch. Thanks!

Regards,

--
Fujii Masao