pg_stat_replication.*_lag sometimes shows NULL during active replication

Started by Shinya Kato14 days ago6 messages

shinya11.kato@gmail.com

14 days ago

Hi hackers,

I have noticed that pg_stat_replication.*_lag sometimes shows NULL
when inserting a record per second for health checking. This happens
when the startup process replays WAL fast enough before the
walreceiver sends its flush notification to the walsender.

Here is the sequence that triggers the issue: (See normal.svg and
error.svg for diagrams of the normal and problematic cases.)

1. The walreceiver receives, writes, and flushes WAL, then wakes the
startup process via WakeupRecovery().

2. The startup process replays all available WAL quickly, then calls
WalRcvForceReply() to set force_reply = true and wakes the
walreceiver.

3. The walreceiver sends a flush notification to the walsender
(XLogWalRcvSendReply() in XLogWalRcvFlush()). Since the startup has
already replayed the WAL by this point, this message reports the
incremented applyPtr, which equals sentPtr. The walsender processes
this message, consuming the LagTracker samples and setting
fullyAppliedLastTime = true.

4. In the next loop iteration, the walreceiver sees force_reply = true
and sends another reply with the same positions. The walsender sees
applyPtr == sentPtr for the second consecutive time and sets
clearLagTimes = true. Since the LagTracker samples were already
consumed by step 3, all lag values are -1. With clearLagTimes = true,
these -1 values are written to walsnd->*Lag, causing
pg_stat_replication to show NULL.

The comment in ProcessStandbyReplyMessage() says:

* If the standby reports that it has fully replayed the WAL in two
* consecutive reply messages, then the second such message must result
* from wal_receiver_status_interval expiring on the standby.

But as shown above, the second message can also come from
WalRcvForceReply(), violating this assumption.

The attached patch fixes this by adding a check that all lag values
are -1 to the clearLagTimes condition. This ensures that clearLagTimes
only triggers when there are truly no new lag samples in two
consecutive messages (i.e., the system is genuinely idle), and not
when the samples were simply consumed by a preceding message in a
burst of replies.

Regards,

--
Best regards,
Shinya Kato
NTT OSS Center

Fujii Masao

masao.fujii@gmail.com

8 days ago

In reply to: Shinya Kato (#1)

Re: pg_stat_replication.*_lag sometimes shows NULL during active replication

On Tue, Feb 24, 2026 at 3:54 PM Shinya Kato <shinya11.kato@gmail.com> wrote:

Hi hackers,

I have noticed that pg_stat_replication.*_lag sometimes shows NULL
when inserting a record per second for health checking. This happens
when the startup process replays WAL fast enough before the
walreceiver sends its flush notification to the walsender.

Here is the sequence that triggers the issue: (See normal.svg and
error.svg for diagrams of the normal and problematic cases.)

1. The walreceiver receives, writes, and flushes WAL, then wakes the
startup process via WakeupRecovery().

2. The startup process replays all available WAL quickly, then calls
WalRcvForceReply() to set force_reply = true and wakes the
walreceiver.

3. The walreceiver sends a flush notification to the walsender
(XLogWalRcvSendReply() in XLogWalRcvFlush()). Since the startup has
already replayed the WAL by this point, this message reports the
incremented applyPtr, which equals sentPtr. The walsender processes
this message, consuming the LagTracker samples and setting
fullyAppliedLastTime = true.

4. In the next loop iteration, the walreceiver sees force_reply = true
and sends another reply with the same positions. The walsender sees
applyPtr == sentPtr for the second consecutive time and sets
clearLagTimes = true. Since the LagTracker samples were already
consumed by step 3, all lag values are -1. With clearLagTimes = true,
these -1 values are written to walsnd->*Lag, causing
pg_stat_replication to show NULL.

The comment in ProcessStandbyReplyMessage() says:

* If the standby reports that it has fully replayed the WAL in two
* consecutive reply messages, then the second such message must result
* from wal_receiver_status_interval expiring on the standby.

But as shown above, the second message can also come from
WalRcvForceReply(), violating this assumption.

The attached patch fixes this by adding a check that all lag values
are -1 to the clearLagTimes condition. This ensures that clearLagTimes
only triggers when there are truly no new lag samples in two
consecutive messages (i.e., the system is genuinely idle), and not
when the samples were simply consumed by a preceding message in a
burst of replies.

Thanks for the patch!

With the patch applied, I set up a logical replication and inserted a row every
second. Even with continuous inserts, NULL was shown in the lag columns of
pg_stat_replication. That makes me wonder whether the patch's approach is
sufficient to address the issue.

Relying solely on replies from the standby or subscriber seems a bit fragile to
me. If the goal is to keep showing the last measured lag for some time,
perhaps we should introduce a rate limit on when NULL is displayed in the lag
columns?

For example, if there has been no activity (i.e., sentPtr == applyPtr and
applyPtr has not changed since the previous cycle) for, say, 10 seconds,
then we could allow NULL to be shown. Thought?

Regards,

--
Fujii Masao

Shinya Kato

shinya11.kato@gmail.com

4 days ago

In reply to: Fujii Masao (#2)

Re: pg_stat_replication.*_lag sometimes shows NULL during active replication

On Mon, Mar 2, 2026 at 11:44 PM Fujii Masao <masao.fujii@gmail.com> wrote:

With the patch applied, I set up a logical replication and inserted a row every
second. Even with continuous inserts, NULL was shown in the lag columns of
pg_stat_replication. That makes me wonder whether the patch's approach is
sufficient to address the issue.

Thank you for the review and testing! I had only considered the issue
in the context of physical replication, but as you pointed out, my
approach is insufficient for logical replication.

Relying solely on replies from the standby or subscriber seems a bit fragile to
me. If the goal is to keep showing the last measured lag for some time,
perhaps we should introduce a rate limit on when NULL is displayed in the lag
columns?

My primary goal was to ensure that the source code comments match the
actual behavior, as the comment stating "the second such message must
result from wal_receiver_status_interval expiring on the standby" is
inaccurate. However, as you noted, the patch alone is not sufficient
to fully address the issue.

For example, if there has been no activity (i.e., sentPtr == applyPtr and
applyPtr has not changed since the previous cycle) for, say, 10 seconds,
then we could allow NULL to be shown. Thought?

I considered a time-based rate limit, but it is difficult to choose an
appropriate threshold. Furthermore, the walsender has no way of
knowing the standby's or subscriber's wal_receiver_status_interval
setting.

The attached v2 patch takes a different approach: it additionally
requires that all reported positions (write/flush/apply) remain
unchanged from the previous reply. This directly detects a truly idle
system without relying on timeouts—if any position has advanced, new
WAL activity must have occurred, so we should not clear the lag values
even if the lag tracker is empty.
--
Best regards,
Shinya Kato
NTT OSS Center

Fujii Masao

masao.fujii@gmail.com

about 17 hours ago

In reply to: Shinya Kato (#3)

Re: pg_stat_replication.*_lag sometimes shows NULL during active replication

On Fri, Mar 6, 2026 at 4:13 PM Shinya Kato <shinya11.kato@gmail.com> wrote:

On Mon, Mar 2, 2026 at 11:44 PM Fujii Masao <masao.fujii@gmail.com> wrote:

With the patch applied, I set up a logical replication and inserted a row every
second. Even with continuous inserts, NULL was shown in the lag columns of
pg_stat_replication. That makes me wonder whether the patch's approach is
sufficient to address the issue.

Thank you for the review and testing! I had only considered the issue
in the context of physical replication, but as you pointed out, my
approach is insufficient for logical replication.

Relying solely on replies from the standby or subscriber seems a bit fragile to
me. If the goal is to keep showing the last measured lag for some time,
perhaps we should introduce a rate limit on when NULL is displayed in the lag
columns?

My primary goal was to ensure that the source code comments match the
actual behavior, as the comment stating "the second such message must
result from wal_receiver_status_interval expiring on the standby" is
inaccurate. However, as you noted, the patch alone is not sufficient
to fully address the issue.

For example, if there has been no activity (i.e., sentPtr == applyPtr and
applyPtr has not changed since the previous cycle) for, say, 10 seconds,
then we could allow NULL to be shown. Thought?

I considered a time-based rate limit, but it is difficult to choose an
appropriate threshold. Furthermore, the walsender has no way of
knowing the standby's or subscriber's wal_receiver_status_interval
setting.

The attached v2 patch takes a different approach: it additionally
requires that all reported positions (write/flush/apply) remain
unchanged from the previous reply. This directly detects a truly idle
system without relying on timeouts—if any position has advanced, new
WAL activity must have occurred, so we should not clear the lag values
even if the lag tracker is empty.

This approach looks good to me.

One comment: currently, the lag becomes NULL basically after about one
wal_receiver_status_interval during periods of no activity. OTOH, with this
approach, it seems it would take about twice wal_receiver_status_interval.
Is this understanding correct?

Regards,

--
Fujii Masao

Shinya Kato

shinya11.kato@gmail.com

about 4 hours ago

In reply to: Fujii Masao (#4)

Re: pg_stat_replication.*_lag sometimes shows NULL during active replication

On Mon, Mar 9, 2026 at 8:21 PM Fujii Masao <masao.fujii@gmail.com> wrote:

The attached v2 patch takes a different approach: it additionally
requires that all reported positions (write/flush/apply) remain
unchanged from the previous reply. This directly detects a truly idle
system without relying on timeouts—if any position has advanced, new
WAL activity must have occurred, so we should not clear the lag values
even if the lag tracker is empty.

This approach looks good to me.

Thank you for looking into this.

One comment: currently, the lag becomes NULL basically after about one
wal_receiver_status_interval during periods of no activity. OTOH, with this
approach, it seems it would take about twice wal_receiver_status_interval.
Is this understanding correct?

Exactly. With this patch, it takes about two
wal_receiver_status_interval cycles to show NULL instead of one. I
think this is an acceptable trade-off because it is better to take a
bit longer to detect inactivity than to incorrectly show NULL during
active replication.

--
Best regards,
Shinya Kato
NTT OSS Center

Fujii Masao

masao.fujii@gmail.com

about 3 hours ago

In reply to: Shinya Kato (#5)

Re: pg_stat_replication.*_lag sometimes shows NULL during active replication

On Tue, Mar 10, 2026 at 10:02 AM Shinya Kato <shinya11.kato@gmail.com> wrote:

On Mon, Mar 9, 2026 at 8:21 PM Fujii Masao <masao.fujii@gmail.com> wrote:

The attached v2 patch takes a different approach: it additionally
requires that all reported positions (write/flush/apply) remain
unchanged from the previous reply. This directly detects a truly idle
system without relying on timeouts—if any position has advanced, new
WAL activity must have occurred, so we should not clear the lag values
even if the lag tracker is empty.

This approach looks good to me.

Thank you for looking into this.

One comment: currently, the lag becomes NULL basically after about one
wal_receiver_status_interval during periods of no activity. OTOH, with this
approach, it seems it would take about twice wal_receiver_status_interval.
Is this understanding correct?

Exactly. With this patch, it takes about two
wal_receiver_status_interval cycles to show NULL instead of one. I
think this is an acceptable trade-off because it is better to take a
bit longer to detect inactivity than to incorrectly show NULL during
active replication.

Even with your latest patch, if we remove fullyAppliedLastTime, and set
clearLagTimes to true when applyPtr == sentPtr && noLagSamples &&
positionsUnchanged,
wouldn't the time for the lag to become NULL be almost the same as
wal_receiver_status_interval?

The documentation doesn't clearly specify how long it should take for
the lag to become NULL, so doubling that time might be acceptable.
However, if we can keep it roughly the same without much complexity,
I think that would be preferable.

Thought?

--
Fujii Masao

pg_stat_replication.*_lag sometimes shows NULL during active replication

Attachments:

Attachments: