Add pg_stat_recovery system view

xunengzhou@gmail.com

about 2 months ago

In reply to: Xuneng Zhou (#1)

Re: Add pg_stat_recovery system view

Hi,

On Tue, Jan 27, 2026 at 3:23 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi Hackers,

This patch series introduces pg_stat_recovery, a new view that exposes
the startup process’s internal recovery state at the SQL level. It
follows the pattern of pg_stat_wal_receiver, but focuses on WAL replay
and recovery rather than WAL reception.

The view provides visibility into replay progress, recovery timing,
and operational status—information that was previously scattered
across separate function calls or not exposed at all.

As Michael suggested [1][2], this view is intentionally independent of
walreceiver state, since the startup process can consume WAL from
multiple sources (archive, pg_wal, or streaming), not just streaming
replication.

The patch series:
0001: Refactor: move XLogRecoveryCtlData struct to xlogrecovery.h
Move the XLogRecoveryCtlData struct definition from xlogrecovery.c to
xlogrecovery.h,

0002: Add pg_stat_recovery system view
Introduces the core view with columns

0003: Refactor: move XLogSource enum to xlogrecovery.h
Preparatory refactoring to make XLogSource visible externally.

0004: Add wal_source column to pg_stat_recovery
Adds wal_source column showing where WAL was last read from:
'archive', 'pg_wal', or 'stream'.

Example usage on a standby:

SELECT promote_triggered, pause_state, wal_source,
pg_size_pretty(pg_wal_lsn_diff(replay_end_lsn, last_replayed_end_lsn))
FROM pg_stat_recovery;

promote_triggered | pause_state | wal_source | pg_size_pretty
-------------------+-------------+------------+----------------
f | not paused | stream | 0 bytes

Feedbacks welcome.

[1] /messages/by-id/aW13GJn_RfTJIFCa@paquier.xyz
[2] /messages/by-id/aW68b79-9U3WPZiz@paquier.xyz

--
Best,
Xuneng

Just rebase.

--
Best,
Xuneng

michael@paquier.xyz

about 2 months ago

In reply to: Xuneng Zhou (#2)

Re: Add pg_stat_recovery system view

On Wed, Mar 04, 2026 at 09:02:29AM +0800, Xuneng Zhou wrote:

Just rebase.

I have applied 0001, that simply moves some code around.

Regarding 0002, I would recommend to not bump the catalog version in
catversion.h when sending a patch to the lists. This task is left to
committers when the code gets merged into the tree. And this is
annoying for submitters because it can create a lot of conflicts.

Using a set-returning function is I think wrong here, betraying the
representation of the recovery status as stored in the system. We
know that there is only one state of recovery, fixed in shared memory.
Like the cousins of this new function, let's make thinks non-SRF,
returning one row all the time with PG_RETURN_NULL() if the conditions
for information display are not satisfied. When we are not in
recovery or when the role querying the function is not granted
ROLE_PG_READ_ALL_STATS, that will simplify the patch as there is no
need for the else branch with the nulls, as written in your patch.

The field values are acquired the right way, spinlock acquisitions
have to be short.

Like pg_stat_wal_receiver, let's add to the view definition a qual to
return a row only if the fields are not NULL.

pg_get_wal_replay_pause_state() displays the pause_state, and it is
not hidden behind the stats read role. I am not really convinced that
this is worth bothering to treat as an exception in this patch. It
makes it a bit more complicated, for little gain. I would just hide
all the fields behind the role granted, to keep the code simpler, even
if that's slightly inconsistent with pg_get_wal_replay_pause_state().

After writing this last point, as far as I can see there is a small
optimization possible in the patch. When a role is not granted
ROLE_PG_READ_ALL_STATS, we know that we will not return any
information so we could skip the spinlock acquisition and avoid
spinlock acquisitions when one queries the function but has no access
to its data.

+ True if a promotion has been triggered for this standby server.

Standby servers are implied if data is returned, this sentence can be
simpler and cut the "standby server" part.

+       Start LSN of the last WAL record replayed during recovery. 
[...]
+       End LSN of the last WAL record replayed during recovery.
[...]
+       Timeline of the last replayed WAL record. 
For other system views with LSN information, we don't use "LSN", but
"write-ahead log location".  I'd suggest the same term here.  These
three fields refer to the last record *successfully* replayed.  It
seems important to mention this fact, I guess?

+       <structfield>replay_end_lsn</structfield> <type>pg_lsn</type>
+      </para>
+      <para>
+       Current replay position. When replaying a record, this is the end
+       position of the record being replayed; otherwise it equals
+       <structfield>last_replayed_end_lsn</structfield>.

Slightly inexact. This is the end LSN + 1.

+       <structfield>replay_end_lsn</structfield> <type>pg_lsn</type>
[..]
+       <structfield>replay_end_tli</structfield> <type>integer</type>

Okay, I can fall behind the addition of these two, it could be helpful
in debugging something like a locking issue when replaying a specific
record, I guess. At least I'd want to know what is happening for a
record currently being replayed. It seems to me that this could be
more precise, mentioning that these refer to a record *currently*
being replayed.

Is recoveryLastXTime actually that relevant to have? We use it for
logging and for recovery target times, which is something less
relevant than the other fields, especially if we think about standbys
where these have no targets to reach.

currentChunkStartTime, on the contrary, is much more relevant to me,
due to the fact that we use it in WaitForWALToBecomeAvailable() with
active WAL receivers running.
--
Michael

xunengzhou@gmail.com

about 2 months ago

In reply to: Michael Paquier (#3)

Re: Add pg_stat_recovery system view

Hi Michael,

Thanks for the detailed feedback!

On Thu, Mar 5, 2026 at 11:58 AM Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Mar 04, 2026 at 09:02:29AM +0800, Xuneng Zhou wrote:

Just rebase.

I have applied 0001, that simply moves some code around.

Thanks for applying.

Regarding 0002, I would recommend to not bump the catalog version in
catversion.h when sending a patch to the lists. This task is left to
committers when the code gets merged into the tree. And this is
annoying for submitters because it can create a lot of conflicts.

I'll leave it untouched.

Using a set-returning function is I think wrong here, betraying the
representation of the recovery status as stored in the system. We
know that there is only one state of recovery, fixed in shared memory.
Like the cousins of this new function, let's make thinks non-SRF,
returning one row all the time with PG_RETURN_NULL() if the conditions
for information display are not satisfied. When we are not in
recovery or when the role querying the function is not granted
ROLE_PG_READ_ALL_STATS, that will simplify the patch as there is no
need for the else branch with the nulls, as written in your patch.
The field values are acquired the right way, spinlock acquisitions
have to be short.

My earlier thought for keeping pg_stat_get_recovery as an SRF is to
make pg_stat_recovery simpler by avoiding an extra filter such as
WHERE s.promote_triggered IS NOT NULL to preserve 0/1-row semantics.
pg_stat_get_recovery_prefetch also uses the SRF pattern.

Like pg_stat_wal_receiver, let's add to the view definition a qual to
return a row only if the fields are not NULL.

pg_get_wal_replay_pause_state() displays the pause_state, and it is
not hidden behind the stats read role. I am not really convinced that
this is worth bothering to treat as an exception in this patch. It
makes it a bit more complicated, for little gain. I would just hide
all the fields behind the role granted, to keep the code simpler, even
if that's slightly inconsistent with pg_get_wal_replay_pause_state().

I agree that exposing a subset of columns unconditionally is not worth
the added complexity.

After writing this last point, as far as I can see there is a small
optimization possible in the patch. When a role is not granted
ROLE_PG_READ_ALL_STATS, we know that we will not return any
information so we could skip the spinlock acquisition and avoid
spinlock acquisitions when one queries the function but has no access
to its data.

Makes sense to me. I'll apply it as suggested.

+ True if a promotion has been triggered for this standby server.

Standby servers are implied if data is returned, this sentence can be
simpler and cut the "standby server" part.

+ 1

+       Start LSN of the last WAL record replayed during recovery.
[...]
+       End LSN of the last WAL record replayed during recovery.
[...]
+       Timeline of the last replayed WAL record.
For other system views with LSN information, we don't use "LSN", but
"write-ahead log location".  I'd suggest the same term here.  These
three fields refer to the last record *successfully* replayed.  It
seems important to mention this fact, I guess?

I'll replace them.

+       <structfield>replay_end_lsn</structfield> <type>pg_lsn</type>
+      </para>
+      <para>
+       Current replay position. When replaying a record, this is the end
+       position of the record being replayed; otherwise it equals
+       <structfield>last_replayed_end_lsn</structfield>.

Slightly inexact. This is the end LSN + 1.

Yeh, this needs to be corrected.

+       <structfield>replay_end_lsn</structfield> <type>pg_lsn</type>
[..]
+       <structfield>replay_end_tli</structfield> <type>integer</type>
Okay, I can fall behind the addition of these two, it could be helpful
in debugging something like a locking issue when replaying a specific
record, I guess. At least I'd want to know what is happening for a
record currently being replayed. It seems to me that this could be
more precise, mentioning that these refer to a record *currently*
being replayed.

I will adjust the docs to say these describe the record currently
being replayed, with replay_end_lsn being the end position + 1.

Is recoveryLastXTime actually that relevant to have? We use it for
logging and for recovery target times, which is something less
relevant than the other fields, especially if we think about standbys
where these have no targets to reach.

I agree it is less central for standby monitoring (and partly overlaps
with pg_last_xact_replay_timestamp()), so I’ll remove it from this
view in the next revision.

currentChunkStartTime, on the contrary, is much more relevant to me,
due to the fact that we use it in WaitForWALToBecomeAvailable() with
active WAL receivers running.

Yeah, it could be useful for apply-delay/catch-up diagnostics.

--
Best,
Xuneng

xunengzhou@gmail.com

about 2 months ago

In reply to: Xuneng Zhou (#4)

Re: Add pg_stat_recovery system view

Hi,

On Thu, Mar 5, 2026 at 3:37 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi Michael,

Thanks for the detailed feedback!

On Thu, Mar 5, 2026 at 11:58 AM Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Mar 04, 2026 at 09:02:29AM +0800, Xuneng Zhou wrote:

Just rebase.

I have applied 0001, that simply moves some code around.

Thanks for applying.

Regarding 0002, I would recommend to not bump the catalog version in
catversion.h when sending a patch to the lists. This task is left to
committers when the code gets merged into the tree. And this is
annoying for submitters because it can create a lot of conflicts.

I'll leave it untouched.

Using a set-returning function is I think wrong here, betraying the
representation of the recovery status as stored in the system. We
know that there is only one state of recovery, fixed in shared memory.
Like the cousins of this new function, let's make thinks non-SRF,
returning one row all the time with PG_RETURN_NULL() if the conditions
for information display are not satisfied. When we are not in
recovery or when the role querying the function is not granted
ROLE_PG_READ_ALL_STATS, that will simplify the patch as there is no
need for the else branch with the nulls, as written in your patch.
The field values are acquired the right way, spinlock acquisitions
have to be short.

My earlier thought for keeping pg_stat_get_recovery as an SRF is to
make pg_stat_recovery simpler by avoiding an extra filter such as
WHERE s.promote_triggered IS NOT NULL to preserve 0/1-row semantics.
pg_stat_get_recovery_prefetch also uses the SRF pattern.

Like pg_stat_wal_receiver, let's add to the view definition a qual to
return a row only if the fields are not NULL.

pg_get_wal_replay_pause_state() displays the pause_state, and it is
not hidden behind the stats read role. I am not really convinced that
this is worth bothering to treat as an exception in this patch. It
makes it a bit more complicated, for little gain. I would just hide
all the fields behind the role granted, to keep the code simpler, even
if that's slightly inconsistent with pg_get_wal_replay_pause_state().

I agree that exposing a subset of columns unconditionally is not worth
the added complexity.

After writing this last point, as far as I can see there is a small
optimization possible in the patch. When a role is not granted
ROLE_PG_READ_ALL_STATS, we know that we will not return any
information so we could skip the spinlock acquisition and avoid
spinlock acquisitions when one queries the function but has no access
to its data.

Makes sense to me. I'll apply it as suggested.

+ True if a promotion has been triggered for this standby server.

Standby servers are implied if data is returned, this sentence can be
simpler and cut the "standby server" part.

+ 1
+       Start LSN of the last WAL record replayed during recovery.
[...]
+       End LSN of the last WAL record replayed during recovery.
[...]
+       Timeline of the last replayed WAL record.
For other system views with LSN information, we don't use "LSN", but
"write-ahead log location".  I'd suggest the same term here.  These
three fields refer to the last record *successfully* replayed.  It
seems important to mention this fact, I guess?
I'll replace them.
+       <structfield>replay_end_lsn</structfield> <type>pg_lsn</type>
+      </para>
+      <para>
+       Current replay position. When replaying a record, this is the end
+       position of the record being replayed; otherwise it equals
+       <structfield>last_replayed_end_lsn</structfield>.
Slightly inexact. This is the end LSN + 1.
Yeh, this needs to be corrected.
+       <structfield>replay_end_lsn</structfield> <type>pg_lsn</type>
[..]
+       <structfield>replay_end_tli</structfield> <type>integer</type>
Okay, I can fall behind the addition of these two, it could be helpful
in debugging something like a locking issue when replaying a specific
record, I guess. At least I'd want to know what is happening for a
record currently being replayed. It seems to me that this could be
more precise, mentioning that these refer to a record *currently*
being replayed.
I will adjust the docs to say these describe the record currently
being replayed, with replay_end_lsn being the end position + 1.

Is recoveryLastXTime actually that relevant to have? We use it for
logging and for recovery target times, which is something less
relevant than the other fields, especially if we think about standbys
where these have no targets to reach.

I agree it is less central for standby monitoring (and partly overlaps
with pg_last_xact_replay_timestamp()), so I’ll remove it from this
view in the next revision.

currentChunkStartTime, on the contrary, is much more relevant to me,
due to the fact that we use it in WaitForWALToBecomeAvailable() with
active WAL receivers running.

Yeah, it could be useful for apply-delay/catch-up diagnostics.

Here is the updated patch set. Please take a look.

--
Best,
Xuneng

michael@paquier.xyz

about 2 months ago

In reply to: Xuneng Zhou (#5)

Re: Add pg_stat_recovery system view

On Thu, Mar 05, 2026 at 09:01:35PM +0800, Xuneng Zhou wrote:

Here is the updated patch set. Please take a look.

Thanks. It was mostly fine, still:
- The extra mention of ROLE_PG_READ_ALL_STATS was not necessary. We
already have one at the top of the section.
- After sleeping on it, I have decided to keep recoveryLastXTime,
partially because we have a system function to retrieve it already.
- There was a refactoring piece possible when we switch a
RecoveryPauseState to a non-translatable string, which I have applied
separately.
- Added one test with pg_stat_recovery checked on a standby.
--
Michael

Shinya Kato

shinya11.kato@gmail.com

about 2 months ago

In reply to: Michael Paquier (#6)

Re: Add pg_stat_recovery system view

Reading the committed patch, it appears that the values for
current_chunk_start_time and recovery_last_xact_time have been
swapped. Could you please look into this?

I have attached the patch (with redundant comments removed).

--
Best regards,
Shinya Kato
NTT OSS Center

michael@paquier.xyz

about 2 months ago

In reply to: Shinya Kato (#7)

Re: Add pg_stat_recovery system view

On Fri, Mar 06, 2026 at 02:19:01PM +0900, Shinya Kato wrote:

Reading the committed patch, it appears that the values for
current_chunk_start_time and recovery_last_xact_time have been
swapped. Could you please look into this?

I have attached the patch (with redundant comments removed).

Oops, indeed. Will fix shortly.
--
Michael

yangyz

1197620467@qq.com

about 2 months ago

In reply to: Xuneng Zhou (#5)

Re: Add pg_stat_recovery system view

I reviewed the patch you submitted and identified two issues.

1.In the Add pg_stat_recovery system view patch file, the documentation 
modification indicates that the lack of permissions only results in the inability 
to view a few specific columns. But the implementation of the code is:

 if (! has_privs_of_role(GetUserId(), ROLE_PG_READ_ALL_STATS))
PG_RETURN_NULL(); 

If there is no permission, return an empty line. This is inconsistent with the written document.
2.In the function pg_stat_get_recovery(), the two arrays "Datum *values" and "bool *nulls" consist of a fixed set of seven elements, so there is no need for dynamic allocation.

Regards,

Yuanzhuo

原始邮件

发件人：Xuneng Zhou <xunengzhou@gmail.com>
发件时间：2026年3月5日 21:01
收件人：Michael Paquier <michael@paquier.xyz>
抄送：pgsql-hackers <pgsql-hackers@lists.postgresql.org>
主题：Re: Add pg_stat_recovery system view

Hi,

On Thu, Mar 5, 2026 at 3:37 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> Hi Michael,
>
> Thanks for the detailed feedback!
>
> On Thu, Mar 5, 2026 at 11:58 AM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Wed, Mar 04, 2026 at 09:02:29AM +0800, Xuneng Zhou wrote:
> > > Just rebase.
> >
> > I have applied 0001, that simply moves some code around.
> >
>
> Thanks for applying.
>
> > Regarding 0002, I would recommend to not bump the catalog version in
> > catversion.h when sending a patch to the lists.  This task is left to
> > committers when the code gets merged into the tree.  And this is
> > annoying for submitters because it can create a lot of conflicts.
>
> I'll leave it untouched.
>
> > Using a set-returning function is I think wrong here, betraying the
> > representation of the recovery status as stored in the system.  We
> > know that there is only one state of recovery, fixed in shared memory.
> > Like the cousins of this new function, let's make thinks non-SRF,
> > returning one row all the time with PG_RETURN_NULL() if the conditions
> > for information display are not satisfied.  When we are not in
> > recovery or when the role querying the function is not granted
> > ROLE_PG_READ_ALL_STATS, that will simplify the patch as there is no
> > need for the else branch with the nulls, as written in your patch.
> > The field values are acquired the right way, spinlock acquisitions
> > have to be short.
>
> My earlier thought for keeping pg_stat_get_recovery as an SRF is to
> make pg_stat_recovery simpler by avoiding an extra filter such as
> WHERE s.promote_triggered IS NOT NULL to preserve 0/1-row semantics.
> pg_stat_get_recovery_prefetch also uses the SRF pattern.
>
> > Like pg_stat_wal_receiver, let's add to the view definition a qual to
> > return a row only if the fields are not NULL.
> >
> > pg_get_wal_replay_pause_state() displays the pause_state, and it is
> > not hidden behind the stats read role.  I am not really convinced that
> > this is worth bothering to treat as an exception in this patch.  It
> > makes it a bit more complicated, for little gain.  I would just hide
> > all the fields behind the role granted, to keep the code simpler, even
> > if that's slightly inconsistent with pg_get_wal_replay_pause_state().
>
> I agree that exposing a subset of columns unconditionally is not worth
> the added complexity.
>
> > After writing this last point, as far as I can see there is a small
> > optimization possible in the patch.  When a role is not granted
> > ROLE_PG_READ_ALL_STATS, we know that we will not return any
> > information so we could skip the spinlock acquisition and avoid
> > spinlock acquisitions when one queries the function but has no access
> > to its data.
>
> Makes sense to me. I'll apply it as suggested.
>
> > +       True if a promotion has been triggered for this standby server.
> >
> > Standby servers are implied if data is returned, this sentence can be
> > simpler and cut the "standby server" part.
>
> + 1
>
> > +       Start LSN of the last WAL record replayed during recovery.
> > [...]
> > +       End LSN of the last WAL record replayed during recovery.
> > [...]
> > +       Timeline of the last replayed WAL record.
> > For other system views with LSN information, we don't use "LSN", but
> > "write-ahead log location".  I'd suggest the same term here.  These
> > three fields refer to the last record *successfully* replayed.  It
> > seems important to mention this fact, I guess?
>
> I'll replace them.
>
> > +       <structfield>replay_end_lsn</structfield> <type>pg_lsn</type>
> > +      </para>
> > +      <para>
> > +       Current replay position. When replaying a record, this is the end
> > +       position of the record being replayed; otherwise it equals
> > +       <structfield>last_replayed_end_lsn</structfield>.
> >
> > Slightly inexact.  This is the end LSN + 1.
>
> Yeh, this needs to be corrected.
>
> > +       <structfield>replay_end_lsn</structfield> <type>pg_lsn</type>
> > [..]
> > +       <structfield>replay_end_tli</structfield> <type>integer</type>
> >
> > Okay, I can fall behind the addition of these two, it could be helpful
> > in debugging something like a locking issue when replaying a specific
> > record, I guess.  At least I'd want to know what is happening for a
> > record currently being replayed.  It seems to me that this could be
> > more precise, mentioning that these refer to a record *currently*
> > being replayed.
>
> I will adjust the docs to say these describe the record currently
> being replayed, with replay_end_lsn being the end position + 1.
>
> > Is recoveryLastXTime actually that relevant to have?  We use it for
> > logging and for recovery target times, which is something less
> > relevant than the other fields, especially if we think about standbys
> > where these have no targets to reach.
>
> I agree it is less central for standby monitoring (and partly overlaps
> with pg_last_xact_replay_timestamp()), so I’ll remove it from this
> view in the next revision.
>
> > currentChunkStartTime, on the contrary, is much more relevant to me,
> > due to the fact that we use it in WaitForWALToBecomeAvailable() with
> > active WAL receivers running.
>
> Yeah, it could be useful for apply-delay/catch-up diagnostics.
>

Here is the updated patch set. Please take a look.

--
Best,
Xuneng

#10

xunengzhou@gmail.com

about 2 months ago

In reply to: Michael Paquier (#6)

Re: Add pg_stat_recovery system view

Hi,

On Fri, Mar 6, 2026 at 12:00 PM Michael Paquier <michael@paquier.xyz> wrote:

On Thu, Mar 05, 2026 at 09:01:35PM +0800, Xuneng Zhou wrote:

Here is the updated patch set. Please take a look.

Thanks. It was mostly fine, still:
- The extra mention of ROLE_PG_READ_ALL_STATS was not necessary. We
already have one at the top of the section.
- After sleeping on it, I have decided to keep recoveryLastXTime,
partially because we have a system function to retrieve it already.
- There was a refactoring piece possible when we switch a
RecoveryPauseState to a non-translatable string, which I have applied
separately

It looks cleaner.

- Added one test with pg_stat_recovery checked on a standby.
--
Michael

Thanks for improving and pushing!

--
Best,
Xuneng

#11

xunengzhou@gmail.com

about 2 months ago

In reply to: yangyz (#9)

Re: Add pg_stat_recovery system view

Hi Yuanzhuo,

Thanks for looking into it.

On Fri, Mar 6, 2026 at 1:48 PM yangyz <1197620467@qq.com> wrote:

I reviewed the patch you submitted and identified two issues.

1.In the Add pg_stat_recovery system view patch file, the documentation
modification indicates that the lack of permissions only results in the
inability
to view a few specific columns. But the implementation of the code is:

if (! has_privs_of_role(GetUserId(), ROLE_PG_READ_ALL_STATS))
PG_RETURN_NULL();

If there is no permission, return an empty line. This is inconsistent with

the written document.

Yeah, this is a mismatch in the patch v3. However, it's been removed by
Michael in the commit. So it should be fine in HEAD.

2.In the function pg_stat_get_recovery(), the two arrays "Datum *values"
and "bool *nulls" consist of a fixed set of seven elements, so there is no
need for dynamic allocation.

For a single-row function with ~8–10 columns, saving one palloc is a
micro-optimization, not a major performance issue. I am not that sure of
the benefit it brings, still preparing a small patch to turn the palloc
into a fixed stack array.

--
Best,
Xuneng

#12

michael@paquier.xyz

about 2 months ago

In reply to: Xuneng Zhou (#11)

Re: Add pg_stat_recovery system view

On Fri, Mar 06, 2026 at 02:21:49PM +0800, Xuneng Zhou wrote:

For a single-row function with ~8–10 columns, saving one palloc is a
micro-optimization, not a major performance issue. I am not that sure of
the benefit it brings, still preparing a small patch to turn the palloc
into a fixed stack array.

Yes, I don't see a point in changing anything here.
--
Michael

#13

li.evan.chao@gmail.com

about 2 months ago

In reply to: Xuneng Zhou (#11)

Re: Add pg_stat_recovery system view

On Mar 6, 2026, at 14:21, Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi Yuanzhuo,

Thanks for looking into it.

On Fri, Mar 6, 2026 at 1:48 PM yangyz <1197620467@qq.com> wrote:
I reviewed the patch you submitted and identified two issues.

1.In the Add pg_stat_recovery system view patch file, the documentation
modification indicates that the lack of permissions only results in the inability
to view a few specific columns. But the implementation of the code is:

if (! has_privs_of_role(GetUserId(), ROLE_PG_READ_ALL_STATS))
PG_RETURN_NULL(); If there is no permission, return an empty line. This is inconsistent with the written document.

Yeah, this is a mismatch in the patch v3. However, it's been removed by Michael in the commit. So it should be fine in HEAD.
2.In the function pg_stat_get_recovery(), the two arrays "Datum *values" and "bool *nulls" consist of a fixed set of seven elements, so there is no need for dynamic allocation.

For a single-row function with ~8–10 columns, saving one palloc is a micro-optimization, not a major performance issue. I am not that sure of the benefit it brings, still preparing a small patch to turn the palloc into a fixed stack array.
--
Best,
Xuneng

Hi Xuneng,

I was reviewing the patch. Obviously, Michael was lightning fast and has already pushed 0001. I have one small additional comment on pushed 0001.
```
if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
elog(ERROR, "return type must be a row type");
```

This uses elog(ERROR), while the other functions in the same file use ereport(ERROR). I think ereport is generally preferred nowadays over elog. This is a minor point, so only fix it if you have a chance.

I will send my comments on the remaining two patches separately.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

#14

[1]: https://www.postgresql.org/docs/devel/error-message-reporting.html

li.evan.chao@gmail.com

about 2 months ago

In reply to: Xuneng Zhou (#5)

Re: Add pg_stat_recovery system view

On Mar 5, 2026, at 21:01, Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi,

On Thu, Mar 5, 2026 at 3:37 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi Michael,

Thanks for the detailed feedback!

On Thu, Mar 5, 2026 at 11:58 AM Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Mar 04, 2026 at 09:02:29AM +0800, Xuneng Zhou wrote:

Just rebase.

I have applied 0001, that simply moves some code around.

Thanks for applying.

Regarding 0002, I would recommend to not bump the catalog version in
catversion.h when sending a patch to the lists. This task is left to
committers when the code gets merged into the tree. And this is
annoying for submitters because it can create a lot of conflicts.

I'll leave it untouched.

Using a set-returning function is I think wrong here, betraying the
representation of the recovery status as stored in the system. We
know that there is only one state of recovery, fixed in shared memory.
Like the cousins of this new function, let's make thinks non-SRF,
returning one row all the time with PG_RETURN_NULL() if the conditions
for information display are not satisfied. When we are not in
recovery or when the role querying the function is not granted
ROLE_PG_READ_ALL_STATS, that will simplify the patch as there is no
need for the else branch with the nulls, as written in your patch.
The field values are acquired the right way, spinlock acquisitions
have to be short.

My earlier thought for keeping pg_stat_get_recovery as an SRF is to
make pg_stat_recovery simpler by avoiding an extra filter such as
WHERE s.promote_triggered IS NOT NULL to preserve 0/1-row semantics.
pg_stat_get_recovery_prefetch also uses the SRF pattern.

Like pg_stat_wal_receiver, let's add to the view definition a qual to
return a row only if the fields are not NULL.

pg_get_wal_replay_pause_state() displays the pause_state, and it is
not hidden behind the stats read role. I am not really convinced that
this is worth bothering to treat as an exception in this patch. It
makes it a bit more complicated, for little gain. I would just hide
all the fields behind the role granted, to keep the code simpler, even
if that's slightly inconsistent with pg_get_wal_replay_pause_state().

I agree that exposing a subset of columns unconditionally is not worth
the added complexity.

After writing this last point, as far as I can see there is a small
optimization possible in the patch. When a role is not granted
ROLE_PG_READ_ALL_STATS, we know that we will not return any
information so we could skip the spinlock acquisition and avoid
spinlock acquisitions when one queries the function but has no access
to its data.

Makes sense to me. I'll apply it as suggested.

+ True if a promotion has been triggered for this standby server.

Standby servers are implied if data is returned, this sentence can be
simpler and cut the "standby server" part.

+ 1
+       Start LSN of the last WAL record replayed during recovery.
[...]
+       End LSN of the last WAL record replayed during recovery.
[...]
+       Timeline of the last replayed WAL record.
For other system views with LSN information, we don't use "LSN", but
"write-ahead log location".  I'd suggest the same term here.  These
three fields refer to the last record *successfully* replayed.  It
seems important to mention this fact, I guess?
I'll replace them.
+       <structfield>replay_end_lsn</structfield> <type>pg_lsn</type>
+      </para>
+      <para>
+       Current replay position. When replaying a record, this is the end
+       position of the record being replayed; otherwise it equals
+       <structfield>last_replayed_end_lsn</structfield>.
Slightly inexact. This is the end LSN + 1.
Yeh, this needs to be corrected.
+       <structfield>replay_end_lsn</structfield> <type>pg_lsn</type>
[..]
+       <structfield>replay_end_tli</structfield> <type>integer</type>
Okay, I can fall behind the addition of these two, it could be helpful
in debugging something like a locking issue when replaying a specific
record, I guess. At least I'd want to know what is happening for a
record currently being replayed. It seems to me that this could be
more precise, mentioning that these refer to a record *currently*
being replayed.
I will adjust the docs to say these describe the record currently
being replayed, with replay_end_lsn being the end position + 1.

Is recoveryLastXTime actually that relevant to have? We use it for
logging and for recovery target times, which is something less
relevant than the other fields, especially if we think about standbys
where these have no targets to reach.

I agree it is less central for standby monitoring (and partly overlaps
with pg_last_xact_replay_timestamp()), so I’ll remove it from this
view in the next revision.

currentChunkStartTime, on the contrary, is much more relevant to me,
due to the fact that we use it in WaitForWALToBecomeAvailable() with
active WAL receivers running.

Yeah, it could be useful for apply-delay/catch-up diagnostics.
Here is the updated patch set. Please take a look.

--
Best,
Xuneng
<v3-0001-Add-pg_stat_recovery-system-view.patch><v3-0002-Refactor-move-XLogSource-enum-to-xlogrecovery.h.patch><v3-0003-Add-wal_source-column-to-pg_stat_recovery.patch>

0001 has been pushed.

0002 is pure refactoring and only moves the structure XLogSource from .h to .c, thus no comment.

A few comments on 0003.

1
```
+	/*
+	 * Source from which the startup process most recently read WAL data.
+	 * Updated when the startup process successfully reads WAL from a source.
+	 * Note: this reflects the read source, not the original receipt source;
+	 * streamed WAL read from local files will show XLOG_FROM_PG_WAL.
+	 */
+	XLogSource	lastReadSource;
```

This comment says, "streamed WAL read from local files will show XLOG_FROM_PG_WAL”. But I don’t see how the conversion happens. In WaitForWALToBecomeAvailable(), there is a path XLogFileRead() is called with XLOG_FROM_STREAM, but in XLogFileRead(), source is directly assigned to XLogRecoveryCtl->lastReadSource without conversion.

On the other side, if “stream” is impossible, then the doc should not mention it:
```
+         <para>
+          <literal>stream</literal>: WAL actively being streamed from the
+          upstream server.
+         </para>
```

2 Related to 1. WRT the new column name wal_source, it sounds like “where WAL is coming from”. Say the comment of lastReadSource is true, WAL from stream is shown as pg_wal, “wal source” sounds more like stream, and “wal read source” is pg_wal. From this perspective, I just feel the new column name should be something like “wal_read_source”.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

#15

Tom Lane

tgl@sss.pgh.pa.us

about 2 months ago

In reply to: Chao Li (#13)

Re: Add pg_stat_recovery system view

Chao Li <li.evan.chao@gmail.com> writes:

I have one small additional comment on pushed 0001.
```
if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
elog(ERROR, "return type must be a row type");
```

This uses elog(ERROR), while the other functions in the same file use ereport(ERROR). I think ereport is generally preferred nowadays over elog.

No: you are incorrect and this snippet is perfectly normal (in fact,
probably copied-and-pasted from one of many other occurrences).
The actual coding rule is basically "use ereport() for user-facing
errors and elog() for not-supposed-to-happen errors". What we're
after is to not expend translator effort on not-supposed-to-happen
error messages. While you can build a ereport call that's not
translated, elog() is a lower-notation way to get the same result.
See [1]https://www.postgresql.org/docs/devel/error-message-reporting.html, particularly the elog() discussion near the end of the
page.

I've not read the patch so I don't know if it made sane ereport-vs-
elog choices elsewhere, but this one is fine.

regards, tom lane

#16

xunengzhou@gmail.com

about 2 months ago

In reply to: Chao Li (#14)

Re: Add pg_stat_recovery system view

Hi,

Thanks for looking into this.

On Fri, Mar 6, 2026 at 3:32 PM Chao Li <li.evan.chao@gmail.com> wrote:

On Mar 5, 2026, at 21:01, Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi,

On Thu, Mar 5, 2026 at 3:37 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi Michael,

Thanks for the detailed feedback!

On Thu, Mar 5, 2026 at 11:58 AM Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Mar 04, 2026 at 09:02:29AM +0800, Xuneng Zhou wrote:

Just rebase.

I have applied 0001, that simply moves some code around.

Thanks for applying.

Regarding 0002, I would recommend to not bump the catalog version in
catversion.h when sending a patch to the lists. This task is left to
committers when the code gets merged into the tree. And this is
annoying for submitters because it can create a lot of conflicts.

I'll leave it untouched.

Using a set-returning function is I think wrong here, betraying the
representation of the recovery status as stored in the system. We
know that there is only one state of recovery, fixed in shared memory.
Like the cousins of this new function, let's make thinks non-SRF,
returning one row all the time with PG_RETURN_NULL() if the conditions
for information display are not satisfied. When we are not in
recovery or when the role querying the function is not granted
ROLE_PG_READ_ALL_STATS, that will simplify the patch as there is no
need for the else branch with the nulls, as written in your patch.
The field values are acquired the right way, spinlock acquisitions
have to be short.

My earlier thought for keeping pg_stat_get_recovery as an SRF is to
make pg_stat_recovery simpler by avoiding an extra filter such as
WHERE s.promote_triggered IS NOT NULL to preserve 0/1-row semantics.
pg_stat_get_recovery_prefetch also uses the SRF pattern.

Like pg_stat_wal_receiver, let's add to the view definition a qual to
return a row only if the fields are not NULL.

pg_get_wal_replay_pause_state() displays the pause_state, and it is
not hidden behind the stats read role. I am not really convinced that
this is worth bothering to treat as an exception in this patch. It
makes it a bit more complicated, for little gain. I would just hide
all the fields behind the role granted, to keep the code simpler, even
if that's slightly inconsistent with pg_get_wal_replay_pause_state().

I agree that exposing a subset of columns unconditionally is not worth
the added complexity.

After writing this last point, as far as I can see there is a small
optimization possible in the patch. When a role is not granted
ROLE_PG_READ_ALL_STATS, we know that we will not return any
information so we could skip the spinlock acquisition and avoid
spinlock acquisitions when one queries the function but has no access
to its data.

Makes sense to me. I'll apply it as suggested.

+ True if a promotion has been triggered for this standby server.

Standby servers are implied if data is returned, this sentence can be
simpler and cut the "standby server" part.

+ 1
+       Start LSN of the last WAL record replayed during recovery.
[...]
+       End LSN of the last WAL record replayed during recovery.
[...]
+       Timeline of the last replayed WAL record.
For other system views with LSN information, we don't use "LSN", but
"write-ahead log location".  I'd suggest the same term here.  These
three fields refer to the last record *successfully* replayed.  It
seems important to mention this fact, I guess?
I'll replace them.
+       <structfield>replay_end_lsn</structfield> <type>pg_lsn</type>
+      </para>
+      <para>
+       Current replay position. When replaying a record, this is the end
+       position of the record being replayed; otherwise it equals
+       <structfield>last_replayed_end_lsn</structfield>.
Slightly inexact. This is the end LSN + 1.
Yeh, this needs to be corrected.
+       <structfield>replay_end_lsn</structfield> <type>pg_lsn</type>
[..]
+       <structfield>replay_end_tli</structfield> <type>integer</type>
Okay, I can fall behind the addition of these two, it could be helpful
in debugging something like a locking issue when replaying a specific
record, I guess. At least I'd want to know what is happening for a
record currently being replayed. It seems to me that this could be
more precise, mentioning that these refer to a record *currently*
being replayed.
I will adjust the docs to say these describe the record currently
being replayed, with replay_end_lsn being the end position + 1.

Is recoveryLastXTime actually that relevant to have? We use it for
logging and for recovery target times, which is something less
relevant than the other fields, especially if we think about standbys
where these have no targets to reach.

I agree it is less central for standby monitoring (and partly overlaps
with pg_last_xact_replay_timestamp()), so I’ll remove it from this
view in the next revision.

currentChunkStartTime, on the contrary, is much more relevant to me,
due to the fact that we use it in WaitForWALToBecomeAvailable() with
active WAL receivers running.

Yeah, it could be useful for apply-delay/catch-up diagnostics.
Here is the updated patch set. Please take a look.

--
Best,
Xuneng
<v3-0001-Add-pg_stat_recovery-system-view.patch><v3-0002-Refactor-move-XLogSource-enum-to-xlogrecovery.h.patch><v3-0003-Add-wal_source-column-to-pg_stat_recovery.patch>
0001 has been pushed.

0002 is pure refactoring and only moves the structure XLogSource from .h to .c, thus no comment.

A few comments on 0003.
1
```
+       /*
+        * Source from which the startup process most recently read WAL data.
+        * Updated when the startup process successfully reads WAL from a source.
+        * Note: this reflects the read source, not the original receipt source;
+        * streamed WAL read from local files will show XLOG_FROM_PG_WAL.
+        */
+       XLogSource      lastReadSource;
```
This comment says, "streamed WAL read from local files will show XLOG_FROM_PG_WAL”. But I don’t see how the conversion happens. In WaitForWALToBecomeAvailable(), there is a path XLogFileRead() is called with XLOG_FROM_STREAM, but in XLogFileRead(), source is directly assigned to XLogRecoveryCtl->lastReadSource without conversion.
On the other side, if “stream” is impossible, then the doc should not mention it:
```
+         <para>
+          <literal>stream</literal>: WAL actively being streamed from the
+          upstream server.
+         </para>
```

you're right, the comment is wrong. There is no conversion. In
WaitForWALToBecomeAvailable(), when the walreceiver has flushed data
ahead of the startup process, the code deliberately opens the segment
file with XLOG_FROM_STREAM , and the existing comment explains the
intent:

* Use XLOG_FROM_STREAM so that source info is set correctly
* and XLogReceiptTime isn't changed.

XLogFileRead() then assigns that directly to lastReadSource without
conversion. So XLOG_FROM_STREAM is a valid observable value — it means
the segment was opened during active walreceiver streaming, even
though the actual I/O is against a local file in pg_wal. I think
stream documentation entry should stay.

2 Related to 1. WRT the new column name wal_source, it sounds like “where WAL is coming from”. Say the comment of lastReadSource is true, WAL from stream is shown as pg_wal, “wal source” sounds more like stream, and “wal read source” is pg_wal. From this perspective, I just feel the new column name should be something like “wal_read_source”.

agreed, wal_read_source seems to be a better name. It more closely
mirrors the underlying lastReadSource field and avoids ambiguity about
whether the column describes the delivery mechanism or the read path.

Please check the updated patches.

--
Best,
Xuneng

#17

xunengzhou@gmail.com

about 2 months ago

In reply to: Tom Lane (#15)

Re: Add pg_stat_recovery system view

Hi Tom,

On Fri, Mar 6, 2026 at 11:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Chao Li <li.evan.chao@gmail.com> writes:

I have one small additional comment on pushed 0001.
```
if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
elog(ERROR, "return type must be a row type");
```

This uses elog(ERROR), while the other functions in the same file use ereport(ERROR). I think ereport is generally preferred nowadays over elog.

No: you are incorrect and this snippet is perfectly normal (in fact,
probably copied-and-pasted from one of many other occurrences).
The actual coding rule is basically "use ereport() for user-facing
errors and elog() for not-supposed-to-happen errors". What we're
after is to not expend translator effort on not-supposed-to-happen
error messages. While you can build a ereport call that's not
translated, elog() is a lower-notation way to get the same result.
See [1], particularly the elog() discussion near the end of the
page.

I've not read the patch so I don't know if it made sane ereport-vs-
elog choices elsewhere, but this one is fine.

regards, tom lane

[1] https://www.postgresql.org/docs/devel/error-message-reporting.html

Thanks for your clarification!

--
Best,
Xuneng

#18

li.evan.chao@gmail.com

about 2 months ago

In reply to: Tom Lane (#15)

Re: Add pg_stat_recovery system view

On Mar 6, 2026, at 23:45, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Chao Li <li.evan.chao@gmail.com> writes:

I have one small additional comment on pushed 0001.
```
if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
elog(ERROR, "return type must be a row type");
```

This uses elog(ERROR), while the other functions in the same file use ereport(ERROR). I think ereport is generally preferred nowadays over elog.

No: you are incorrect and this snippet is perfectly normal (in fact,
probably copied-and-pasted from one of many other occurrences).
The actual coding rule is basically "use ereport() for user-facing
errors and elog() for not-supposed-to-happen errors". What we're
after is to not expend translator effort on not-supposed-to-happen
error messages. While you can build a ereport call that's not
translated, elog() is a lower-notation way to get the same result.
See [1], particularly the elog() discussion near the end of the
page.

I've not read the patch so I don't know if it made sane ereport-vs-
elog choices elsewhere, but this one is fine.

regards, tom lane

[1] https://www.postgresql.org/docs/devel/error-message-reporting.html

Hi Tom, Thanks for the detailed explanation. Noted.

Would it make sense to mention this distinction in the header comments for ereport and elog in the code? For these logging APIs, I think developers are often more likely to read the code comments than the documentation.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

#19

michael@paquier.xyz

about 2 months ago

In reply to: Chao Li (#18)

Re: Add pg_stat_recovery system view

On Mon, Mar 09, 2026 at 01:20:45PM +0800, Chao Li wrote:

[1] https://www.postgresql.org/docs/devel/error-message-reporting.html

Would it make sense to mention this distinction in the header
comments for ereport and elog in the code? For these logging APIs, I
think developers are often more likely to read the code comments
than the documentation.

As of the link already mentioned:
"Therefore, elog should be used only for internal errors .."
--
Michael

#20