Report checkpoint progress in server logs
Hi,
At times, some of the checkpoint operations such as removing old WAL
files, dealing with replication snapshot or mapping files etc. may
take a while, during which the server doesn't emit any logs or other
information; the only logs emitted are LogCheckpointStart and
LogCheckpointEnd. Often this isn't a problem if the checkpoint is
quick, but there can be extreme situations which require users
to know what's going on with the current checkpoint.
Given that the commit 9ce346ea [1] introduced a nice mechanism to
report the long running operations of the startup process in the
server logs, I'm thinking we can have a similar progress mechanism for
the checkpoint as well. There's another idea suggested in a couple of
other threads to have a pg_stat_progress_checkpoint similar to
pg_stat_progress_analyze/vacuum/etc. But the problem with this idea is
that during end-of-recovery or shutdown checkpoints, the
pg_stat_progress_checkpoint view isn't accessible, as it requires a
connection to the server, which isn't allowed.
Therefore, reporting the checkpoint progress in the server logs, much
like [1], seems to be the best way IMO. We can 1) either make
ereport_startup_progress and log_startup_progress_interval more
generic (something like ereport_log_progress and
log_progress_interval), move the code to elog.c, use it for
checkpoint progress and if required for other time-consuming
operations 2) or have an entirely different GUC and API for checkpoint
progress.
IMO, option (1) i.e. ereport_log_progress and log_progress_interval
(better names are welcome) seems a better idea.
Thoughts?
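If we do generalize ereport_startup_progress / log_startup_progress_interval, the heart of that mechanism is interval-based throttling: a long-running loop emits a progress line only when the elapsed time crosses the next multiple of the configured interval. A minimal, self-contained sketch of that throttling logic (illustrative names only, not the actual PostgreSQL API):

```c
#include <stdbool.h>

/*
 * Sketch of interval-based progress throttling: return true only when
 * elapsed_ms has crossed the next reporting boundary, and advance the
 * boundary past the current elapsed time.  A non-positive interval
 * disables reporting entirely (mirroring a GUC value of 0/-1).
 * Names are hypothetical, not the real PostgreSQL functions.
 */
static bool
progress_report_due(long elapsed_ms, long interval_ms, long *next_due_ms)
{
    if (interval_ms <= 0)        /* feature disabled */
        return false;
    if (elapsed_ms < *next_due_ms)
        return false;
    /* Schedule the next report at the following interval boundary. */
    while (*next_due_ms <= elapsed_ms)
        *next_due_ms += interval_ms;
    return true;
}
```

The caller would invoke this on each iteration of a long loop (removing WAL files, syncing buffers, etc.) and ereport() only when it returns true, so a fast checkpoint produces no extra log lines at all.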
[1]: commit 9ce346eabf350a130bba46be3f8c50ba28506969
Author: Robert Haas <rhaas@postgresql.org>
Date: Mon Oct 25 11:51:57 2021 -0400
Report progress of startup operations that take a long time.
Regards,
Bharath Rupireddy.
On Wed, Dec 29, 2021 at 3:31 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
I find progress reporting in the logfile to generally be a terrible
way of doing things, and the fact that we do it for the startup
process is/should be only because we have no other choice, not because
it's the right choice.
I think the right choice to solve the *general* problem is the
mentioned pg_stat_progress_checkpoints.
We may want to *additionally* have the ability to log the progress
specifically for the special cases when we're not able to use that
view. And in those cases, we can perhaps just use the existing
log_startup_progress_interval parameter for this as well -- at least
for the startup checkpoint.
--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
Magnus Hagander <magnus@hagander.net> writes:
Therefore, reporting the checkpoint progress in the server logs, much
like [1], seems to be the best way IMO.
I find progress reporting in the logfile to generally be a terrible
way of doing things, and the fact that we do it for the startup
process is/should be only because we have no other choice, not because
it's the right choice.
I'm already pretty seriously unhappy about the log-spamming effects of
64da07c41 (default to log_checkpoints=on), and am willing to lay a side
bet that that gets reverted after we have some field experience with it.
This proposal seems far worse from that standpoint. Keep in mind that
our out-of-the-box logging configuration still doesn't have any log
rotation ability, which means that the noisier the server is in normal
operation, the sooner you fill your disk.
I think the right choice to solve the *general* problem is the
mentioned pg_stat_progress_checkpoints.
+1
regards, tom lane
Coincidentally, I was thinking about the same yesterday after getting
tired of waiting for a checkpoint to complete on a server.
On Wed, Dec 29, 2021 at 7:41 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
The server is not open for queries while running the end-of-recovery
checkpoint, and a catalog view may not help here, but a process title
change or logging would be helpful in such cases. When the server is
running recovery, anxious customers ask several times for the ETA of
recovery completion, and not having visibility into these operations makes
life difficult for the DBA/operations team.
I think the right choice to solve the *general* problem is the
mentioned pg_stat_progress_checkpoints.
+1
+1 to this. We need at least a trace of the number of buffers to sync
(num_to_scan) before the checkpoint starts, instead of just emitting the
stats at the end.
Bharath, it would be good to show the buffers-synced counter and the total
buffers to sync, the checkpointer pid, the substep it is running, whether
it is on target for completion, and the checkpoint reason
(manual/timed/forced). BufferSync has several variables tracking the sync
progress locally, and we may need some refactoring here.
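As a rough illustration of the counters being discussed (not the actual BufferSync code): every buffer scanned advances a "processed" counter, while a "written" counter advances only for buffers that still need flushing when the checkpointer reaches them, so written <= processed <= total always holds:

```c
#include <stdbool.h>

/* Hypothetical counters mirroring what the progress view would expose. */
struct ckpt_progress
{
    int total;       /* num_to_scan, estimated up front */
    int processed;   /* buffers examined so far */
    int written;     /* buffers actually flushed to disk */
};

/*
 * Simulated buffer-sync loop: "processed" advances for every buffer,
 * "written" only for buffers still dirty at sync time (a buffer can be
 * cleaned by a backend before the checkpointer reaches it).
 */
static void
sync_buffers(const bool *dirty, int nbuffers, struct ckpt_progress *p)
{
    p->total = nbuffers;
    p->processed = 0;
    p->written = 0;
    for (int i = 0; i < nbuffers; i++)
    {
        p->processed++;          /* progress view updated here */
        if (dirty[i])
            p->written++;        /* and here, only on a real write */
    }
}
```

This is the invariant behind exposing both counters separately: processed shows scan progress, written shows actual I/O.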
On Wed, Dec 29, 2021 at 10:40:59AM -0500, Tom Lane wrote:
Magnus Hagander <magnus@hagander.net> writes:
I think the right choice to solve the *general* problem is the
mentioned pg_stat_progress_checkpoints.
+1
Agreed. I don't see why this would not work as there are
PgBackendStatus entries for each auxiliary process.
--
Michael
On Wed, Dec 29, 2021 at 10:40:59AM -0500, Tom Lane wrote:
Magnus Hagander <magnus@hagander.net> writes:
I think we are looking at three potential observable behaviors people
might care about:
* the current activity/progress of checkpoints
* the historical reporting of checkpoint completion, mixed in with other
log messages for later analysis
* the aggregate behavior of checkpoint operations
I think it is clear that checkpoint progress activity isn't useful for
the server logs because that information has little historical value,
but does fit for a progress view. As Tom already expressed, we will
have to wait to see if non-progress checkpoint information in the logs
has sufficient historical value.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
If only the physical world exists, free will is an illusion.
I think the right choice to solve the *general* problem is the
mentioned pg_stat_progress_checkpoints.
We may want to *additionally* have the ability to log the progress
specifically for the special cases when we're not able to use that
view. And in those case, we can perhaps just use the existing
log_startup_progress_interval parameter for this as well -- at least
for the startup checkpoint.
+1
We need at least a trace of the number of buffers to sync (num_to_scan) before the checkpoint start, instead of just emitting the stats at the end.
Bharat, it would be good to show the buffers synced counter and the total buffers to sync, checkpointer pid, substep it is running, whether it is on target for completion, checkpoint_Reason
(manual/times/forced). BufferSync has several variables tracking the sync progress locally, and we may need some refactoring here.
I agree to provide the above-mentioned information as part of showing the
progress of the current checkpoint operation. I am currently looking into
the code to see if any other information can be added.
Thanks & Regards,
Nitin Jadhav
On Fri, Jan 21, 2022 at 11:07 AM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
As suggested in the other thread by Julien, I'm changing the subject
of this thread to reflect the discussion.
Regards,
Bharath Rupireddy.
Here is the initial patch to show the progress of a checkpoint through
the pg_stat_progress_checkpoint view. Please find the attachment.
The information added to this view is: pid - process ID of the
checkpointer process; kind - the kind of checkpoint, indicating the
reason for the checkpoint (values can be wal, time or force); phase -
the current phase of the checkpoint operation; total_buffer_writes -
total number of buffers to be written; buffers_processed - number of
buffers processed; buffers_written - number of buffers written;
total_file_syncs - total number of files to be synced; files_synced -
number of files synced.
Many operations happen as part of a checkpoint. For each of these
operations I am updating the phase field of the
pg_stat_progress_checkpoint view. The values supported for this field
are initializing, checkpointing replication slots, checkpointing
snapshots, checkpointing logical rewrite mappings, checkpointing CLOG
pages, checkpointing CommitTs pages, checkpointing SUBTRANS pages,
checkpointing MULTIXACT pages, checkpointing SLRU pages, checkpointing
buffers, performing sync requests, performing two phase checkpoint,
recycling old XLOG files and finalizing. In the checkpointing buffers
phase, the fields total_buffer_writes, buffers_processed and
buffers_written show the detailed progress of writing buffers. In the
performing sync requests phase, the fields total_file_syncs and
files_synced show the detailed progress of syncing files. In the other
phases, only the phase field is updated; it is difficult to show
detailed progress there because we cannot get the total file count
without traversing the directory, and it is not worth calculating
that, as it would affect the performance of the checkpoint. I also
considered just reporting the number of files processed, but that
won't give meaningful progress information (it is more of a
statistic). Hence only the phase field is updated in those scenarios.
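For consumers of the view, each counter pair (buffers_processed vs. total_buffer_writes, files_synced vs. total_file_syncs) is enough to derive a completion percentage. A hypothetical helper a monitoring tool might mirror (names are illustrative, not part of the patch):

```c
/*
 * Completion percentage from a processed/total counter pair taken from
 * the progress view.  A zero total (nothing scheduled for this phase)
 * is treated as fully done to avoid division by zero.
 */
static int
checkpoint_percent_done(long processed, long total)
{
    if (total <= 0)
        return 100;              /* no work scheduled counts as done */
    return (int) ((processed * 100) / total);
}
```

The same guard matters for the view itself: total_buffer_writes can legitimately be zero when no buffers are dirty.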
Apart from the above fields, I am planning to add a few more fields to
the view in the next patch: the process ID of the backend that
triggered a CHECKPOINT command, the checkpoint start location, a field
to indicate whether it is a checkpoint or a restartpoint, and the
elapsed time of the checkpoint operation. Please share your thoughts.
I would be happy to add any other information that contributes to
showing the progress of a checkpoint.
As per the discussion in this thread, there should be some mechanism
to show the progress of a checkpoint during the shutdown and
end-of-recovery cases, as we cannot access pg_stat_progress_checkpoint
then. I am working on using the log_startup_progress_interval
mechanism to log the progress in the server logs for those cases.
Kindly review the patch and share your thoughts.
On Fri, Jan 28, 2022 at 12:24 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
Attachments:
0001-pg_stat_progress_checkpoint-view.patch
From 5b68df74d68922685dff4eb665a6431c5046f848 Mon Sep 17 00:00:00 2001
From: Nitin Jadhav <nitinjadhav@microsoft.com>
Date: Wed, 9 Feb 2022 12:00:11 +0000
Subject: [PATCH] pg_stat_progress_checkpoint view
---
doc/src/sgml/monitoring.sgml | 263 +++++++++++++++++++++++++++
src/backend/access/transam/xlog.c | 123 ++++++++++++-
src/backend/catalog/system_views.sql | 29 +++
src/backend/storage/buffer/bufmgr.c | 10 +
src/backend/storage/sync/sync.c | 10 +-
src/backend/utils/adt/pgstatfuncs.c | 2 +
src/include/access/xlog.h | 4 +
src/include/commands/progress.h | 31 ++++
src/include/storage/sync.h | 2 +-
src/include/utils/backend_progress.h | 3 +-
10 files changed, 473 insertions(+), 4 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 62f2a3332b..a6893d4543 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -401,6 +401,13 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
See <xref linkend='copy-progress-reporting'/>.
</entry>
</row>
+
+ <row>
+ <entry><structname>pg_stat_progress_checkpoint</structname><indexterm><primary>pg_stat_progress_checkpoint</primary></indexterm></entry>
+ <entry>One row only, showing the progress of a <command>CHECKPOINT</command> operation.
+ See <xref linkend='checkpoint-progress-reporting'/>.
+ </entry>
+ </row>
</tbody>
</tgroup>
</table>
@@ -6886,6 +6893,262 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
</table>
</sect2>
+ <sect2 id="checkpoint-progress-reporting">
+ <title>CHECKPOINT Progress Reporting</title>
+
+ <indexterm>
+ <primary>pg_stat_progress_checkpoint</primary>
+ </indexterm>
+
+ <para>
+ Whenever a checkpoint operation is running, the
+ <structname>pg_stat_progress_checkpoint</structname> view will contain a
+ single row indicating the progress of the checkpoint operation. The tables
+ below describe the information that will be reported and provide
+ information about how to interpret it.
+ </para>
+
+ <table id="pg-stat-progress-checkpoint-view" xreflabel="pg_stat_progress_checkpoint">
+ <title><structname>pg_stat_progress_checkpoint</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>pid</structfield> <type>integer</type>
+ </para>
+ <para>
+ Process ID of the checkpointer process.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>kind</structfield> <type>text</type>
+ </para>
+ <para>
+ Kind of checkpoint. See <xref linkend="checkpoint-kinds"/>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>phase</structfield> <type>text</type>
+ </para>
+ <para>
+ Current processing phase. See <xref linkend="checkpoint-phases"/>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>total_buffer_writes</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Total number of buffers to be written. This is estimated and reported
+ as of the beginning of the buffer write operation.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>buffers_processed</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers processed. This counter advances each time a
+ buffer is processed. This number will eventually become equal to
+ <literal>total_buffer_writes</literal> when the checkpoint is
+ complete.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>buffers_written</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers written. This counter only advances when the targeted
+ buffer is written. Note that some buffers may be processed but not
+ need to be written. So this count will always be less than or
+ equal to <literal>total_buffer_writes</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>total_file_syncs</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Total number of files to be synced. This is estimated and reported as of
+ the beginning of the sync operation.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>files_synced</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of files synced. This counter advances when the targeted file is
+ synced. This number will eventually become equal to
+ <literal>total_file_syncs</literal> when the checkpoint is complete.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <table id="checkpoint-kinds">
+ <title>CHECKPOINT kinds</title>
+ <tgroup cols="2">
+ <colspec colname="col1" colwidth="1*"/>
+ <colspec colname="col2" colwidth="2*"/>
+ <thead>
+ <row>
+ <entry>Kind</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry><literal>wal</literal></entry>
+ <entry>
+ The checkpoint operation is requested due to XLOG filling.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>time</literal></entry>
+ <entry>
+ The checkpoint operation is requested due to timeout.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>force</literal></entry>
+ <entry>
+ The checkpoint operation is forced even if no XLOG activity has occurred
+ since the last one.
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <table id="checkpoint-phases">
+ <title>CHECKPOINT phases</title>
+ <tgroup cols="2">
+ <colspec colname="col1" colwidth="1*"/>
+ <colspec colname="col2" colwidth="2*"/>
+ <thead>
+ <row>
+ <entry>Phase</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry><literal>initializing</literal></entry>
+ <entry>
+ The CHECKPOINTER process is preparing to begin the checkpoint operation.
+ This phase is expected to be very brief.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing replication slots</literal></entry>
+ <entry>
+ The CHECKPOINTER process is currently flushing all the replication slots to
+ disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing snapshots</literal></entry>
+ <entry>
+ The CHECKPOINTER process is currently removing all the serialized
+ snapshots that are not required anymore.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing logical rewrite mappings</literal></entry>
+ <entry>
+ The CHECKPOINTER process is currently removing/flushing the logical
+ rewrite mappings.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing CLOG pages</literal></entry>
+ <entry>
+ The CHECKPOINTER process is currently writing CLOG pages to disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing CommitTs pages</literal></entry>
+ <entry>
+ The CHECKPOINTER process is currently writing CommitTs pages to disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing SUBTRANS pages</literal></entry>
+ <entry>
+ The CHECKPOINTER process is currently writing SUBTRANS pages to disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing MULTIXACT pages</literal></entry>
+ <entry>
+ The CHECKPOINTER process is currently writing MULTIXACT pages to disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing SLRU pages</literal></entry>
+ <entry>
+ The CHECKPOINTER process is currently writing SLRU pages to disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing buffers</literal></entry>
+ <entry>
+ The CHECKPOINTER process is currently writing buffers to disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>performing sync requests</literal></entry>
+ <entry>
+ The CHECKPOINTER process is currently performing sync requests.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>performing two phase checkpoint</literal></entry>
+ <entry>
+ The CHECKPOINTER process is currently performing the two-phase checkpoint.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>recycling old XLOG files</literal></entry>
+ <entry>
+ The CHECKPOINTER process is currently recycling old XLOG files.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>finalizing</literal></entry>
+ <entry>
+ The CHECKPOINTER process is finalizing the checkpoint operation.
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ </sect2>
+
</sect1>
<sect1 id="dynamic-trace">
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 958220c495..df568eecd8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9089,6 +9089,9 @@ CreateCheckPoint(int flags)
if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
elog(ERROR, "can't create a checkpoint during recovery");
+ /* Prepare to report progress of the checkpoint. */
+ checkpoint_progress_start(flags);
+
/*
* Prepare to accumulate statistics.
*
@@ -9432,8 +9435,12 @@ CreateCheckPoint(int flags)
KeepLogSeg(recptr, &_logSegNo);
}
_logSegNo--;
+ checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_OLD_XLOG_RECYCLE);
RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr,
checkPoint.ThisTimeLineID);
+ checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_FINALIZE);
/*
* Make more log segments if needed. (Do this after recycling old log
@@ -9455,6 +9462,9 @@ CreateCheckPoint(int flags)
/* Real work is done; log and update stats. */
LogCheckpointEnd(false);
+ /* Stop reporting progress of the checkpoint. */
+ checkpoint_progress_end(flags);
+
/* Reset the process title */
update_checkpoint_display(flags, false, true);
@@ -9568,29 +9578,60 @@ static void
CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
CheckPointRelationMap();
+
+ checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_REPLI_SLOTS);
CheckPointReplicationSlots();
+
+ checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_SNAPSHOTS);
CheckPointSnapBuild();
+
+ checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_LOGICAL_REWRITE_MAPPINGS);
CheckPointLogicalRewriteHeap();
CheckPointReplicationOrigin();
/* Write out all dirty data in SLRUs and the main buffer pool */
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
+
+ checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_CLOG_PAGES);
CheckPointCLOG();
+
+ checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_COMMITTS_PAGES);
CheckPointCommitTs();
+
+ checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_SUBTRANS_PAGES);
CheckPointSUBTRANS();
+
+ checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_MULTIXACT_PAGES);
CheckPointMultiXact();
+
+ checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_SLRU_PAGES);
CheckPointPredicate();
+
+ checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_BUFFERS);
CheckPointBuffers(flags);
/* Perform all queued up fsyncs */
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
- ProcessSyncRequests();
+ checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_FILE_SYNC);
+ ProcessSyncRequests(flags);
CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();
/* We deliberately delay 2PC checkpointing as long as possible */
+ checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_TWO_PHASE);
CheckPointTwoPhase(checkPointRedo);
}
@@ -9727,6 +9768,9 @@ CreateRestartPoint(int flags)
XLogCtl->RedoRecPtr = lastCheckPoint.redo;
SpinLockRelease(&XLogCtl->info_lck);
+ /* Prepare to report progress of the checkpoint. */
+ checkpoint_progress_start(flags);
+
/*
* Prepare to accumulate statistics.
*
@@ -9837,7 +9881,11 @@ CreateRestartPoint(int flags)
if (!RecoveryInProgress())
replayTLI = XLogCtl->InsertTimeLineID;
+ checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_OLD_XLOG_RECYCLE);
RemoveOldXlogFiles(_logSegNo, RedoRecPtr, endptr, replayTLI);
+ checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_FINALIZE);
/*
* Make more log segments if needed. (Do this after recycling old log
@@ -9858,6 +9906,9 @@ CreateRestartPoint(int flags)
/* Real work is done; log and update stats. */
LogCheckpointEnd(true);
+ /* Stop reporting progress of the checkpoint. */
+ checkpoint_progress_end(flags);
+
/* Reset the process title */
update_checkpoint_display(flags, true, true);
@@ -13242,3 +13293,73 @@ XLogRequestWalReceiverReply(void)
{
doRequestWalReceiverReply = true;
}
+
+/*
+ * Start reporting progress of the checkpoint.
+ */
+void
+checkpoint_progress_start(int flags)
+{
+ /* In bootstrap mode, we don't actually record anything. */
+ if (IsBootstrapProcessingMode())
+ return;
+
+ /*
+ * Cannot access pg_stat_progress_checkpoint view in case of checkpoint
+ * during shutdown and end-of-recovery.
+ */
+ if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY)) == 0)
+ {
+ pgstat_progress_start_command(PROGRESS_COMMAND_CHECKPOINT, InvalidOid);
+ checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_INIT);
+ if (flags & CHECKPOINT_CAUSE_XLOG)
+ checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND,
+ PROGRESS_CHECKPOINT_KIND_WAL);
+ else if (flags & CHECKPOINT_CAUSE_TIME)
+ checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND,
+ PROGRESS_CHECKPOINT_KIND_TIME);
+ else if (flags & CHECKPOINT_FORCE)
+ checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND,
+ PROGRESS_CHECKPOINT_KIND_FORCE);
+ else
+ checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND,
+ PROGRESS_CHECKPOINT_KIND_UNKNOWN);
+ }
+}
+
+/*
+ * Update index'th member in st_progress_param[] array with the latest value.
+ */
+void
+checkpoint_progress_update_param(int flags, int index, int64 val)
+{
+ /* In bootstrap mode, we don't actually record anything. */
+ if (IsBootstrapProcessingMode())
+ return;
+
+ /*
+ * Cannot access pg_stat_progress_checkpoint view in case of checkpoint
+ * during shutdown and end-of-recovery.
+ */
+ if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY)) == 0)
+ pgstat_progress_update_param(index, val);
+}
+
+/*
+ * Stop reporting progress of the checkpoint.
+ */
+void
+checkpoint_progress_end(int flags)
+{
+ /* In bootstrap mode, we don't actually record anything. */
+ if (IsBootstrapProcessingMode())
+ return;
+
+ /*
+ * Cannot access pg_stat_progress_checkpoint view in case of checkpoint
+ * during shutdown and end-of-recovery.
+ */
+ if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY)) == 0)
+ pgstat_progress_end_command();
+}
\ No newline at end of file
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 3cb69b1f87..6a90d63fff 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1286,3 +1286,32 @@ CREATE VIEW pg_stat_subscription_workers AS
FROM pg_subscription_rel) sr,
LATERAL pg_stat_get_subscription_worker(sr.subid, sr.relid) w
JOIN pg_subscription s ON (w.subid = s.oid);
+
+CREATE VIEW pg_stat_progress_checkpoint AS
+ SELECT
+ S.pid AS pid,
+ CASE S.param1 WHEN 0 THEN 'wal'
+ WHEN 1 THEN 'time'
+ WHEN 2 THEN 'force'
+ END AS kind,
+ CASE S.param2 WHEN 0 THEN 'initializing'
+ WHEN 1 THEN 'checkpointing replication slots'
+ WHEN 2 THEN 'checkpointing snapshots'
+ WHEN 3 THEN 'checkpointing logical rewrite mappings'
+ WHEN 4 THEN 'checkpointing CLOG pages'
+ WHEN 5 THEN 'checkpointing CommitTs pages'
+ WHEN 6 THEN 'checkpointing SUBTRANS pages'
+ WHEN 7 THEN 'checkpointing MULTIXACT pages'
+ WHEN 8 THEN 'checkpointing SLRU pages'
+ WHEN 9 THEN 'checkpointing buffers'
+ WHEN 10 THEN 'performing sync requests'
+ WHEN 11 THEN 'performing two phase checkpoint'
+ WHEN 12 THEN 'recycling old XLOG files'
+ WHEN 13 THEN 'finalizing'
+ END AS phase,
+ S.param3 AS total_buffer_writes,
+ S.param4 AS buffers_processed,
+ S.param5 AS buffers_written,
+ S.param6 AS total_file_syncs,
+ S.param7 AS files_synced
+ FROM pg_stat_get_progress_info('CHECKPOINT') AS S;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f5459c68f8..cf0ad299f3 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -38,6 +38,7 @@
#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/storage.h"
+#include "commands/progress.h"
#include "executor/instrument.h"
#include "lib/binaryheap.h"
#include "miscadmin.h"
@@ -2012,6 +2013,9 @@ BufferSync(int flags)
WritebackContextInit(&wb_context, &checkpoint_flush_after);
TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_scan);
+ checkpoint_progress_update_param(flags,
+ PROGRESS_CHECKPOINT_TOTAL_BUFFER_WRITES,
+ num_to_scan);
/*
* Sort buffers that need to be written to reduce the likelihood of random
@@ -2129,6 +2133,9 @@ BufferSync(int flags)
bufHdr = GetBufferDescriptor(buf_id);
num_processed++;
+ checkpoint_progress_update_param(flags,
+ PROGRESS_CHECKPOINT_BUFFERS_PROCESSED,
+ num_processed);
/*
* We don't need to acquire the lock here, because we're only looking
@@ -2149,6 +2156,9 @@ BufferSync(int flags)
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
PendingCheckpointerStats.m_buf_written_checkpoints++;
num_written++;
+ checkpoint_progress_update_param(flags,
+ PROGRESS_CHECKPOINT_BUFFERS_WRITTEN,
+ num_written);
}
}
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 543f691f2d..b8f6aebb7c 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -23,6 +23,7 @@
#include "access/multixact.h"
#include "access/xlog.h"
#include "access/xlogutils.h"
+#include "commands/progress.h"
#include "commands/tablespace.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -277,7 +278,7 @@ SyncPostCheckpoint(void)
* ProcessSyncRequests() -- Process queued fsync requests.
*/
void
-ProcessSyncRequests(void)
+ProcessSyncRequests(int flags)
{
static bool sync_in_progress = false;
@@ -355,6 +356,10 @@ ProcessSyncRequests(void)
/* Now scan the hashtable for fsync requests to process */
absorb_counter = FSYNCS_PER_ABSORB;
hash_seq_init(&hstat, pendingOps);
+ checkpoint_progress_update_param(flags,
+ PROGRESS_CHECKPOINT_TOTAL_FILE_SYNCS,
+ hash_get_num_entries(pendingOps));
+
while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
{
int failures;
@@ -418,6 +423,9 @@ ProcessSyncRequests(void)
longest = elapsed;
total_elapsed += elapsed;
processed++;
+ checkpoint_progress_update_param(flags,
+ PROGRESS_CHECKPOINT_FILES_SYNCED,
+ processed);
if (log_checkpoints)
elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f ms",
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 15cb17ace4..7438e0ce84 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -494,6 +494,8 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
cmdtype = PROGRESS_COMMAND_BASEBACKUP;
else if (pg_strcasecmp(cmd, "COPY") == 0)
cmdtype = PROGRESS_COMMAND_COPY;
+ else if (pg_strcasecmp(cmd, "CHECKPOINT") == 0)
+ cmdtype = PROGRESS_COMMAND_CHECKPOINT;
else
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index a4b1c1286f..58c547b2d5 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -353,6 +353,10 @@ extern void do_pg_abort_backup(int code, Datum arg);
extern void register_persistent_abort_backup_handler(void);
extern SessionBackupState get_backup_status(void);
+extern void checkpoint_progress_start(int flags);
+extern void checkpoint_progress_update_param(int flags, int index, int64 val);
+extern void checkpoint_progress_end(int flags);
+
/* File path names (all relative to $PGDATA) */
#define RECOVERY_SIGNAL_FILE "recovery.signal"
#define STANDBY_SIGNAL_FILE "standby.signal"
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index a28938caf4..e1c574d053 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -151,4 +151,35 @@
#define PROGRESS_COPY_TYPE_PIPE 3
#define PROGRESS_COPY_TYPE_CALLBACK 4
+/* Progress parameters for checkpoint */
+#define PROGRESS_CHECKPOINT_KIND 0
+#define PROGRESS_CHECKPOINT_PHASE 1
+#define PROGRESS_CHECKPOINT_TOTAL_BUFFER_WRITES 2
+#define PROGRESS_CHECKPOINT_BUFFERS_PROCESSED 3
+#define PROGRESS_CHECKPOINT_BUFFERS_WRITTEN 4
+#define PROGRESS_CHECKPOINT_TOTAL_FILE_SYNCS 5
+#define PROGRESS_CHECKPOINT_FILES_SYNCED 6
+
+/* Kinds of checkpoint (as advertised via PROGRESS_CHECKPOINT_KIND) */
+#define PROGRESS_CHECKPOINT_KIND_WAL 0
+#define PROGRESS_CHECKPOINT_KIND_TIME 1
+#define PROGRESS_CHECKPOINT_KIND_FORCE 2
+#define PROGRESS_CHECKPOINT_KIND_UNKNOWN 3
+
+/* Phases of checkpoint (as advertised via PROGRESS_CHECKPOINT_PHASE) */
+#define PROGRESS_CHECKPOINT_PHASE_INIT 0
+#define PROGRESS_CHECKPOINT_PHASE_REPLI_SLOTS 1
+#define PROGRESS_CHECKPOINT_PHASE_SNAPSHOTS 2
+#define PROGRESS_CHECKPOINT_PHASE_LOGICAL_REWRITE_MAPPINGS 3
+#define PROGRESS_CHECKPOINT_PHASE_CLOG_PAGES 4
+#define PROGRESS_CHECKPOINT_PHASE_COMMITTS_PAGES 5
+#define PROGRESS_CHECKPOINT_PHASE_SUBTRANS_PAGES 6
+#define PROGRESS_CHECKPOINT_PHASE_MULTIXACT_PAGES 7
+#define PROGRESS_CHECKPOINT_PHASE_SLRU_PAGES 8
+#define PROGRESS_CHECKPOINT_PHASE_BUFFERS 9
+#define PROGRESS_CHECKPOINT_PHASE_FILE_SYNC 10
+#define PROGRESS_CHECKPOINT_PHASE_TWO_PHASE 11
+#define PROGRESS_CHECKPOINT_PHASE_OLD_XLOG_RECYCLE 12
+#define PROGRESS_CHECKPOINT_PHASE_FINALIZE 13
+
#endif
diff --git a/src/include/storage/sync.h b/src/include/storage/sync.h
index 9737e1eb67..fed52efa30 100644
--- a/src/include/storage/sync.h
+++ b/src/include/storage/sync.h
@@ -58,7 +58,7 @@ typedef struct FileTag
extern void InitSync(void);
extern void SyncPreCheckpoint(void);
extern void SyncPostCheckpoint(void);
-extern void ProcessSyncRequests(void);
+extern void ProcessSyncRequests(int flags);
extern void RememberSyncRequest(const FileTag *ftag, SyncRequestType type);
extern bool RegisterSyncRequest(const FileTag *ftag, SyncRequestType type,
bool retryOnError);
diff --git a/src/include/utils/backend_progress.h b/src/include/utils/backend_progress.h
index 47bf8029b0..02d51fb948 100644
--- a/src/include/utils/backend_progress.h
+++ b/src/include/utils/backend_progress.h
@@ -27,7 +27,8 @@ typedef enum ProgressCommandType
PROGRESS_COMMAND_CLUSTER,
PROGRESS_COMMAND_CREATE_INDEX,
PROGRESS_COMMAND_BASEBACKUP,
- PROGRESS_COMMAND_COPY
+ PROGRESS_COMMAND_COPY,
+ PROGRESS_COMMAND_CHECKPOINT
} ProgressCommandType;
#define PGSTAT_NUM_PROGRESS_PARAM 20
--
2.25.1
Apart from the above fields, I am planning to add a few more fields to
the view in the next patch: the process ID of the backend that
triggered a CHECKPOINT command, the checkpoint start location, a field
to indicate whether it is a checkpoint or a restartpoint, and the
elapsed time of the checkpoint operation. Please share your thoughts. I
would be happy to add any other information that contributes to showing
the progress of the checkpoint.
The progress reporting mechanism of postgres uses the
'st_progress_param' array of the 'PgBackendStatus' structure to hold
the information related to the progress. The function
'pgstat_progress_update_param()' takes 'index' and 'val' as arguments
and stores 'val' at the corresponding 'index' in the
'st_progress_param' array. This mechanism works fine as long as all the
progress information is of integer type, since 'st_progress_param' is
an array of integers. If the progress data is of a type other than
integer, there is no easy way to report it. In my understanding, we
would have to define a new structure with the additional fields, add it
to the 'PgBackendStatus' structure, and provide the necessary functions
to update and fetch the data from it. This becomes very ugly, as it
would not match the existing mechanism of progress reporting. Kindly
let me know if there is a better way to handle this. If there are any
changes to the existing mechanism that would make it generic enough to
support basic data types, I would like to discuss them in a new thread.
On Thu, Feb 10, 2022 at 12:22 PM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
> We need at least a trace of the number of buffers to sync (num_to_scan)
> before the checkpoint start, instead of just emitting the stats at the
> end.
>
> Bharat, it would be good to show the buffers synced counter and the
> total buffers to sync, checkpointer pid, substep it is running, whether
> it is on target for completion, checkpoint_Reason (manual/times/forced).
> BufferSync has several variables tracking the sync progress locally,
> and we may need some refactoring here.

I agree to provide the above mentioned information as part of showing
the progress of the current checkpoint operation. I am currently
looking into the code to see if any other information can be added.

Here is the initial patch to show the progress of a checkpoint through
the pg_stat_progress_checkpoint view. Please find the attachment.

The information added to this view is: pid - process ID of the
CHECKPOINTER process; kind - the kind of checkpoint, indicating the
reason for the checkpoint (values can be wal, time or force); phase -
the current phase of the checkpoint operation; total_buffer_writes -
total number of buffers to be written; buffers_processed - number of
buffers processed; buffers_written - number of buffers written;
total_file_syncs - total number of files to be synced; files_synced -
number of files synced.

There are many operations that happen as part of a checkpoint. For each
of these operations I am updating the phase field of the
pg_stat_progress_checkpoint view. The values supported for this field
are: initializing, checkpointing replication slots, checkpointing
snapshots, checkpointing logical rewrite mappings, checkpointing CLOG
pages, checkpointing CommitTs pages, checkpointing SUBTRANS pages,
checkpointing MULTIXACT pages, checkpointing SLRU pages, checkpointing
buffers, performing sync requests, performing two phase checkpoint,
recycling old XLOG files and finalizing. In the checkpointing buffers
phase, the fields total_buffer_writes, buffers_processed and
buffers_written show the detailed progress of writing buffers. In the
performing sync requests phase, the fields total_file_syncs and
files_synced show the detailed progress of syncing files. In the other
phases, only the phase field is updated; it is difficult to show
finer-grained progress because we cannot get the total file count
without traversing the directory, and it is not worth calculating that
as it would affect the performance of the checkpoint. I also gave a
thought to just reporting the number of files processed, but this won't
give meaningful progress information (it can be treated as statistics).
Hence only the phase field is updated in those scenarios.

Apart from the above fields, I am planning to add a few more fields to
the view in the next patch: the process ID of the backend that
triggered a CHECKPOINT command, the checkpoint start location, a field
to indicate whether it is a checkpoint or a restartpoint, and the
elapsed time of the checkpoint operation. Please share your thoughts. I
would be happy to add any other information that contributes to showing
the progress of the checkpoint.

As per the discussion in this thread, there should be some mechanism to
show the progress of a checkpoint during the shutdown and
end-of-recovery cases, as we cannot access pg_stat_progress_checkpoint
in those cases. I am working on using the
log_startup_progress_interval mechanism to log the progress in the
server logs.

Kindly review the patch and share your thoughts.
On Fri, Jan 28, 2022 at 12:24 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Fri, Jan 21, 2022 at 11:07 AM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:

> I think the right choice to solve the *general* problem is the
> mentioned pg_stat_progress_checkpoints.
>
> We may want to *additionally* have the ability to log the progress
> specifically for the special cases when we're not able to use that
> view. And in those cases, we can perhaps just use the existing
> log_startup_progress_interval parameter for this as well -- at least
> for the startup checkpoint.

+1

> We need at least a trace of the number of buffers to sync (num_to_scan)
> before the checkpoint start, instead of just emitting the stats at the
> end.
>
> Bharat, it would be good to show the buffers synced counter and the
> total buffers to sync, checkpointer pid, substep it is running, whether
> it is on target for completion, checkpoint_Reason (manual/times/forced).
> BufferSync has several variables tracking the sync progress locally,
> and we may need some refactoring here.
>
> I agree to provide the above mentioned information as part of showing
> the progress of the current checkpoint operation. I am currently
> looking into the code to see if any other information can be added.

As suggested in the other thread by Julien, I'm changing the subject
of this thread to reflect the discussion.

Regards,
Bharath Rupireddy.
On Tue, 15 Feb 2022 at 13:16, Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
> [...] This mechanism works fine when all the
> progress information is of type integer as the data type of
> 'st_progress_param' is of type integer. If the progress data is of
> different type than integer, then there is no easy way to do so.
Progress parameters are int64, so all of the new 'checkpoint start
location' (lsn = uint64), 'triggering backend PID' (int), 'elapsed
time' (store as start time in stat_progress, timestamp fits in 64
bits) and 'checkpoint or restartpoint?' (boolean) would each fit in a
current stat_progress parameter. Some processing would be required at
the view, but that's not impossible to overcome.
Kind regards,
Matthias van de Meent
On Thu, 10 Feb 2022 at 07:53, Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
We need at least a trace of the number of buffers to sync (num_to_scan) before the checkpoint start, instead of just emitting the stats at the end.
Bharat, it would be good to show the buffers synced counter and the total buffers to sync, checkpointer pid, substep it is running, whether it is on target for completion, checkpoint_Reason
(manual/times/forced). BufferSync has several variables tracking the sync progress locally, and we may need some refactoring here.I agree to provide above mentioned information as part of showing the
progress of current checkpoint operation. I am currently looking into
the code to know if any other information can be added.Here is the initial patch to show the progress of checkpoint through
pg_stat_progress_checkpoint view. Please find the attachment.The information added to this view are pid - process ID of a
CHECKPOINTER process, kind - kind of checkpoint indicates the reason
for checkpoint (values can be wal, time or force), phase - indicates
the current phase of checkpoint operation, total_buffer_writes - total
number of buffers to be written, buffers_processed - number of buffers
processed, buffers_written - number of buffers written,
total_file_syncs - total number of files to be synced, files_synced -
number of files synced.There are many operations happen as part of checkpoint. For each of
the operation I am updating the phase field of
pg_stat_progress_checkpoint view. The values supported for this field
are initializing, checkpointing replication slots, checkpointing
snapshots, checkpointing logical rewrite mappings, checkpointing CLOG
pages, checkpointing CommitTs pages, checkpointing SUBTRANS pages,
checkpointing MULTIXACT pages, checkpointing SLRU pages, checkpointing
buffers, performing sync requests, performing two phase checkpoint,
recycling old XLOG files and Finalizing. In case of checkpointing
buffers phase, the fields total_buffer_writes, buffers_processed and
buffers_written shows the detailed progress of writing buffers. In
case of performing sync requests phase, the fields total_file_syncs
and files_synced shows the detailed progress of syncing files. In
other phases, only the phase field is getting updated and it is
difficult to show the progress because we do not get the total number
of files count without traversing the directory. It is not worth to
calculate that as it affects the performance of the checkpoint. I also
gave a thought to just mention the number of files processed, but this
wont give a meaningful progress information (It can be treated as
statistics). Hence just updating the phase field in those scenarios.Apart from above fields, I am planning to add few more fields to the
view in the next patch. That is, process ID of the backend process
which triggered a CHECKPOINT command, checkpoint start location, filed
to indicate whether it is a checkpoint or restartpoint and elapsed
time of the checkpoint operation. Please share your thoughts. I would
be happy to add any other information that contributes to showing the
progress of checkpoint.As per the discussion in this thread, there should be some mechanism
to show the progress of checkpoint during shutdown and end-of-recovery
cases as we cannot access pg_stat_progress_checkpoint in those cases.
I am working on this to use log_startup_progress_interval mechanism to
log the progress in the server logs.Kindly review the patch and share your thoughts.
Interesting idea, and overall a nice addition to the
pg_stat_progress_* reporting infrastructure.
Could you add your patch to the current commitfest at
https://commitfest.postgresql.org/37/?
See below for some comments on the patch:
xlog.c @ checkpoint_progress_start, checkpoint_progress_update_param,
checkpoint_progress_end:

> + /* In bootstrap mode, we don't actually record anything. */
> + if (IsBootstrapProcessingMode())
> + return;
Why do you check against the state of the system?
pgstat_progress_update_* already provides protections against updating
the progress tables if the progress infrastructure is not loaded; and
otherwise (in the happy path) the cost of updating the progress fields
will be quite a bit higher than normal. Updating stat_progress isn't
very expensive (quite cheap, really), so I don't quite get why you
guard against reporting stats when you expect no other client to be
listening.
I think you can simplify this a lot by directly using
pgstat_progress_update_param() instead.
xlog.c @ checkpoint_progress_start:

> + pgstat_progress_start_command(PROGRESS_COMMAND_CHECKPOINT, InvalidOid);
> + checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_PHASE,
> + PROGRESS_CHECKPOINT_PHASE_INIT);
> + if (flags & CHECKPOINT_CAUSE_XLOG)
> + checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND,
> + PROGRESS_CHECKPOINT_KIND_WAL);
> + else if (flags & CHECKPOINT_CAUSE_TIME)
> + checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND,
> + PROGRESS_CHECKPOINT_KIND_TIME);
> + [...]
Could you assign the kind of checkpoint to a local variable, and then
update the "phase" and "kind" parameters at the same time through
pgstat_progress_update_multi_param(2, ...)? See
BuildRelationExtStatistics in extended_stats.c for an example usage.
Note that regardless of whether checkpoint_progress_update* will
remain, the checks done in that function already have been checked in
this function as well, so you can use the pgstat_* functions directly.
monitoring.sgml:

> + <structname>pg_stat_progress_checkpoint</structname> view will contain a
> + single row indicating the progress of checkpoint operation.
... add "if a checkpoint is currently active".
> + <structfield>total_buffer_writes</structfield> <type>bigint</type>
> + <structfield>total_file_syncs</structfield> <type>bigint</type>
The other progress tables use [type]_total as column names for counter
targets (e.g. backup_total for backup_streamed, heap_blks_total for
heap_blks_scanned, etc.). I think that `buffers_total` and
`files_total` would be better column names.
> + The checkpoint operation is requested due to XLOG filling.

Suggested: "The checkpoint was started because max_wal_size of WAL was
written."

> + The checkpoint operation is requested due to timeout.

Suggested: "The checkpoint was started due to the expiration of a
checkpoint_timeout interval."

> + The checkpoint operation is forced even if no XLOG activity has occurred
> + since the last one.

Suggested: "Some operation forced a checkpoint."
> + <entry><literal>checkpointing CommitTs pages</literal></entry>

CommitTs -> Commit time stamp
Thanks for working on this.
Kind regards,
Matthias van de Meent
> > [...] This mechanism works fine when all the
> > progress information is of type integer as the data type of
> > 'st_progress_param' is of type integer. If the progress data is of
> > different type than integer, then there is no easy way to do so.
>
> Progress parameters are int64, so all of the new 'checkpoint start
> location' (lsn = uint64), 'triggering backend PID' (int), 'elapsed
> time' (store as start time in stat_progress, timestamp fits in 64
> bits) and 'checkpoint or restartpoint?' (boolean) would each fit in a
> current stat_progress parameter. Some processing would be required at
> the view, but that's not impossible to overcome.
Thank you for sharing the information. 'triggering backend PID' (int)
can be stored without any problem. 'checkpoint or restartpoint?'
(boolean) can be stored as an integer value, like
PROGRESS_CHECKPOINT_TYPE_CHECKPOINT (0) and
PROGRESS_CHECKPOINT_TYPE_RESTARTPOINT (1). 'elapsed time' (store as
start time in stat_progress, timestamp fits in 64 bits): since
Timestamptz is of type int64 internally, we can store the timestamp
value in the progress parameter and then expose a function like
'pg_stat_get_progress_checkpoint_elapsed' which takes an int64 (not
Timestamptz) as its argument and returns a string representing the
elapsed time. This function can then be called in the view. Is it
safe/advisable to use the int64 type here rather than Timestamptz for
this purpose? 'checkpoint start location' (lsn = uint64): I feel we
cannot use progress parameters for this case, as assigning uint64 to
an int64 type would be an issue for larger values and can lead to
hidden bugs.
Thoughts?
Thanks & Regards,
Nitin Jadhav
On Thu, Feb 17, 2022 at 1:33 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
Show quoted text
On Thu, 10 Feb 2022 at 07:53, Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:We need at least a trace of the number of buffers to sync (num_to_scan) before the checkpoint start, instead of just emitting the stats at the end.
Bharat, it would be good to show the buffers synced counter and the total buffers to sync, checkpointer pid, substep it is running, whether it is on target for completion, checkpoint_Reason
(manual/times/forced). BufferSync has several variables tracking the sync progress locally, and we may need some refactoring here.I agree to provide above mentioned information as part of showing the
progress of current checkpoint operation. I am currently looking into
the code to know if any other information can be added.Here is the initial patch to show the progress of checkpoint through
pg_stat_progress_checkpoint view. Please find the attachment.The information added to this view are pid - process ID of a
CHECKPOINTER process, kind - kind of checkpoint indicates the reason
for checkpoint (values can be wal, time or force), phase - indicates
the current phase of checkpoint operation, total_buffer_writes - total
number of buffers to be written, buffers_processed - number of buffers
processed, buffers_written - number of buffers written,
total_file_syncs - total number of files to be synced, files_synced -
number of files synced.There are many operations happen as part of checkpoint. For each of
the operation I am updating the phase field of
pg_stat_progress_checkpoint view. The values supported for this field
are initializing, checkpointing replication slots, checkpointing
snapshots, checkpointing logical rewrite mappings, checkpointing CLOG
pages, checkpointing CommitTs pages, checkpointing SUBTRANS pages,
checkpointing MULTIXACT pages, checkpointing SLRU pages, checkpointing
buffers, performing sync requests, performing two phase checkpoint,
recycling old XLOG files and Finalizing. In case of checkpointing
buffers phase, the fields total_buffer_writes, buffers_processed and
buffers_written shows the detailed progress of writing buffers. In
case of performing sync requests phase, the fields total_file_syncs
and files_synced shows the detailed progress of syncing files. In
other phases, only the phase field is getting updated and it is
difficult to show the progress because we do not get the total number
of files count without traversing the directory. It is not worth to
calculate that as it affects the performance of the checkpoint. I also
gave a thought to just mention the number of files processed, but this
wont give a meaningful progress information (It can be treated as
statistics). Hence just updating the phase field in those scenarios.Apart from above fields, I am planning to add few more fields to the
view in the next patch. That is, process ID of the backend process
which triggered a CHECKPOINT command, checkpoint start location, filed
to indicate whether it is a checkpoint or restartpoint and elapsed
time of the checkpoint operation. Please share your thoughts. I would
be happy to add any other information that contributes to showing the
progress of checkpoint.As per the discussion in this thread, there should be some mechanism
to show the progress of checkpoint during shutdown and end-of-recovery
cases as we cannot access pg_stat_progress_checkpoint in those cases.
I am working on this to use log_startup_progress_interval mechanism to
log the progress in the server logs.Kindly review the patch and share your thoughts.
Interesting idea, and overall a nice addition to the
pg_stat_progress_* reporting infrastructure.Could you add your patch to the current commitfest at
https://commitfest.postgresql.org/37/?See below for some comments on the patch:
xlog.c @ checkpoint_progress_start, checkpoint_progress_update_param, checkpoint_progress_end + /* In bootstrap mode, we don't actually record anything. */ + if (IsBootstrapProcessingMode()) + return;Why do you check against the state of the system?
pgstat_progress_update_* already provides protections against updating
the progress tables if the progress infrastructure is not loaded; and
otherwise (in the happy path) the cost of updating the progress fields
will be quite a bit higher than normal. Updating stat_progress isn't
very expensive (quite cheap, really), so I don't quite get why you
guard against reporting stats when you expect no other client to be
listening.I think you can simplify this a lot by directly using
pgstat_progress_update_param() instead.xlog.c @ checkpoint_progress_start + pgstat_progress_start_command(PROGRESS_COMMAND_CHECKPOINT, InvalidOid); + checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_PHASE, + PROGRESS_CHECKPOINT_PHASE_INIT); + if (flags & CHECKPOINT_CAUSE_XLOG) + checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND, + PROGRESS_CHECKPOINT_KIND_WAL); + else if (flags & CHECKPOINT_CAUSE_TIME) + checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND, + PROGRESS_CHECKPOINT_KIND_TIME); + [...]Could you assign the kind of checkpoint to a local variable, and then
update the "phase" and "kind" parameters at the same time through
pgstat_progress_update_multi_param(2, ...)? See
BuildRelationExtStatistics in extended_stats.c for an example usage.
Note that regardless of whether checkpoint_progress_update* will
remain, the checks done in that function already have been checked in
this function as well, so you can use the pgstat_* functions directly.monitoring.sgml + <structname>pg_stat_progress_checkpoint</structname> view will contain a + single row indicating the progress of checkpoint operation.... add "if a checkpoint is currently active".
+ <structfield>total_buffer_writes</structfield> <type>bigint</type>
+ <structfield>total_file_syncs</structfield> <type>bigint</type>

The other progress tables use [type]_total as column names for counter
targets (e.g. backup_total for backup_streamed, heap_blks_total for
heap_blks_scanned, etc.). I think that `buffers_total` and
`files_total` would be better column names.

+ The checkpoint operation is requested due to XLOG filling.

+ The checkpoint was started because max_wal_size of WAL was written.

+ The checkpoint operation is requested due to timeout.

+ The checkpoint was started due to the expiration of a
+ checkpoint_timeout interval.

+ The checkpoint operation is forced even if no XLOG activity has occurred
+ since the last one.

+ Some operation forced a checkpoint.
+ <entry><literal>checkpointing CommitTs pages</literal></entry>
CommitTs -> Commit time stamp
Thanks for working on this.
Kind regards,
Matthias van de Meent
Hi,
On Thu, Feb 17, 2022 at 12:26:07PM +0530, Nitin Jadhav wrote:
Thank you for sharing the information. 'triggering backend PID' (int)
- can be stored without any problem.
There can be multiple processes triggering a checkpoint, or at least wanting it
to happen or happen faster.
'checkpoint or restartpoint?'
Do you actually need to store that? Can't it be inferred from
pg_is_in_recovery()?
On Thu, 17 Feb 2022 at 07:56, Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
Progress parameters are int64, so all of the new 'checkpoint start
location' (lsn = uint64), 'triggering backend PID' (int), 'elapsed
time' (store as start time in stat_progress, timestamp fits in 64
bits) and 'checkpoint or restartpoint?' (boolean) would each fit in a
current stat_progress parameter. Some processing would be required at
the view, but that's not impossible to overcome.

Thank you for sharing the information. 'triggering backend PID' (int)
- can be stored without any problem. 'checkpoint or restartpoint?'
(boolean) - can be stored as an integer value like
PROGRESS_CHECKPOINT_TYPE_CHECKPOINT(0) and
PROGRESS_CHECKPOINT_TYPE_RESTARTPOINT(1). 'elapsed time' (store as
start time in stat_progress, timestamp fits in 64 bits) - As
Timestamptz is of type int64 internally, we can store the timestamp
value in the progress parameter and then expose a function like
'pg_stat_get_progress_checkpoint_elapsed' which takes int64 (not
Timestamptz) as argument and then returns string representing the
elapsed time.
No need to use a string there; I think exposing the checkpoint start
time is good enough. The conversion of int64 to timestamp[tz] can be
done in SQL (although I'm not sure the internal bitwise
representation of Interval should be exposed to that extent) [0].
Users can then extract the duration interval using now() - start_time,
which also allows the user to use their own preferred formatting.
This function can be called in the view. Is it
safe/advisable to use int64 type here rather than Timestamptz for this
purpose?
Yes, this must be exposed through int64, as the sql-callable
pg_stat_get_progress_info only exposes bigint columns. Any
transformation function may return other types (see
pg_indexam_progress_phasename for an example of that).
'checkpoint start location' (lsn = uint64) - I feel we
cannot use progress parameters for this case. As assigning uint64 to
int64 type would be an issue for larger values and can lead to hidden
bugs.
Not necessarily - we can (without much trouble) do a bitwise cast from
uint64 to int64, and then (in SQL) cast it back to a pg_lsn [1]. Not
very elegant, but it works quite well.
Kind regards,
Matthias van de Meent
[0]: Assuming we don't care about the years past 294246 CE (2942467 is
when int64 overflows into negatives), the following works without any
precision losses:

SELECT to_timestamp((stat.my_int64::bigint/1000000)::float8) +
       make_interval(0, 0, 0, 0, 0, 0,
                     MOD(stat.my_int64, 1000000)::float8 / 1000000::float8)
FROM (SELECT 1::bigint) AS stat(my_int64);
[1]:

SELECT '0/0'::pg_lsn +
       ((CASE WHEN stat.my_int64 < 0
              THEN pow(2::numeric, 64::numeric)::numeric
              ELSE 0::numeric END) + stat.my_int64::numeric)
FROM (SELECT -2::bigint /* 0xFFFFFFFF/FFFFFFFE */ AS my_bigint_lsn) AS stat(my_int64);
Thank you for sharing the information. 'triggering backend PID' (int)
- can be stored without any problem.

There can be multiple processes triggering a checkpoint, or at least wanting it
to happen or happen faster.
Yes. There can be multiple processes but there will be one checkpoint
operation at a time. So the backend PID corresponds to the current
checkpoint operation. Let me know if I am missing something.
'checkpoint or restartpoint?'
Do you actually need to store that? Can't it be inferred from
pg_is_in_recovery()?
AFAIK we cannot use pg_is_in_recovery() to predict whether it is a
checkpoint or restartpoint because if the system exits from recovery
mode during restartpoint then any query to pg_stat_progress_checkpoint
view will return it as a checkpoint which is ideally not correct. Please
correct me if I am wrong.
Thanks & Regards,
Nitin Jadhav
Hi,
On Thu, Feb 17, 2022 at 10:39:02PM +0530, Nitin Jadhav wrote:
Thank you for sharing the information. 'triggering backend PID' (int)
- can be stored without any problem.

There can be multiple processes triggering a checkpoint, or at least wanting it
to happen or happen faster.

Yes. There can be multiple processes but there will be one checkpoint
operation at a time. So the backend PID corresponds to the current
checkpoint operation. Let me know if I am missing something.
If there's a timed checkpoint triggered and then someone calls
pg_start_backup(), which then waits for the end of the current checkpoint
(possibly after changing the flags), I think the view should reflect that in
some way. Maybe storing an array of (pid, flags) is too much, but at least a
counter with the number of processes actively waiting for the end of the
checkpoint.
'checkpoint or restartpoint?'
Do you actually need to store that? Can't it be inferred from
pg_is_in_recovery()?

AFAIK we cannot use pg_is_in_recovery() to predict whether it is a
checkpoint or restartpoint because if the system exits from recovery
mode during restartpoint then any query to pg_stat_progress_checkpoint
view will return it as a checkpoint which is ideally not correct. Please
correct me if I am wrong.
Recovery ends with an end-of-recovery checkpoint that has to finish before the
promotion can happen, so I don't think that a restart can still be in progress
if pg_is_in_recovery() returns false.
Interesting idea, and overall a nice addition to the
pg_stat_progress_* reporting infrastructure.

Could you add your patch to the current commitfest at
https://commitfest.postgresql.org/37/?

See below for some comments on the patch:
Thank you for reviewing.
I have added it to the commitfest - https://commitfest.postgresql.org/37/3545/
xlog.c @ checkpoint_progress_start, checkpoint_progress_update_param, checkpoint_progress_end

+ /* In bootstrap mode, we don't actually record anything. */
+ if (IsBootstrapProcessingMode())
+     return;

Why do you check against the state of the system?
pgstat_progress_update_* already provides protections against updating
the progress tables if the progress infrastructure is not loaded; and
otherwise (in the happy path) the cost of updating the progress fields
will be quite a bit higher than normal. Updating stat_progress isn't
very expensive (quite cheap, really), so I don't quite get why you
guard against reporting stats when you expect no other client to be
listening.
Nice point. I agree that the extra guards (IsBootstrapProcessingMode()
and (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY)) ==
0) are not needed as the progress reporting mechanism handles that
internally (It only updates when there is an access to the
pg_stat_progress_activity view). I am planning to add the progress of
checkpoint during shutdown and end-of-recovery cases in server logs as
we don't have access to the view. In this case these guards are
necessary. checkpoint_progress_update_param() is a generic function to
report progress to the view or server logs. Thoughts?
I think you can simplify this a lot by directly using
pgstat_progress_update_param() instead.

xlog.c @ checkpoint_progress_start

+ pgstat_progress_start_command(PROGRESS_COMMAND_CHECKPOINT, InvalidOid);
+ checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_PHASE,
+                                  PROGRESS_CHECKPOINT_PHASE_INIT);
+ if (flags & CHECKPOINT_CAUSE_XLOG)
+     checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND,
+                                      PROGRESS_CHECKPOINT_KIND_WAL);
+ else if (flags & CHECKPOINT_CAUSE_TIME)
+     checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND,
+                                      PROGRESS_CHECKPOINT_KIND_TIME);
+ [...]

Could you assign the kind of checkpoint to a local variable, and then
update the "phase" and "kind" parameters at the same time through
pgstat_progress_update_multi_param(2, ...)? See
BuildRelationExtStatistics in extended_stats.c for an example usage.
I will make use of pgstat_progress_update_multi_param() in the next
patch to replace multiple calls to checkpoint_progress_update_param().
Note that regardless of whether checkpoint_progress_update* will
remain, the checks done in that function already have been checked in
this function as well, so you can use the pgstat_* functions directly.
As I mentioned before I am planning to add progress reporting in the
server logs, checkpoint_progress_update_param() is required and it
makes the job easier.
monitoring.sgml

+ <structname>pg_stat_progress_checkpoint</structname> view will contain a
+ single row indicating the progress of checkpoint operation.

... add "if a checkpoint is currently active".
I feel adding extra words here to indicate "if a checkpoint is
currently active" is not necessary as the view description provides
that information and also it aligns with the documentation of existing
progress views.
+ <structfield>total_buffer_writes</structfield> <type>bigint</type>
+ <structfield>total_file_syncs</structfield> <type>bigint</type>

The other progress tables use [type]_total as column names for counter
targets (e.g. backup_total for backup_streamed, heap_blks_total for
heap_blks_scanned, etc.). I think that `buffers_total` and
`files_total` would be better column names.
I agree and I will update this in the next patch.
+ The checkpoint operation is requested due to XLOG filling.

+ The checkpoint was started because max_wal_size of WAL was written.
How about this "The checkpoint is started because max_wal_size is reached".
+ The checkpoint operation is requested due to timeout.

+ The checkpoint was started due to the expiration of a
+ checkpoint_timeout interval.
"The checkpoint is started because checkpoint_timeout expired".
+ The checkpoint operation is forced even if no XLOG activity has occurred
+ since the last one.

+ Some operation forced a checkpoint.
"The checkpoint is started because some operation forced a checkpoint".
+ <entry><literal>checkpointing CommitTs pages</literal></entry>
CommitTs -> Commit time stamp
I will handle this in the next patch.
Thanks & Regards,
Nitin Jadhav
On Thu, 10 Feb 2022 at 07:53, Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:

We need at least a trace of the number of buffers to sync (num_to_scan)
before the checkpoint start, instead of just emitting the stats at the end.

Bharat, it would be good to show the buffers synced counter and the
total buffers to sync, checkpointer pid, substep it is running, whether
it is on target for completion, checkpoint_Reason (manual/times/forced).
BufferSync has several variables tracking the sync progress locally, and
we may need some refactoring here.

I agree to provide the above mentioned information as part of showing the
progress of the current checkpoint operation. I am currently looking into
the code to know if any other information can be added.

Here is the initial patch to show the progress of checkpoint through the
pg_stat_progress_checkpoint view. Please find the attachment.

The information added to this view is: pid - process ID of the
CHECKPOINTER process, kind - kind of checkpoint, indicating the reason
for the checkpoint (values can be wal, time or force), phase - the
current phase of the checkpoint operation, total_buffer_writes - total
number of buffers to be written, buffers_processed - number of buffers
processed, buffers_written - number of buffers written,
total_file_syncs - total number of files to be synced, and
files_synced - number of files synced.

There are many operations that happen as part of a checkpoint. For each
of these operations I am updating the phase field of the
pg_stat_progress_checkpoint view. The values supported for this field
are initializing, checkpointing replication slots, checkpointing
snapshots, checkpointing logical rewrite mappings, checkpointing CLOG
pages, checkpointing CommitTs pages, checkpointing SUBTRANS pages,
checkpointing MULTIXACT pages, checkpointing SLRU pages, checkpointing
buffers, performing sync requests, performing two phase checkpoint,
recycling old XLOG files and finalizing. In the checkpointing buffers
phase, the fields total_buffer_writes, buffers_processed and
buffers_written show the detailed progress of writing buffers. In the
performing sync requests phase, the fields total_file_syncs and
files_synced show the detailed progress of syncing files. In the other
phases, only the phase field is updated; it is difficult to show their
progress because we do not get the total file count without traversing
the directory, and it is not worth calculating that as it affects the
performance of the checkpoint. I also gave a thought to just reporting
the number of files processed, but this won't give meaningful progress
information (it can be treated as statistics). Hence I am just updating
the phase field in those scenarios.

Apart from the above fields, I am planning to add a few more fields to
the view in the next patch. That is, the process ID of the backend
process which triggered a CHECKPOINT command, the checkpoint start
location, a field to indicate whether it is a checkpoint or
restartpoint, and the elapsed time of the checkpoint operation. Please
share your thoughts. I would be happy to add any other information that
contributes to showing the progress of the checkpoint.

As per the discussion in this thread, there should be some mechanism to
show the progress of a checkpoint during the shutdown and
end-of-recovery cases, as we cannot access pg_stat_progress_checkpoint
then. I am working on using the log_startup_progress_interval mechanism
to log the progress in the server logs.

Kindly review the patch and share your thoughts.
Thank you for sharing the information. 'triggering backend PID' (int)
- can be stored without any problem.

There can be multiple processes triggering a checkpoint, or at least wanting it
to happen or happen faster.

Yes. There can be multiple processes but there will be one checkpoint
operation at a time. So the backend PID corresponds to the current
checkpoint operation. Let me know if I am missing something.

If there's a timed checkpoint triggered and then someone calls
pg_start_backup(), which then waits for the end of the current checkpoint
(possibly after changing the flags), I think the view should reflect that in
some way. Maybe storing an array of (pid, flags) is too much, but at least a
counter with the number of processes actively waiting for the end of the
checkpoint.
Okay. I feel this can be added as an additional field but it will not
replace the backend_pid field, as that represents the pid of the backend
which triggered the current checkpoint. Probably a new field named
'processes_waiting' or 'events_waiting' can be added for this purpose.
Thoughts?
'checkpoint or restartpoint?'
Do you actually need to store that? Can't it be inferred from
pg_is_in_recovery()?

AFAIK we cannot use pg_is_in_recovery() to predict whether it is a
checkpoint or restartpoint because if the system exits from recovery
mode during restartpoint then any query to pg_stat_progress_checkpoint
view will return it as a checkpoint which is ideally not correct. Please
correct me if I am wrong.

Recovery ends with an end-of-recovery checkpoint that has to finish before the
promotion can happen, so I don't think that a restart can still be in progress
if pg_is_in_recovery() returns false.
Probably the writing of buffers or syncing of files may complete before
pg_is_in_recovery() returns false. But there are some cleanup
operations that happen as part of the checkpoint. During this scenario, we
may get a false value for pg_is_in_recovery(). Please refer to the following
piece of code, which is present in CreateRestartpoint():
if (!RecoveryInProgress())
    replayTLI = XLogCtl->InsertTimeLineID;
Thanks & Regards,
Nitin Jadhav
Hi,
On Fri, Feb 18, 2022 at 12:20:26PM +0530, Nitin Jadhav wrote:
If there's a checkpoint timed triggered and then someone calls
pg_start_backup() which then wait for the end of the current checkpoint
(possibly after changing the flags), I think the view should reflect that in
some way. Maybe storing an array of (pid, flags) is too much, but at least a
counter with the number of processes actively waiting for the end of the
checkpoint.

Okay. I feel this can be added as additional field but it will not
replace backend_pid field as this represents the pid of the backend
which triggered the current checkpoint.
I don't think that's true. Requesting a checkpoint means telling the
checkpointer that it should wake up and start a checkpoint (or restore point)
if it's not already doing so, so the pid will always be the checkpointer pid.
The only exception is a standalone backend, but in that case you won't be able
to query that view anyway.
And also while looking at the patch I see there's the same problem that I
mentioned in the previous thread, which is that the effective flags can be
updated once the checkpoint started, and as-is the view won't reflect that. It
also means that you can't simply display one of wal, time or force but a
possible combination of the flags (including the one not handled in v1).
Probably a new field named 'processes_waiting' or 'events_waiting' can be
added for this purpose.
Maybe num_process_waiting?
'checkpoint or restartpoint?'
Do you actually need to store that? Can't it be inferred from
pg_is_in_recovery()?

AFAIK we cannot use pg_is_in_recovery() to predict whether it is a
checkpoint or restartpoint because if the system exits from recovery
mode during restartpoint then any query to pg_stat_progress_checkpoint
view will return it as a checkpoint which is ideally not correct. Please
correct me if I am wrong.

Recovery ends with an end-of-recovery checkpoint that has to finish before the
promotion can happen, so I don't think that a restart can still be in progress
if pg_is_in_recovery() returns false.

Probably writing of buffers or syncing files may complete before
pg_is_in_recovery() returns false. But there are some cleanup
operations happen as part of the checkpoint. During this scenario, we
may get false value for pg_is_in_recovery(). Please refer following
piece of code which is present in CreateRestartpoint().

if (!RecoveryInProgress())
    replayTLI = XLogCtl->InsertTimeLineID;
Then maybe we could store the timeline rather than the kind of checkpoint?
You should still be able to compute the information while giving a bit more
information for the same memory usage.
Okay. I feel this can be added as additional field but it will not
replace backend_pid field as this represents the pid of the backend
which triggered the current checkpoint.

I don't think that's true. Requesting a checkpoint means telling the
checkpointer that it should wake up and start a checkpoint (or restore point)
if it's not already doing so, so the pid will always be the checkpointer pid.
The only exception is a standalone backend, but in that case you won't be able
to query that view anyway.
Yes. I agree that the checkpoint will always be performed by the
checkpointer process. So the pid in the pg_stat_progress_checkpoint
view will always correspond to the checkpointer pid only. Checkpoints
get triggered in many scenarios. One of the cases is the CHECKPOINT
command issued explicitly by the backend. In this scenario I would
like to know the backend pid which triggered the checkpoint. Hence I
would like to add a backend_pid field. So the
pg_stat_progress_checkpoint view contains a pid field as well as a
backend_pid field. The backend_pid contains a valid value only during
the CHECKPOINT command issued by the backend explicitly, otherwise the
value will be 0. We may have to add an additional field to
'CheckpointerShmemStruct' to hold the backend pid. The backend
requesting the checkpoint will update its pid to this structure.
Kindly let me know if you still feel the backend_pid field is not
necessary.
And also while looking at the patch I see there's the same problem that I
mentioned in the previous thread, which is that the effective flags can be
updated once the checkpoint started, and as-is the view won't reflect that. It
also means that you can't simply display one of wal, time or force but a
possible combination of the flags (including the one not handled in v1).
If I understand the above comment properly, it has 2 points. First is
to display the combination of flags rather than just displaying wal,
time or force - The idea behind this is to just let the user know the
reason for checkpointing. That is, the checkpoint is started because
max_wal_size is reached or checkpoint_timeout expired or explicitly
issued CHECKPOINT command. The other flags like CHECKPOINT_IMMEDIATE,
CHECKPOINT_WAIT or CHECKPOINT_FLUSH_ALL indicate how the checkpoint
has to be performed. Hence I have not included those in the view. If
it is really required, I would like to modify the code to include
other flags and display the combination. Second point is to reflect
the updated flags in the view. AFAIK, there is a possibility that the
flags get updated during the ongoing checkpoint but the reason for
checkpoint (wal, time or force) will remain same for the current
checkpoint. There might be a change in how checkpoint has to be
performed if CHECKPOINT_IMMEDIATE flag is set. If we go with
displaying the combination of flags in the view, then probably we may
have to reflect this in the view.
Probably a new field named 'processes_waiting' or 'events_waiting' can be
added for this purpose.

Maybe num_process_waiting?
I feel 'processes_waiting' aligns more with the naming conventions of
the fields of the existing progress views.
Probably writing of buffers or syncing files may complete before
pg_is_in_recovery() returns false. But there are some cleanup
operations happen as part of the checkpoint. During this scenario, we
may get false value for pg_is_in_recovery(). Please refer following
piece of code which is present in CreateRestartpoint().

if (!RecoveryInProgress())
    replayTLI = XLogCtl->InsertTimeLineID;

Then maybe we could store the timeline rather than the kind of checkpoint?
You should still be able to compute the information while giving a bit more
information for the same memory usage.
Can you please describe more about how a checkpoint/restartpoint can be
distinguished using the timeline ID?
Thanks & Regards,
Nitin Jadhav
Hi,
On Fri, Feb 18, 2022 at 08:07:05PM +0530, Nitin Jadhav wrote:
The backend_pid contains a valid value only during
the CHECKPOINT command issued by the backend explicitly, otherwise the
value will be 0. We may have to add an additional field to
'CheckpointerShmemStruct' to hold the backend pid. The backend
requesting the checkpoint will update its pid to this structure.
Kindly let me know if you still feel the backend_pid field is not
necessary.
There are more scenarios where you can have a backend requesting a checkpoint
and waiting for its completion, and there may be more than one backend
concerned, so I don't think that storing only one / the first backend pid is
ok.
And also while looking at the patch I see there's the same problem that I
mentioned in the previous thread, which is that the effective flags can be
updated once the checkpoint started, and as-is the view won't reflect that. It
also means that you can't simply display one of wal, time or force but a
possible combination of the flags (including the one not handled in v1).

If I understand the above comment properly, it has 2 points. First is
to display the combination of flags rather than just displaying wal,
time or force - The idea behind this is to just let the user know the
reason for checkpointing. That is, the checkpoint is started because
max_wal_size is reached or checkpoint_timeout expired or explicitly
issued CHECKPOINT command. The other flags like CHECKPOINT_IMMEDIATE,
CHECKPOINT_WAIT or CHECKPOINT_FLUSH_ALL indicate how the checkpoint
has to be performed. Hence I have not included those in the view. If
it is really required, I would like to modify the code to include
other flags and display the combination.
I think all the information should be exposed. Only knowing why the current
checkpoint has been triggered without any further information seems a bit
useless. Think for instance for cases like [1]
(/messages/by-id/1486805889.24568.96.camel@credativ.de).
Second point is to reflect
the updated flags in the view. AFAIK, there is a possibility that the
flags get updated during the on-going checkpoint but the reason for
checkpoint (wal, time or force) will remain same for the current
checkpoint. There might be a change in how checkpoint has to be
performed if CHECKPOINT_IMMEDIATE flag is set. If we go with
displaying the combination of flags in the view, then probably we may
have to reflect this in the view.
You can only "upgrade" a checkpoint, but not "downgrade" it. So if for
instance you find both CHECKPOINT_CAUSE_TIME and CHECKPOINT_FORCE (which is
possible) you can easily know which one was the one that triggered the
checkpoint and which one was added later.
Probably a new field named 'processes_wiating' or 'events_waiting' can be
added for this purpose.

Maybe num_process_waiting?

I feel 'processes_wiating' aligns more with the naming conventions of
the fields of the existing progress views.
There's at least pg_stat_progress_vacuum.num_dead_tuples. Anyway I don't have
a strong opinion on it, just make sure to correct the typo.
Probably writing of buffers or syncing files may complete before
pg_is_in_recovery() returns false. But there are some cleanup
operations that happen as part of the checkpoint. During this scenario, we
may get a false value for pg_is_in_recovery(). Please refer to the following
piece of code, which is present in CreateRestartPoint():

    if (!RecoveryInProgress())
        replayTLI = XLogCtl->InsertTimeLineID;

Then maybe we could store the timeline rather than the kind of checkpoint?
You should still be able to compute the information while giving a bit more
information for the same memory usage.

Can you please describe more about how checkpoint/restartpoint can be
confirmed using the timeline id.
If pg_is_in_recovery() is true, then it's a restartpoint, otherwise it's a
restartpoint if the checkpoint's timeline is different from the current
timeline?
Thank you for sharing the information. 'triggering backend PID' (int)
- can be stored without any problem. 'checkpoint or restartpoint?'
(boolean) - can be stored as an integer value like
PROGRESS_CHECKPOINT_TYPE_CHECKPOINT(0) and
PROGRESS_CHECKPOINT_TYPE_RESTARTPOINT(1). 'elapsed time' (store as
start time in stat_progress, timestamp fits in 64 bits) - As
Timestamptz is of type int64 internally, so we can store the timestamp
value in the progress parameter and then expose a function like
'pg_stat_get_progress_checkpoint_elapsed' which takes int64 (not
Timestamptz) as argument and then returns string representing the
elapsed time.

No need to use a string there; I think exposing the checkpoint start
time is good enough. The conversion of int64 to timestamp[tz] can be
done in SQL (although I'm not sure that exposing the internal bitwise
representation of Interval should be exposed to that extent) [0].
Users can then extract the duration interval using now() - start_time,
which also allows the user to use their own preferred formatting.

The reason for showing the elapsed time rather than exposing the
timestamp directly is in case of checkpoint during shutdown and
end-of-recovery, I am planning to log a message in server logs using
'log_startup_progress_interval' infrastructure which displays elapsed
time. So just to match both of the behaviour I am displaying elapsed
time here. I feel that elapsed time gives a quicker feel of the
progress. Kindly let me know if you still feel just exposing the
timestamp is better than showing the elapsed time.
'checkpoint start location' (lsn = uint64) - I feel we
cannot use progress parameters for this case. As assigning uint64 to
int64 type would be an issue for larger values and can lead to hidden
bugs.

Not necessarily - we can (without much trouble) do a bitwise cast from
uint64 to int64, and then (in SQL) cast it back to a pg_lsn [1]. Not
very elegant, but it works quite well.

[1] SELECT '0/0'::pg_lsn +
        ((CASE WHEN stat.my_int64 < 0
               THEN pow(2::numeric, 64::numeric)::numeric
               ELSE 0::numeric END) + stat.my_int64::numeric)
    FROM (SELECT -2::bigint /* 0xFFFFFFFF/FFFFFFFE */ AS my_bigint_lsn)
         AS stat(my_int64);

Thanks for sharing. It works. I will include this in the next patch.
+/* Kinds of checkpoint (as advertised via PROGRESS_CHECKPOINT_KIND) */
+#define PROGRESS_CHECKPOINT_KIND_WAL 0
+#define PROGRESS_CHECKPOINT_KIND_TIME 1
+#define PROGRESS_CHECKPOINT_KIND_FORCE 2
+#define PROGRESS_CHECKPOINT_KIND_UNKNOWN 3
On what basis have you classified the above into the various types of
checkpoints? AFAIK, the first two types are based on what triggered
the checkpoint (whether it was the checkpoint_timeout or max_wal_size
setting) while the third type indicates the forced checkpoint that can
happen when the checkpoint is triggered for various reasons, e.g.
during createdb or dropdb etc. It is quite possible that both the
PROGRESS_CHECKPOINT_KIND_TIME and PROGRESS_CHECKPOINT_KIND_FORCE flags
are set for the checkpoint because multiple checkpoint requests are
processed in one go, so what type of checkpoint would that be?
+	 */
+	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY)) == 0)
+	{
+		pgstat_progress_start_command(PROGRESS_COMMAND_CHECKPOINT, InvalidOid);
+		checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_PHASE,
+										 PROGRESS_CHECKPOINT_PHASE_INIT);
+		if (flags & CHECKPOINT_CAUSE_XLOG)
+			checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND,
+											 PROGRESS_CHECKPOINT_KIND_WAL);
+		else if (flags & CHECKPOINT_CAUSE_TIME)
+			checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND,
+											 PROGRESS_CHECKPOINT_KIND_TIME);
+		else if (flags & CHECKPOINT_FORCE)
+			checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND,
+											 PROGRESS_CHECKPOINT_KIND_FORCE);
+		else
+			checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND,
+											 PROGRESS_CHECKPOINT_KIND_UNKNOWN);
+	}
+}
--
With Regards,
Ashutosh Sharma.
On Thu, Feb 10, 2022 at 12:23 PM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
We need at least a trace of the number of buffers to sync (num_to_scan) before the checkpoint start, instead of just emitting the stats at the end.

Bharat, it would be good to show the buffers synced counter and the total buffers to sync, checkpointer pid, substep it is running, whether it is on target for completion, checkpoint_Reason
(manual/times/forced). BufferSync has several variables tracking the sync progress locally, and we may need some refactoring here.

I agree to provide the above mentioned information as part of showing the
progress of the current checkpoint operation. I am currently looking into
the code to know if any other information can be added.

Here is the initial patch to show the progress of checkpoint through
pg_stat_progress_checkpoint view. Please find the attachment.

The information added to this view are pid - process ID of a
CHECKPOINTER process, kind - kind of checkpoint indicates the reason
for checkpoint (values can be wal, time or force), phase - indicates
the current phase of checkpoint operation, total_buffer_writes - total
number of buffers to be written, buffers_processed - number of buffers
processed, buffers_written - number of buffers written,
total_file_syncs - total number of files to be synced, files_synced -
number of files synced.

There are many operations that happen as part of a checkpoint. For each of
these operations I am updating the phase field of
pg_stat_progress_checkpoint view. The values supported for this field
are initializing, checkpointing replication slots, checkpointing
snapshots, checkpointing logical rewrite mappings, checkpointing CLOG
pages, checkpointing CommitTs pages, checkpointing SUBTRANS pages,
checkpointing MULTIXACT pages, checkpointing SLRU pages, checkpointing
buffers, performing sync requests, performing two phase checkpoint,
recycling old XLOG files and Finalizing. In case of checkpointing
buffers phase, the fields total_buffer_writes, buffers_processed and
buffers_written shows the detailed progress of writing buffers. In
case of performing sync requests phase, the fields total_file_syncs
and files_synced shows the detailed progress of syncing files. In
other phases, only the phase field is getting updated and it is
difficult to show the progress because we do not get the total number
of files count without traversing the directory. It is not worth
calculating that as it affects the performance of the checkpoint. I also
gave a thought to just mentioning the number of files processed, but this
won't give meaningful progress information (it can be treated as
statistics). Hence just updating the phase field in those scenarios.

Apart from the above fields, I am planning to add a few more fields to the
view in the next patch. That is, process ID of the backend process
which triggered a CHECKPOINT command, checkpoint start location, a field
to indicate whether it is a checkpoint or restartpoint and elapsed
time of the checkpoint operation. Please share your thoughts. I would
be happy to add any other information that contributes to showing the
progress of checkpoint.

As per the discussion in this thread, there should be some mechanism
to show the progress of checkpoint during shutdown and end-of-recovery
cases as we cannot access pg_stat_progress_checkpoint in those cases.
I am working on this to use log_startup_progress_interval mechanism to
log the progress in the server logs.

Kindly review the patch and share your thoughts.

On Fri, Jan 28, 2022 at 12:24 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Fri, Jan 21, 2022 at 11:07 AM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:

I think the right choice to solve the *general* problem is the
mentioned pg_stat_progress_checkpoints.We may want to *additionally* have the ability to log the progress
specifically for the special cases when we're not able to use that
view. And in those case, we can perhaps just use the existing
log_startup_progress_interval parameter for this as well -- at least
for the startup checkpoint.

+1
We need at least a trace of the number of buffers to sync (num_to_scan) before the checkpoint start, instead of just emitting the stats at the end.
Bharat, it would be good to show the buffers synced counter and the total buffers to sync, checkpointer pid, substep it is running, whether it is on target for completion, checkpoint_Reason
(manual/times/forced). BufferSync has several variables tracking the sync progress locally, and we may need some refactoring here.

I agree to provide the above mentioned information as part of showing the
progress of current checkpoint operation. I am currently looking into
the code to know if any other information can be added.

As suggested in the other thread by Julien, I'm changing the subject
of this thread to reflect the discussion.

Regards,
Bharath Rupireddy.
On Tue, 22 Feb 2022 at 07:39, Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
[...] The reason for showing the elapsed time rather than exposing the
timestamp directly is in case of checkpoint during shutdown and
end-of-recovery, I am planning to log a message in server logs using
'log_startup_progress_interval' infrastructure which displays elapsed
time. So just to match both of the behaviour I am displaying elapsed
time here. I feel that elapsed time gives a quicker feel of the
progress. Kindly let me know if you still feel just exposing the
timestamp is better than showing the elapsed time.
At least for pg_stat_progress_checkpoint, storing only a timestamp in
the pg_stat storage (instead of repeatedly updating the field as a
duration) seems to provide much more precise measures of 'time
elapsed' for other sessions if one step of the checkpoint is taking a
long time.
I understand the want to integrate the log-based reporting in the same
API, but I don't think that is necessarily the right approach:
pg_stat_progress_* has low-overhead infrastructure specifically to
ensure that most tasks will not run much slower while reporting, never
waiting for locks. Logging, however, needs to take locks (if only to
prevent concurrent writes to the output file at a kernel level) and
thus has a not insignificant overhead and thus is not very useful for
precise and very frequent statistics updates.
So, although similar in nature, I don't think it is smart to use the
exact same infrastructure between pgstat_progress*-based reporting and
log-based progress reporting, especially if your logging-based
progress reporting is not intended to be a debugging-only
configuration option similar to log_min_messages=DEBUG[1..5].
- Matthias
I will make use of pgstat_progress_update_multi_param() in the next
patch to replace multiple calls to checkpoint_progress_update_param().
Fixed.
---
The other progress tables use [type]_total as column names for counter
targets (e.g. backup_total for backup_streamed, heap_blks_total for
heap_blks_scanned, etc.). I think that `buffers_total` and
`files_total` would be better column names.

I agree and I will update this in the next patch.

Fixed.
---
How about this "The checkpoint is started because max_wal_size is reached".
"The checkpoint is started because checkpoint_timeout expired".
"The checkpoint is started because some operation forced a checkpoint".
I have used the above description. Kindly let me know if any changes
are required.
---
+ <entry><literal>checkpointing CommitTs pages</literal></entry>
CommitTs -> Commit time stamp
I will handle this in the next patch.
Fixed.
---
There are more scenarios where you can have a backend requesting a checkpoint
and waiting for its completion, and there may be more than one backend
concerned, so I don't think that storing only one / the first backend pid is
ok.
Thanks for this information. I am not considering backend_pid.
---
I think all the information should be exposed. Only knowing why the current
checkpoint has been triggered without any further information seems a bit
useless. Think for instance for cases like [1].
I have supported all possible checkpoint kinds. Added
pg_stat_get_progress_checkpoint_kind() to convert the flags (int) to a
string representing a combination of flags and also checking for the
flag update in ImmediateCheckpointRequested() which checks whether
CHECKPOINT_IMMEDIATE flag is set or not. I did not find any other
cases where the flags get changed (which changes the current
checkpoint behaviour) during the checkpoint. Kindly let me know if I
am missing something.
---
I feel 'processes_wiating' aligns more with the naming conventions of
the fields of the existing progress views.

There's at least pg_stat_progress_vacuum.num_dead_tuples. Anyway I don't have
a strong opinion on it, just make sure to correct the typo.
More analysis is required to support this. I am planning to take care
in the next patch.
---
If pg_is_in_recovery() is true, then it's a restartpoint, otherwise it's a
restartpoint if the checkpoint's timeline is different from the current
timeline?
Fixed.
Sharing the v2 patch. Kindly have a look and share your comments.
Thanks & Regards,
Nitin Jadhav
On Tue, Feb 22, 2022 at 12:08 PM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
Attachments:
v2-0001-pg_stat_progress_checkpoint-view.patch (application/octet-stream)
From 9e55ad5e1e1319cc32237e2ba79a097527d07fed Mon Sep 17 00:00:00 2001
From: Nitin Jadhav <nitinjadhav@microsoft.com>
Date: Wed, 23 Feb 2022 13:17:16 +0000
Subject: [PATCH] pg_stat_progress-checkpoint view
---
doc/src/sgml/monitoring.sgml | 357 ++++++++++++++++++++++++++
src/backend/access/transam/xlog.c | 132 ++++++++++
src/backend/catalog/system_views.sql | 36 +++
src/backend/postmaster/checkpointer.c | 14 +-
src/backend/storage/buffer/bufmgr.c | 7 +
src/backend/storage/sync/sync.c | 6 +
src/backend/utils/adt/pgstatfuncs.c | 59 +++++
src/include/access/xlog.h | 4 +
src/include/catalog/pg_proc.dat | 9 +
src/include/commands/progress.h | 28 ++
src/include/utils/backend_progress.h | 3 +-
src/test/regress/expected/rules.out | 33 +++
12 files changed, 684 insertions(+), 4 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index bf7625d988..1e7112eb78 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -401,6 +401,13 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
See <xref linkend='copy-progress-reporting'/>.
</entry>
</row>
+
+ <row>
+ <entry><structname>pg_stat_progress_checkpoint</structname><indexterm><primary>pg_stat_progress_checkpoint</primary></indexterm></entry>
+ <entry>One row only, showing the progress of the <command>CHECKPOINT</command>.
+ See <xref linkend='checkpoint-progress-reporting'/>.
+ </entry>
+ </row>
</tbody>
</tgroup>
</table>
@@ -6895,6 +6902,356 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
</table>
</sect2>
+ <sect2 id="checkpoint-progress-reporting">
+ <title>CHECKPOINT Progress Reporting</title>
+
+ <indexterm>
+ <primary>pg_stat_progress_checkpoint</primary>
+ </indexterm>
+
+ <para>
+ Whenever the checkpoint operation is running, the
+ <structname>pg_stat_progress_checkpoint</structname> view will contain a
+ single row indicating the progress of the checkpoint. The tables below
+ describe the information that will be reported and provide information about
+ how to interpret it.
+ </para>
+
+ <table id="pg-stat-progress-checkpoint-view" xreflabel="pg_stat_progress_checkpoint">
+ <title><structname>pg_stat_progress_checkpoint</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>pid</structfield> <type>integer</type>
+ </para>
+ <para>
+ Process ID of a CHECKPOINTER process.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of checkpoint. See <xref linkend="checkpoint-types"/>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>kind</structfield> <type>text</type>
+ </para>
+ <para>
+ Kind of checkpoint. See <xref linkend="checkpoint-kinds"/>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>start_lsn</structfield> <type>text</type>
+ </para>
+ <para>
+ The checkpoint start location.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>elapsed_time</structfield> <type>text</type>
+ </para>
+ <para>
+ Elapsed time of the checkpoint.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>phase</structfield> <type>text</type>
+ </para>
+ <para>
+ Current processing phase. See <xref linkend="checkpoint-phases"/>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>buffers_total</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Total number of buffers to be written. This is estimated and reported
+ as of the beginning of the buffer-writing phase.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>buffers_processed</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers processed. This counter advances each time a buffer
+ is processed, and will eventually become equal to
+ <literal>buffers_total</literal> when the checkpoint is complete.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>buffers_written</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers written. This counter advances only when a targeted
+ buffer is actually written. Note that some buffers are processed but
+ turn out not to need writing, so this count will always be less than
+ or equal to <literal>buffers_total</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>files_total</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Total number of files to be synced. This is estimated and reported as of
+ the beginning of the sync phase.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>files_synced</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of files synced. This counter advances when a targeted file is
+ synced, and will eventually become equal to
+ <literal>files_total</literal> when the checkpoint is complete.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <table id="checkpoint-types">
+ <title>CHECKPOINT types</title>
+ <tgroup cols="2">
+ <colspec colname="col1" colwidth="1*"/>
+ <colspec colname="col2" colwidth="2*"/>
+ <thead>
+ <row>
+ <entry>Types</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry><literal>checkpoint</literal></entry>
+ <entry>
+ The current operation is a checkpoint.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>restartpoint</literal></entry>
+ <entry>
+ The current operation is a restartpoint.
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <table id="checkpoint-kinds">
+ <title>CHECKPOINT kinds</title>
+ <tgroup cols="2">
+ <colspec colname="col1" colwidth="1*"/>
+ <colspec colname="col2" colwidth="2*"/>
+ <thead>
+ <row>
+ <entry>Kinds</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry><literal>shutdown</literal></entry>
+ <entry>
+ The checkpoint is being performed for a server shutdown.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>end-of-recovery</literal></entry>
+ <entry>
+ The checkpoint is being performed at the end of recovery.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>immediate</literal></entry>
+ <entry>
+ The checkpoint happens without delays.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>force</literal></entry>
+ <entry>
+ The checkpoint is started because some operation forced a checkpoint.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>flush all</literal></entry>
+ <entry>
+ The checkpoint flushes all pages, including those belonging to unlogged
+ tables.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>wait</literal></entry>
+ <entry>
+ The requesting process waits for the checkpoint to complete before
+ returning.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>requested</literal></entry>
+ <entry>
+ A checkpoint has been explicitly requested.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>wal</literal></entry>
+ <entry>
+ The checkpoint is started because <literal>max_wal_size</literal> is
+ reached.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>time</literal></entry>
+ <entry>
+ The checkpoint is started because <literal>checkpoint_timeout</literal>
+ expired.
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <table id="checkpoint-phases">
+ <title>CHECKPOINT phases</title>
+ <tgroup cols="2">
+ <colspec colname="col1" colwidth="1*"/>
+ <colspec colname="col2" colwidth="2*"/>
+ <thead>
+ <row>
+ <entry>Phase</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry><literal>initializing</literal></entry>
+ <entry>
+ The CHECKPOINTER process is preparing to begin the checkpoint operation.
+ This phase is expected to be very brief.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing replication slots</literal></entry>
+ <entry>
+ The CHECKPOINTER process is currently flushing all the replication slots
+ to disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing snapshots</literal></entry>
+ <entry>
+ The CHECKPOINTER process is currently removing all the serialized
+ snapshots that are not required anymore.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing logical rewrite mappings</literal></entry>
+ <entry>
+ The CHECKPOINTER process is currently removing/flushing the logical
+ rewrite mappings.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing CLOG pages</literal></entry>
+ <entry>
+ The CHECKPOINTER process is currently writing CLOG pages to disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing CommitTs pages</literal></entry>
+ <entry>
+ The CHECKPOINTER process is currently writing commit timestamp pages
+ to disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing SUBTRANS pages</literal></entry>
+ <entry>
+ The CHECKPOINTER process is currently writing SUBTRANS pages to disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing MULTIXACT pages</literal></entry>
+ <entry>
+ The CHECKPOINTER process is currently writing MULTIXACT pages to disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing SLRU pages</literal></entry>
+ <entry>
+ The CHECKPOINTER process is currently writing SLRU pages to disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing buffers</literal></entry>
+ <entry>
+ The CHECKPOINTER process is currently writing buffers to disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>performing sync requests</literal></entry>
+ <entry>
+ The CHECKPOINTER process is currently performing sync requests.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>performing two phase checkpoint</literal></entry>
+ <entry>
+ The CHECKPOINTER process is currently performing the two-phase
+ checkpoint.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>recycling old XLOG files</literal></entry>
+ <entry>
+ The CHECKPOINTER process is currently recycling old XLOG files.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>finalizing</literal></entry>
+ <entry>
+ The CHECKPOINTER process is finalizing the checkpoint operation.
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ </sect2>
+
</sect1>
<sect1 id="dynamic-trace">
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0d2bd7a357..e5a19f3cc3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -65,6 +65,7 @@
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
#include "catalog/pg_database.h"
+#include "commands/progress.h"
#include "common/controldata_utils.h"
#include "executor/instrument.h"
#include "miscadmin.h"
@@ -653,6 +654,9 @@ static bool updateMinRecoveryPoint = true;
static int MyLockNo = 0;
static bool holdingAllLocks = false;
+/* Copy of checkpoint flags. */
+static int ckpt_flags = 0;
+
#ifdef WAL_DEBUG
static MemoryContext walDebugCxt = NULL;
#endif
@@ -6296,6 +6300,9 @@ CreateCheckPoint(int flags)
MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
+ /* Prepare to report progress of the checkpoint. */
+ checkpoint_progress_start(flags);
+
/*
* Use a critical section to force system panic if we have trouble.
*/
@@ -6394,6 +6401,7 @@ CreateCheckPoint(int flags)
curInsert += SizeOfXLogShortPHD;
}
checkPoint.redo = curInsert;
+ checkpoint_progress_update_param(PROGRESS_CHECKPOINT_LSN, checkPoint.redo);
/*
* Here we update the shared RedoRecPtr for future XLogInsert calls; this
@@ -6629,8 +6637,12 @@ CreateCheckPoint(int flags)
KeepLogSeg(recptr, &_logSegNo);
}
_logSegNo--;
+ checkpoint_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_OLD_XLOG_RECYCLE);
RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr,
checkPoint.ThisTimeLineID);
+ checkpoint_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_FINALIZE);
/*
* Make more log segments if needed. (Do this after recycling old log
@@ -6652,6 +6664,9 @@ CreateCheckPoint(int flags)
/* Real work is done; log and update stats. */
LogCheckpointEnd(false);
+ /* Stop reporting progress of the checkpoint. */
+ checkpoint_progress_end();
+
/* Reset the process title */
update_checkpoint_display(flags, false, true);
@@ -6808,29 +6823,60 @@ static void
CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
CheckPointRelationMap();
+
+ checkpoint_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_REPLI_SLOTS);
CheckPointReplicationSlots();
+
+ checkpoint_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_SNAPSHOTS);
CheckPointSnapBuild();
+
+ checkpoint_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_LOGICAL_REWRITE_MAPPINGS);
CheckPointLogicalRewriteHeap();
CheckPointReplicationOrigin();
/* Write out all dirty data in SLRUs and the main buffer pool */
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
+
+ checkpoint_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_CLOG_PAGES);
CheckPointCLOG();
+
+ checkpoint_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_COMMITTS_PAGES);
CheckPointCommitTs();
+
+ checkpoint_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_SUBTRANS_PAGES);
CheckPointSUBTRANS();
+
+ checkpoint_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_MULTIXACT_PAGES);
CheckPointMultiXact();
+
+ checkpoint_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_SLRU_PAGES);
CheckPointPredicate();
+
+ checkpoint_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_BUFFERS);
CheckPointBuffers(flags);
/* Perform all queued up fsyncs */
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
+ checkpoint_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_FILE_SYNC);
ProcessSyncRequests();
CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();
/* We deliberately delay 2PC checkpointing as long as possible */
+ checkpoint_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_TWO_PHASE);
CheckPointTwoPhase(checkPointRedo);
}
@@ -6977,6 +7023,9 @@ CreateRestartPoint(int flags)
MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
+ /* Prepare to report progress of the restartpoint. */
+ checkpoint_progress_start(flags);
+
if (log_checkpoints)
LogCheckpointStart(flags, true);
@@ -7077,7 +7126,11 @@ CreateRestartPoint(int flags)
if (!RecoveryInProgress())
replayTLI = XLogCtl->InsertTimeLineID;
+ checkpoint_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_OLD_XLOG_RECYCLE);
RemoveOldXlogFiles(_logSegNo, RedoRecPtr, endptr, replayTLI);
+ checkpoint_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_FINALIZE);
/*
* Make more log segments if needed. (Do this after recycling old log
@@ -7098,6 +7151,9 @@ CreateRestartPoint(int flags)
/* Real work is done; log and update stats. */
LogCheckpointEnd(true);
+ /* Stop reporting progress of the restartpoint. */
+ checkpoint_progress_end();
+
/* Reset the process title */
update_checkpoint_display(flags, true, true);
@@ -9197,3 +9253,79 @@ SetWalWriterSleeping(bool sleeping)
XLogCtl->WalWriterSleeping = sleeping;
SpinLockRelease(&XLogCtl->info_lck);
}
+
+/*
+ * Start reporting progress of the checkpoint.
+ */
+void
+checkpoint_progress_start(int flags)
+{
+ /* In bootstrap mode, we don't actually record anything. */
+ if (IsBootstrapProcessingMode())
+ return;
+
+ ckpt_flags = flags;
+
+ /*
+ * Cannot access pg_stat_progress_checkpoint view in case of checkpoint
+ * during shutdown and end-of-recovery.
+ */
+ if ((ckpt_flags &
+ (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY)) == 0)
+ {
+ const int index[] = {
+ PROGRESS_CHECKPOINT_TIMELINE,
+ PROGRESS_CHECKPOINT_KIND,
+ PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_START_TIMESTAMP
+ };
+ int64 val[4];
+
+ pgstat_progress_start_command(PROGRESS_COMMAND_CHECKPOINT, InvalidOid);
+
+ val[0] = XLogCtl->InsertTimeLineID;
+ val[1] = flags;
+ val[2] = PROGRESS_CHECKPOINT_PHASE_INIT;
+ val[3] = CheckpointStats.ckpt_start_t;
+
+ pgstat_progress_update_multi_param(4, index, val);
+ }
+}
+
+/*
+ * Update index'th member in st_progress_param[] array with the latest value.
+ */
+void
+checkpoint_progress_update_param(int index, int64 val)
+{
+ /* In bootstrap mode, we don't actually record anything. */
+ if (IsBootstrapProcessingMode())
+ return;
+
+ /*
+ * Cannot access pg_stat_progress_checkpoint view in case of checkpoint
+ * during shutdown and end-of-recovery.
+ */
+ if ((ckpt_flags &
+ (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY)) == 0)
+ pgstat_progress_update_param(index, val);
+}
+
+/*
+ * Stop reporting progress of the checkpoint.
+ */
+void
+checkpoint_progress_end(void)
+{
+ /* In bootstrap mode, we don't actually record anything. */
+ if (IsBootstrapProcessingMode())
+ return;
+
+ /*
+ * Cannot access pg_stat_progress_checkpoint view in case of checkpoint
+ * during shutdown and end-of-recovery.
+ */
+ if ((ckpt_flags &
+ (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY)) == 0)
+ pgstat_progress_end_command();
+}
\ No newline at end of file
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 3cb69b1f87..6dc5b0feb6 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1286,3 +1286,39 @@ CREATE VIEW pg_stat_subscription_workers AS
FROM pg_subscription_rel) sr,
LATERAL pg_stat_get_subscription_worker(sr.subid, sr.relid) w
JOIN pg_subscription s ON (w.subid = s.oid);
+
+CREATE VIEW pg_stat_progress_checkpoint AS
+ SELECT
+ S.pid AS pid,
+ pg_stat_get_progress_checkpoint_type(S.param1) AS type,
+ pg_stat_get_progress_checkpoint_kind(S.param2) AS kind,
+ ( SELECT '0/0'::pg_lsn +
+ ((CASE
+ WHEN stat.lsn_int64 < 0 THEN pow(2::numeric, 64::numeric)::numeric
+ ELSE 0::numeric
+ END) +
+ stat.lsn_int64::numeric)
+ FROM (SELECT s.param3::bigint) AS stat(lsn_int64)
+ ) AS start_lsn,
+ pg_stat_get_progress_checkpoint_elapsed(S.param4) AS elapsed_time,
+ CASE S.param5 WHEN 0 THEN 'initializing'
+ WHEN 1 THEN 'checkpointing replication slots'
+ WHEN 2 THEN 'checkpointing snapshots'
+ WHEN 3 THEN 'checkpointing logical rewrite mappings'
+ WHEN 4 THEN 'checkpointing CLOG pages'
+ WHEN 5 THEN 'checkpointing CommitTs pages'
+ WHEN 6 THEN 'checkpointing SUBTRANS pages'
+ WHEN 7 THEN 'checkpointing MULTIXACT pages'
+ WHEN 8 THEN 'checkpointing SLRU pages'
+ WHEN 9 THEN 'checkpointing buffers'
+ WHEN 10 THEN 'performing sync requests'
+ WHEN 11 THEN 'performing two phase checkpoint'
+ WHEN 12 THEN 'recycling old XLOG files'
+ WHEN 13 THEN 'finalizing'
+ END AS phase,
+ S.param6 AS buffers_total,
+ S.param7 AS buffers_processed,
+ S.param8 AS buffers_written,
+ S.param9 AS files_total,
+ S.param10 AS files_synced
+ FROM pg_stat_get_progress_info('CHECKPOINT') AS S;
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 4488e3a443..2ff7e77b3e 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -39,6 +39,7 @@
#include "access/xlog.h"
#include "access/xlog_internal.h"
#include "access/xlogrecovery.h"
+#include "commands/progress.h"
#include "libpq/pqsignal.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -163,7 +164,7 @@ static pg_time_t last_xlog_switch_time;
static void HandleCheckpointerInterrupts(void);
static void CheckArchiveTimeout(void);
static bool IsCheckpointOnSchedule(double progress);
-static bool ImmediateCheckpointRequested(void);
+static bool ImmediateCheckpointRequested(int flags);
static bool CompactCheckpointerRequestQueue(void);
static void UpdateSharedMemoryConfig(void);
@@ -658,16 +659,23 @@ CheckArchiveTimeout(void)
* there is one pending behind it.)
*/
static bool
-ImmediateCheckpointRequested(void)
+ImmediateCheckpointRequested(int flags)
{
volatile CheckpointerShmemStruct *cps = CheckpointerShmem;
+ int updated_flags = flags;
/*
* We don't need to acquire the ckpt_lck in this case because we're only
* looking at a single flag bit.
*/
if (cps->ckpt_flags & CHECKPOINT_IMMEDIATE)
+ {
+ updated_flags |= CHECKPOINT_IMMEDIATE;
+ checkpoint_progress_update_param(PROGRESS_CHECKPOINT_KIND,
+ updated_flags);
return true;
+ }
+
return false;
}
@@ -699,7 +707,7 @@ CheckpointWriteDelay(int flags, double progress)
*/
if (!(flags & CHECKPOINT_IMMEDIATE) &&
!ShutdownRequestPending &&
- !ImmediateCheckpointRequested() &&
+ !ImmediateCheckpointRequested(flags) &&
IsCheckpointOnSchedule(progress))
{
if (ConfigReloadPending)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f5459c68f8..156130ef43 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -38,6 +38,7 @@
#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/storage.h"
+#include "commands/progress.h"
#include "executor/instrument.h"
#include "lib/binaryheap.h"
#include "miscadmin.h"
@@ -2012,6 +2013,8 @@ BufferSync(int flags)
WritebackContextInit(&wb_context, &checkpoint_flush_after);
TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_scan);
+ checkpoint_progress_update_param(PROGRESS_CHECKPOINT_BUFFERS_TOTAL,
+ num_to_scan);
/*
* Sort buffers that need to be written to reduce the likelihood of random
@@ -2129,6 +2132,8 @@ BufferSync(int flags)
bufHdr = GetBufferDescriptor(buf_id);
num_processed++;
+ checkpoint_progress_update_param(PROGRESS_CHECKPOINT_BUFFERS_PROCESSED,
+ num_processed);
/*
* We don't need to acquire the lock here, because we're only looking
@@ -2149,6 +2154,8 @@ BufferSync(int flags)
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
PendingCheckpointerStats.m_buf_written_checkpoints++;
num_written++;
+ checkpoint_progress_update_param(PROGRESS_CHECKPOINT_BUFFERS_WRITTEN,
+ num_written);
}
}
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index e161d57761..f0441d4de8 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -23,6 +23,7 @@
#include "access/multixact.h"
#include "access/xlog.h"
#include "access/xlogutils.h"
+#include "commands/progress.h"
#include "commands/tablespace.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -356,6 +357,9 @@ ProcessSyncRequests(void)
/* Now scan the hashtable for fsync requests to process */
absorb_counter = FSYNCS_PER_ABSORB;
hash_seq_init(&hstat, pendingOps);
+ checkpoint_progress_update_param(PROGRESS_CHECKPOINT_FILES_TOTAL,
+ hash_get_num_entries(pendingOps));
+
while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
{
int failures;
@@ -419,6 +423,8 @@ ProcessSyncRequests(void)
longest = elapsed;
total_elapsed += elapsed;
processed++;
+ checkpoint_progress_update_param(PROGRESS_CHECKPOINT_FILES_SYNCED,
+ processed);
if (log_checkpoints)
elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f ms",
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 30e8dfa7c1..32c35f0bd0 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -494,6 +494,8 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
cmdtype = PROGRESS_COMMAND_BASEBACKUP;
else if (pg_strcasecmp(cmd, "COPY") == 0)
cmdtype = PROGRESS_COMMAND_COPY;
+ else if (pg_strcasecmp(cmd, "CHECKPOINT") == 0)
+ cmdtype = PROGRESS_COMMAND_CHECKPOINT;
else
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
@@ -2495,3 +2497,60 @@ pg_stat_get_subscription_worker(PG_FUNCTION_ARGS)
/* Returns the record as Datum */
PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
}
+
+/*
+ * Return checkpoint type (either checkpoint or restartpoint).
+ */
+Datum
+pg_stat_get_progress_checkpoint_type(PG_FUNCTION_ARGS)
+{
+ TimeLineID cur_timeline = (TimeLineID) PG_GETARG_INT64(0);
+
+ if (RecoveryInProgress() || (GetWALInsertionTimeLine() != cur_timeline))
+ PG_RETURN_TEXT_P(CStringGetTextDatum("restartpoint"));
+ else
+ PG_RETURN_TEXT_P(CStringGetTextDatum("checkpoint"));
+}
+
+/*
+ * Return checkpoint kind based on the flags set.
+ */
+Datum
+pg_stat_get_progress_checkpoint_kind(PG_FUNCTION_ARGS)
+{
+ int64 flags = PG_GETARG_INT64(0);
+ char ckpt_kind[MAXPGPATH];
+
+ MemSet(ckpt_kind, 0, MAXPGPATH);
+ snprintf(ckpt_kind, MAXPGPATH, "%s%s%s%s%s%s%s%s%s",
+ (flags == 0) ? "unknown" : "",
+ (flags & CHECKPOINT_IS_SHUTDOWN) ? "shutdown " : "",
+ (flags & CHECKPOINT_END_OF_RECOVERY) ? "end-of-recovery " : "",
+ (flags & CHECKPOINT_IMMEDIATE) ? "immediate " : "",
+ (flags & CHECKPOINT_FORCE) ? "force " : "",
+ (flags & CHECKPOINT_WAIT) ? "wait " : "",
+ (flags & CHECKPOINT_CAUSE_XLOG) ? "wal " : "",
+ (flags & CHECKPOINT_CAUSE_TIME) ? "time " : "",
+ (flags & CHECKPOINT_FLUSH_ALL) ? "flush-all" : "");
+
+ PG_RETURN_TEXT_P(CStringGetTextDatum(ckpt_kind));
+}
+
+/*
+ * Return elapsed time (in seconds) of the checkpoint.
+ */
+Datum
+pg_stat_get_progress_checkpoint_elapsed(PG_FUNCTION_ARGS)
+{
+ TimestampTz start = PG_GETARG_INT64(0);
+ TimestampTz now = GetCurrentTimestamp();
+ char elapsed_time[NAMEDATALEN];
+ long secs;
+ int usecs;
+
+ TimestampDifference(start, now, &secs, &usecs);
+ snprintf(elapsed_time, sizeof(elapsed_time), "%ld.%02d s", secs,
+ (usecs / 10000));
+
+ PG_RETURN_TEXT_P(CStringGetTextDatum(elapsed_time));
+}
\ No newline at end of file
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 4b45ac64db..c0a3f57689 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -288,6 +288,10 @@ extern void do_pg_abort_backup(int code, Datum arg);
extern void register_persistent_abort_backup_handler(void);
extern SessionBackupState get_backup_status(void);
+extern void checkpoint_progress_start(int flags);
+extern void checkpoint_progress_update_param(int index, int64 val);
+extern void checkpoint_progress_end(void);
+
/* File path names (all relative to $PGDATA) */
#define RECOVERY_SIGNAL_FILE "recovery.signal"
#define STANDBY_SIGNAL_FILE "standby.signal"
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 7f1ee97f55..40f993c9b1 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5353,6 +5353,15 @@
proargmodes => '{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
proargnames => '{cmdtype,pid,datid,relid,param1,param2,param3,param4,param5,param6,param7,param8,param9,param10,param11,param12,param13,param14,param15,param16,param17,param18,param19,param20}',
prosrc => 'pg_stat_get_progress_info' },
+{ oid => '560', descr => 'return checkpoint type',
+ proname => 'pg_stat_get_progress_checkpoint_type', prorettype => 'text',
+ proargtypes => 'int8', prosrc => 'pg_stat_get_progress_checkpoint_type' },
+{ oid => '561', descr => 'return checkpoint kind',
+ proname => 'pg_stat_get_progress_checkpoint_kind', prorettype => 'text',
+ proargtypes => 'int8', prosrc => 'pg_stat_get_progress_checkpoint_kind' },
+{ oid => '562', descr => 'return elapsed time of the checkpoint',
+ proname => 'pg_stat_get_progress_checkpoint_elapsed', prorettype => 'text',
+ proargtypes => 'int8', prosrc => 'pg_stat_get_progress_checkpoint_elapsed' },
{ oid => '3099',
descr => 'statistics: information about currently active replication',
proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index a28938caf4..19734405ed 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -151,4 +151,32 @@
#define PROGRESS_COPY_TYPE_PIPE 3
#define PROGRESS_COPY_TYPE_CALLBACK 4
+/* Progress parameters for checkpoint */
+#define PROGRESS_CHECKPOINT_TIMELINE 0
+#define PROGRESS_CHECKPOINT_KIND 1
+#define PROGRESS_CHECKPOINT_LSN 2
+#define PROGRESS_CHECKPOINT_START_TIMESTAMP 3
+#define PROGRESS_CHECKPOINT_PHASE 4
+#define PROGRESS_CHECKPOINT_BUFFERS_TOTAL 5
+#define PROGRESS_CHECKPOINT_BUFFERS_PROCESSED 6
+#define PROGRESS_CHECKPOINT_BUFFERS_WRITTEN 7
+#define PROGRESS_CHECKPOINT_FILES_TOTAL 8
+#define PROGRESS_CHECKPOINT_FILES_SYNCED 9
+
+/* Phases of checkpoint (as advertised via PROGRESS_CHECKPOINT_PHASE) */
+#define PROGRESS_CHECKPOINT_PHASE_INIT 0
+#define PROGRESS_CHECKPOINT_PHASE_REPLI_SLOTS 1
+#define PROGRESS_CHECKPOINT_PHASE_SNAPSHOTS 2
+#define PROGRESS_CHECKPOINT_PHASE_LOGICAL_REWRITE_MAPPINGS 3
+#define PROGRESS_CHECKPOINT_PHASE_CLOG_PAGES 4
+#define PROGRESS_CHECKPOINT_PHASE_COMMITTS_PAGES 5
+#define PROGRESS_CHECKPOINT_PHASE_SUBTRANS_PAGES 6
+#define PROGRESS_CHECKPOINT_PHASE_MULTIXACT_PAGES 7
+#define PROGRESS_CHECKPOINT_PHASE_SLRU_PAGES 8
+#define PROGRESS_CHECKPOINT_PHASE_BUFFERS 9
+#define PROGRESS_CHECKPOINT_PHASE_FILE_SYNC 10
+#define PROGRESS_CHECKPOINT_PHASE_TWO_PHASE 11
+#define PROGRESS_CHECKPOINT_PHASE_OLD_XLOG_RECYCLE 12
+#define PROGRESS_CHECKPOINT_PHASE_FINALIZE 13
+
#endif
diff --git a/src/include/utils/backend_progress.h b/src/include/utils/backend_progress.h
index 47bf8029b0..02d51fb948 100644
--- a/src/include/utils/backend_progress.h
+++ b/src/include/utils/backend_progress.h
@@ -27,7 +27,8 @@ typedef enum ProgressCommandType
PROGRESS_COMMAND_CLUSTER,
PROGRESS_COMMAND_CREATE_INDEX,
PROGRESS_COMMAND_BASEBACKUP,
- PROGRESS_COMMAND_COPY
+ PROGRESS_COMMAND_COPY,
+ PROGRESS_COMMAND_CHECKPOINT
} ProgressCommandType;
#define PGSTAT_NUM_PROGRESS_PARAM 20
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 1420288d67..fe6e31ca27 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1897,6 +1897,39 @@ pg_stat_progress_basebackup| SELECT s.pid,
s.param4 AS tablespaces_total,
s.param5 AS tablespaces_streamed
FROM pg_stat_get_progress_info('BASEBACKUP'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20);
+pg_stat_progress_checkpoint| SELECT s.pid,
+ pg_stat_get_progress_checkpoint_type(s.param1) AS type,
+ pg_stat_get_progress_checkpoint_kind(s.param2) AS kind,
+ ( SELECT ('0/0'::pg_lsn + (
+ CASE
+ WHEN (stat.lsn_int64 < 0) THEN pow((2)::numeric, (64)::numeric)
+ ELSE (0)::numeric
+ END + (stat.lsn_int64)::numeric))
+ FROM ( SELECT s.param3) stat(lsn_int64)) AS start_lsn,
+ pg_stat_get_progress_checkpoint_elapsed(s.param4) AS elapsed_time,
+ CASE s.param5
+ WHEN 0 THEN 'initializing'::text
+ WHEN 1 THEN 'checkpointing replication slots'::text
+ WHEN 2 THEN 'checkpointing snapshots'::text
+ WHEN 3 THEN 'checkpointing logical rewrite mappings'::text
+ WHEN 4 THEN 'checkpointing CLOG pages'::text
+ WHEN 5 THEN 'checkpointing CommitTs pages'::text
+ WHEN 6 THEN 'checkpointing SUBTRANS pages'::text
+ WHEN 7 THEN 'checkpointing MULTIXACT pages'::text
+ WHEN 8 THEN 'checkpointing SLRU pages'::text
+ WHEN 9 THEN 'checkpointing buffers'::text
+ WHEN 10 THEN 'performing sync requests'::text
+ WHEN 11 THEN 'performing two phase checkpoint'::text
+ WHEN 12 THEN 'recycling old XLOG files'::text
+ WHEN 13 THEN 'finalizing'::text
+ ELSE NULL::text
+ END AS phase,
+ s.param6 AS buffers_total,
+ s.param7 AS buffers_processed,
+ s.param8 AS buffers_written,
+ s.param9 AS files_total,
+ s.param10 AS files_synced
+ FROM pg_stat_get_progress_info('CHECKPOINT'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20);
pg_stat_progress_cluster| SELECT s.pid,
s.datid,
d.datname,
--
2.25.1
On what basis have you classified the above into the various types of
checkpoints? AFAIK, the first two types are based on what triggered
the checkpoint (whether it was the checkpoint_timeout or max_wal_size
settings), while the third type indicates the forced checkpoint that
can happen when a checkpoint is triggered for various reasons, e.g.
during createdb or dropdb. It is quite possible that both the
PROGRESS_CHECKPOINT_KIND_TIME and PROGRESS_CHECKPOINT_KIND_FORCE flags
are set for the checkpoint because multiple checkpoint requests are
processed in one go, so what type of checkpoint would that be?
My initial understanding was wrong. In the v2 patch I have supported
all values for checkpoint kinds, and I display a string in the
pg_stat_progress_checkpoint view that describes all the bits set in
the checkpoint flags.
On Tue, Feb 22, 2022 at 8:10 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
+/* Kinds of checkpoint (as advertised via PROGRESS_CHECKPOINT_KIND) */
+#define PROGRESS_CHECKPOINT_KIND_WAL 0
+#define PROGRESS_CHECKPOINT_KIND_TIME 1
+#define PROGRESS_CHECKPOINT_KIND_FORCE 2
+#define PROGRESS_CHECKPOINT_KIND_UNKNOWN 3

On what basis have you classified the above into the various types of
checkpoints? AFAIK, the first two types are based on what triggered
the checkpoint (whether it was the checkpoint_timeout or max_wal_size
settings), while the third type indicates the forced checkpoint that
can happen when a checkpoint is triggered for various reasons, e.g.
during createdb or dropdb. It is quite possible that both the
PROGRESS_CHECKPOINT_KIND_TIME and PROGRESS_CHECKPOINT_KIND_FORCE flags
are set for the checkpoint because multiple checkpoint requests are
processed in one go, so what type of checkpoint would that be?

+ */
+ if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY)) == 0)
+ {
+     pgstat_progress_start_command(PROGRESS_COMMAND_CHECKPOINT, InvalidOid);
+     checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_PHASE,
+                                      PROGRESS_CHECKPOINT_PHASE_INIT);
+     if (flags & CHECKPOINT_CAUSE_XLOG)
+         checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND,
+                                          PROGRESS_CHECKPOINT_KIND_WAL);
+     else if (flags & CHECKPOINT_CAUSE_TIME)
+         checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND,
+                                          PROGRESS_CHECKPOINT_KIND_TIME);
+     else if (flags & CHECKPOINT_FORCE)
+         checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND,
+                                          PROGRESS_CHECKPOINT_KIND_FORCE);
+     else
+         checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND,
+                                          PROGRESS_CHECKPOINT_KIND_UNKNOWN);
+ }
+}

--
With Regards,
Ashutosh Sharma.

On Thu, Feb 10, 2022 at 12:23 PM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:

We need at least a trace of the number of buffers to sync (num_to_scan)
before the checkpoint start, instead of just emitting the stats at the
end.
Bharat, it would be good to show the buffers synced counter and the
total buffers to sync, the checkpointer pid, the substep it is
running, whether it is on target for completion, and the checkpoint
reason (manual/timed/forced). BufferSync has several variables
tracking the sync progress locally, and we may need some refactoring
here.

I agree to provide the above mentioned information as part of showing
the progress of the current checkpoint operation. I am currently
looking into the code to know if any other information can be added.

Here is the initial patch to show the progress of a checkpoint through
pg_stat_progress_checkpoint view. Please find the attachment.The information added to this view are pid - process ID of a
CHECKPOINTER process, kind - kind of checkpoint indicates the reason
for checkpoint (values can be wal, time or force), phase - indicates
the current phase of checkpoint operation, total_buffer_writes - total
number of buffers to be written, buffers_processed - number of buffers
processed, buffers_written - number of buffers written,
total_file_syncs - total number of files to be synced, files_synced -
number of files synced.

There are many operations that happen as part of a checkpoint. For each
of these operations I am updating the phase field of the
pg_stat_progress_checkpoint view. The values supported for this field
are initializing, checkpointing replication slots, checkpointing
snapshots, checkpointing logical rewrite mappings, checkpointing CLOG
pages, checkpointing CommitTs pages, checkpointing SUBTRANS pages,
checkpointing MULTIXACT pages, checkpointing SLRU pages, checkpointing
buffers, performing sync requests, performing two phase checkpoint,
recycling old XLOG files and Finalizing. In case of checkpointing
buffers phase, the fields total_buffer_writes, buffers_processed and
buffers_written shows the detailed progress of writing buffers. In
case of performing sync requests phase, the fields total_file_syncs
and files_synced shows the detailed progress of syncing files. In
other phases, only the phase field is getting updated and it is
difficult to show the progress because we do not get the total number
of files count without traversing the directory. It is not worth
calculating that as it affects the performance of the checkpoint. I also
gave a thought to just mentioning the number of files processed, but this
won't give meaningful progress information (it can be treated as
statistics). Hence I am just updating the phase field in those scenarios.

Apart from the above fields, I am planning to add a few more fields to the
view in the next patch. That is, process ID of the backend process
which triggered a CHECKPOINT command, checkpoint start location, a field
to indicate whether it is a checkpoint or restartpoint, and elapsed
time of the checkpoint operation. Please share your thoughts. I would
be happy to add any other information that contributes to showing the
progress of the checkpoint.

As per the discussion in this thread, there should be some mechanism
to show the progress of checkpoint during shutdown and end-of-recovery
cases as we cannot access pg_stat_progress_checkpoint in those cases.
I am working on this to use the log_startup_progress_interval mechanism to
log the progress in the server logs.

Kindly review the patch and share your thoughts.
On Fri, Jan 28, 2022 at 12:24 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Fri, Jan 21, 2022 at 11:07 AM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:

I think the right choice to solve the *general* problem is the
mentioned pg_stat_progress_checkpoints.

We may want to *additionally* have the ability to log the progress
specifically for the special cases when we're not able to use that
view. And in those cases, we can perhaps just use the existing
log_startup_progress_interval parameter for this as well -- at least
for the startup checkpoint.

+1

We need at least a trace of the number of buffers to sync (num_to_scan)
before the checkpoint start, instead of just emitting the stats at the end.

Bharat, it would be good to show the buffers synced counter and the total
buffers to sync, checkpointer pid, substep it is running, whether it is on
target for completion, checkpoint_Reason (manual/times/forced). BufferSync
has several variables tracking the sync progress locally, and we may need
some refactoring here.

I agree to provide the above mentioned information as part of showing the
progress of the current checkpoint operation. I am currently looking into
the code to know if any other information can be added.

As suggested in the other thread by Julien, I'm changing the subject
of this thread to reflect the discussion.

Regards,
Bharath Rupireddy.
At least for pg_stat_progress_checkpoint, storing only a timestamp in
the pg_stat storage (instead of repeatedly updating the field as a
duration) seems to provide much more precise measures of 'time
elapsed' for other sessions if one step of the checkpoint is taking a
long time.
I am storing the checkpoint start timestamp in the st_progress_param[]
and this gets set only once during the checkpoint (at the start of the
checkpoint). I have added function
pg_stat_get_progress_checkpoint_elapsed() which calculates the elapsed
time and returns a string. This function gets called whenever
pg_stat_progress_checkpoint view is queried. Kindly refer v2 patch and
share your thoughts.
I understand the want to integrate the log-based reporting in the same
API, but I don't think that is necessarily the right approach:
pg_stat_progress_* has low-overhead infrastructure specifically to
ensure that most tasks will not run much slower while reporting, never
waiting for locks. Logging, however, needs to take locks (if only to
prevent concurrent writes to the output file at a kernel level) and
thus has a not insignificant overhead and thus is not very useful for
precise and very frequent statistics updates.
I understand that the log based reporting is very costly and very
frequent updates are not advisable. I am planning to use the existing
infrastructure of 'log_startup_progress_interval' which provides an
option for the user to configure the interval between each progress
update. Hence it avoids frequent updates to server logs. This approach
is used only during shutdown and end-of-recovery cases because we
cannot access pg_stat_progress_checkpoint view during those scenarios.
So, although similar in nature, I don't think it is smart to use the
exact same infrastructure between pgstat_progress*-based reporting and
log-based progress reporting, especially if your logging-based
progress reporting is not intended to be a debugging-only
configuration option similar to log_min_messages=DEBUG[1..5].
Yes. I agree that we cannot use the same infrastructure for both.
Progress views and server logs have different APIs to report the
progress information. But since both of these are required for the same
purpose, I am planning to use a common function, which improves
code readability compared to calling them separately in all the scenarios. I am
planning to include log based reporting in the next patch. Even after
that if using the same function is not recommended, I am happy to
change.
Thanks & Regards,
Nitin Jadhav
On Wed, Feb 23, 2022 at 12:13 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
On Tue, 22 Feb 2022 at 07:39, Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:

Thank you for sharing the information. 'triggering backend PID' (int)
- can be stored without any problem. 'checkpoint or restartpoint?'
(boolean) - can be stored as a integer value like
PROGRESS_CHECKPOINT_TYPE_CHECKPOINT(0) and
PROGRESS_CHECKPOINT_TYPE_RESTARTPOINT(1). 'elapsed time' (store as
start time in stat_progress, timestamp fits in 64 bits) - As
Timestamptz is of type int64 internally, so we can store the timestamp
value in the progres parameter and then expose a function like
'pg_stat_get_progress_checkpoint_elapsed' which takes int64 (not
Timestamptz) as argument and then returns a string representing the
elapsed time.

No need to use a string there; I think exposing the checkpoint start
time is good enough. The conversion of int64 to timestamp[tz] can be
done in SQL (although I'm not sure that exposing the internal bitwise
representation of Interval should be exposed to that extent) [0].
Users can then extract the duration interval using now() - start_time,
which also allows the user to use their own preferred formatting.

The reason for showing the elapsed time rather than exposing the
timestamp directly is in case of checkpoint during shutdown and
end-of-recovery, I am planning to log a message in server logs using
'log_startup_progress_interval' infrastructure which displays elapsed
time. So just to match both of the behaviour I am displaying elapsed
time here. I feel that elapsed time gives a quicker feel of the
progress. Kindly let me know if you still feel just exposing the
timestamp is better than showing the elapsed time.

At least for pg_stat_progress_checkpoint, storing only a timestamp in
the pg_stat storage (instead of repeatedly updating the field as a
duration) seems to provide much more precise measures of 'time
elapsed' for other sessions if one step of the checkpoint is taking a
long time.

I understand the want to integrate the log-based reporting in the same
API, but I don't think that is necessarily the right approach:
pg_stat_progress_* has low-overhead infrastructure specifically to
ensure that most tasks will not run much slower while reporting, never
waiting for locks. Logging, however, needs to take locks (if only to
prevent concurrent writes to the output file at a kernel level) and
thus has a not insignificant overhead and thus is not very useful for
precise and very frequent statistics updates.

So, although similar in nature, I don't think it is smart to use the
exact same infrastructure between pgstat_progress*-based reporting and
log-based progress reporting, especially if your logging-based
progress reporting is not intended to be a debugging-only
configuration option similar to log_min_messages=DEBUG[1..5].

- Matthias
+ if ((ckpt_flags &
+ (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY)) == 0)
+ {
This code (present at multiple places) looks a little ugly to me; what
we can do instead is add a macro, probably named IsShutdownCheckpoint(),
which does the above check, and use it in all the functions that have
this check. See below:

#define IsShutdownCheckpoint(flags) \
	((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY)) != 0)
And then you may use this macro like:
if (IsBootstrapProcessingMode() || IsShutdownCheckpoint(flags))
return;
This change can be done in all these functions:
+void
+checkpoint_progress_start(int flags)
--
+ */
+void
+checkpoint_progress_update_param(int index, int64 val)
--
+ * Stop reporting progress of the checkpoint.
+ */
+void
+checkpoint_progress_end(void)
==
+
pgstat_progress_start_command(PROGRESS_COMMAND_CHECKPOINT,
InvalidOid);
+
+ val[0] = XLogCtl->InsertTimeLineID;
+ val[1] = flags;
+ val[2] = PROGRESS_CHECKPOINT_PHASE_INIT;
+ val[3] = CheckpointStats.ckpt_start_t;
+
+ pgstat_progress_update_multi_param(4, index, val);
+ }
Any specific reason for recording the timelineID in checkpoint stats
table? Will this ever change in our case?
--
With Regards,
Ashutosh Sharma.
On Wed, Feb 23, 2022 at 6:59 PM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
I will make use of pgstat_progress_update_multi_param() in the next
patch to replace multiple calls to checkpoint_progress_update_param().

Fixed.

---

The other progress tables use [type]_total as column names for counter
targets (e.g. backup_total for backup_streamed, heap_blks_total for
heap_blks_scanned, etc.). I think that `buffers_total` and
`files_total` would be better column names.

I agree and I will update this in the next patch.

Fixed.

---

How about this: "The checkpoint is started because max_wal_size is reached".
"The checkpoint is started because checkpoint_timeout expired".
"The checkpoint is started because some operation forced a checkpoint".

I have used the above description. Kindly let me know if any changes
are required.

---

+ <entry><literal>checkpointing CommitTs pages</literal></entry>

CommitTs -> Commit time stamp

I will handle this in the next patch.

Fixed.

---

There are more scenarios where you can have a backend requesting a checkpoint
and waiting for its completion, and there may be more than one backend
concerned, so I don't think that storing only one / the first backend pid is
ok.

Thanks for this information. I am not considering backend_pid.
---

I think all the information should be exposed. Only knowing why the current
checkpoint has been triggered without any further information seems a bit
useless. Think for instance of cases like [1].

I have supported all possible checkpoint kinds. Added
pg_stat_get_progress_checkpoint_kind() to convert the flags (int) to a
string representing a combination of flags and also checking for the
flag update in ImmediateCheckpointRequested() which checks whether
CHECKPOINT_IMMEDIATE flag is set or not. I did not find any other
cases where the flags get changed (which changes the current
checkpoint behaviour) during the checkpoint. Kindly let me know if I
am missing something.
---

I feel 'processes_wiating' aligns more with the naming conventions of
the fields of the existing progress views.

There's at least pg_stat_progress_vacuum.num_dead_tuples. Anyway I don't have
a strong opinion on it, just make sure to correct the typo.

More analysis is required to support this. I am planning to take care
of it in the next patch.
---

If pg_is_in_recovery() is true, then it's a restartpoint, otherwise it's a
restartpoint if the checkpoint's timeline is different from the current
timeline?

Fixed.

Sharing the v2 patch. Kindly have a look and share your comments.

Thanks & Regards,
Nitin Jadhav

On Tue, Feb 22, 2022 at 12:08 PM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:

Thank you for sharing the information. 'triggering backend PID' (int)
- can be stored without any problem. 'checkpoint or restartpoint?'
(boolean) - can be stored as a integer value like
PROGRESS_CHECKPOINT_TYPE_CHECKPOINT(0) and
PROGRESS_CHECKPOINT_TYPE_RESTARTPOINT(1). 'elapsed time' (store as
start time in stat_progress, timestamp fits in 64 bits) - As
Timestamptz is of type int64 internally, so we can store the timestamp
value in the progres parameter and then expose a function like
'pg_stat_get_progress_checkpoint_elapsed' which takes int64 (not
Timestamptz) as argument and then returns a string representing the
elapsed time.

No need to use a string there; I think exposing the checkpoint start
time is good enough. The conversion of int64 to timestamp[tz] can be
done in SQL (although I'm not sure that exposing the internal bitwise
representation of Interval should be exposed to that extent) [0].
Users can then extract the duration interval using now() - start_time,
which also allows the user to use their own preferred formatting.

The reason for showing the elapsed time rather than exposing the
timestamp directly is in case of checkpoint during shutdown and
end-of-recovery, I am planning to log a message in server logs using
'log_startup_progress_interval' infrastructure which displays elapsed
time. So just to match both of the behaviour I am displaying elapsed
time here. I feel that elapsed time gives a quicker feel of the
progress. Kindly let me know if you still feel just exposing the
timestamp is better than showing the elapsed time.

'checkpoint start location' (lsn = uint64) - I feel we
cannot use progress parameters for this case. As assigning uint64 to
int64 type would be an issue for larger values and can lead to hidden
bugs.

Not necessarily - we can (without much trouble) do a bitwise cast from
uint64 to int64, and then (in SQL) cast it back to a pg_lsn [1]. Not
very elegant, but it works quite well.

[1] SELECT '0/0'::pg_lsn + ((CASE WHEN stat.my_int64 < 0 THEN
pow(2::numeric, 64::numeric)::numeric ELSE 0::numeric END) +
stat.my_int64::numeric) FROM (SELECT -2::bigint /* 0xFFFFFFFF/FFFFFFFE
*/ AS my_bigint_lsn) AS stat(my_int64);

Thanks for sharing. It works. I will include this in the next patch.
On Sat, Feb 19, 2022 at 11:02 AM Julien Rouhaud <rjuju123@gmail.com> wrote:

Hi,
On Fri, Feb 18, 2022 at 08:07:05PM +0530, Nitin Jadhav wrote:
The backend_pid contains a valid value only during
the CHECKPOINT command issued by the backend explicitly, otherwise the
value will be 0. We may have to add an additional field to
'CheckpointerShmemStruct' to hold the backend pid. The backend
requesting the checkpoint will update its pid to this structure.
Kindly let me know if you still feel the backend_pid field is not
necessary.

There are more scenarios where you can have a backend requesting a checkpoint
and waiting for its completion, and there may be more than one backend
concerned, so I don't think that storing only one / the first backend pid is
ok.

And also while looking at the patch I see there's the same problem that I
mentioned in the previous thread, which is that the effective flags can be
updated once the checkpoint started, and as-is the view won't reflect that. It
also means that you can't simply display one of wal, time or force but a
possible combination of the flags (including the one not handled in v1).

If I understand the above comment properly, it has 2 points. The first is
to display the combination of flags rather than just displaying wal,
time or force - The idea behind this is to just let the user know the
reason for checkpointing. That is, the checkpoint is started because
max_wal_size is reached or checkpoint_timeout expired or explicitly
issued CHECKPOINT command. The other flags like CHECKPOINT_IMMEDIATE,
CHECKPOINT_WAIT or CHECKPOINT_FLUSH_ALL indicate how the checkpoint
has to be performed. Hence I have not included those in the view. If
it is really required, I would like to modify the code to include
other flags and display the combination.

I think all the information should be exposed. Only knowing why the current
checkpoint has been triggered without any further information seems a bit
useless. Think for instance of cases like [1].

The second point is to reflect
the updated flags in the view. AFAIK, there is a possibility that the
flags get updated during the on-going checkpoint but the reason for
checkpoint (wal, time or force) will remain same for the current
checkpoint. There might be a change in how checkpoint has to be
performed if CHECKPOINT_IMMEDIATE flag is set. If we go with
displaying the combination of flags in the view, then probably we may
have to reflect this in the view.

You can only "upgrade" a checkpoint, but not "downgrade" it. So if for
instance you find both CHECKPOINT_CAUSE_TIME and CHECKPOINT_FORCE (which is
possible) you can easily know which one was the one that triggered the
checkpoint and which one was added later.

Probably a new field named 'processes_wiating' or 'events_waiting' can be
added for this purpose.

Maybe num_process_waiting?
I feel 'processes_wiating' aligns more with the naming conventions of
the fields of the existing progress views.

There's at least pg_stat_progress_vacuum.num_dead_tuples. Anyway I don't have
a strong opinion on it, just make sure to correct the typo.

Probably writing of buffers or syncing files may complete before
pg_is_in_recovery() returns false. But there are some cleanup
operations happen as part of the checkpoint. During this scenario, we
may get a false value for pg_is_in_recovery(). Please refer to the following
piece of code which is present in CreateRestartpoint():

if (!RecoveryInProgress())
    replayTLI = XLogCtl->InsertTimeLineID;

Then maybe we could store the timeline rather than the kind of checkpoint?
You should still be able to compute the information while giving a bit more
information for the same memory usage.

Can you please describe more about how checkpoint/restartpoint can be
confirmed using the timeline id.

If pg_is_in_recovery() is true, then it's a restartpoint, otherwise it's a
restartpoint if the checkpoint's timeline is different from the current
timeline?
I think the change to ImmediateCheckpointRequested() makes no sense.
Before this patch, that function merely inquires whether there's an
immediate checkpoint queued. After this patch, it ... changes a
progress-reporting flag? I think it would make more sense to make the
progress-report flag change in whatever is the place that *requests* an
immediate checkpoint rather than here.
I think the use of capitals in CHECKPOINT and CHECKPOINTER in the
documentation is excessive. (Same for terms such as MULTIXACT and
others in those docs; we typically use those in lowercase when
user-facing; and do we really use term CLOG anymore? Don't we call it
"commit log" nowadays?)
--
Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
"Hay quien adquiere la mala costumbre de ser infeliz" (M. A. Evans)
+ Whenever the checkpoint operation is running, the
+ <structname>pg_stat_progress_checkpoint</structname> view will contain a
+ single row indicating the progress of the checkpoint. The tables below
Maybe it should show a single row, unless the checkpointer isn't running at
all (like in single user mode).
+ Process ID of a CHECKPOINTER process.
It's *the* checkpointer process.
pgstatfuncs.c has a whitespace issue (tab-space).
I suppose the functions should set provolatile.
--
Justin
On Wed, 23 Feb 2022 at 15:24, Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
At least for pg_stat_progress_checkpoint, storing only a timestamp in
the pg_stat storage (instead of repeatedly updating the field as a
duration) seems to provide much more precise measures of 'time
elapsed' for other sessions if one step of the checkpoint is taking a
long time.

I am storing the checkpoint start timestamp in the st_progress_param[]
and this gets set only once during the checkpoint (at the start of the
checkpoint). I have added function
pg_stat_get_progress_checkpoint_elapsed() which calculates the elapsed
time and returns a string. This function gets called whenever
pg_stat_progress_checkpoint view is queried. Kindly refer v2 patch and
share your thoughts.
I dislike the lack of access to the actual value of the checkpoint
start / checkpoint elapsed field.
As a user, if I query the pg_stat_progress_* views, my terminal or
application can easily interpret an `interval` value and cast it to
string, but the opposite is not true: the current implementation for
pg_stat_get_progress_checkpoint_elapsed loses precision. This is why
we use typed numeric fields in effectively all other places instead of
stringified versions of the values: oid fields, counters, etc are all
rendered as bigint in the view, so that no information is lost and
interpretation is trivial.
I understand the want to integrate the log-based reporting in the same
API, but I don't think that is necessarily the right approach:
pg_stat_progress_* has low-overhead infrastructure specifically to
ensure that most tasks will not run much slower while reporting, never
waiting for locks. Logging, however, needs to take locks (if only to
prevent concurrent writes to the output file at a kernel level) and
thus has a not insignificant overhead and thus is not very useful for
precise and very frequent statistics updates.

I understand that the log based reporting is very costly and very
frequent updates are not advisable. I am planning to use the existing
infrastructure of 'log_startup_progress_interval' which provides an
option for the user to configure the interval between each progress
update. Hence it avoids frequent updates to server logs. This approach
is used only during shutdown and end-of-recovery cases because we
cannot access pg_stat_progress_checkpoint view during those scenarios.
I see; but log_startup_progress_interval seems to be exclusively
consumed through the ereport_startup_progress macro. Why put
startup/shutdown logging on the same path as the happy flow of normal
checkpoints?
So, although similar in nature, I don't think it is smart to use the
exact same infrastructure between pgstat_progress*-based reporting and
log-based progress reporting, especially if your logging-based
progress reporting is not intended to be a debugging-only
configuration option similar to log_min_messages=DEBUG[1..5].

Yes. I agree that we cannot use the same infrastructure for both.
Progress views and server logs have different APIs to report the
progress information. But since both of these are required for the same
purpose, I am planning to use a common function, which improves
code readability compared to calling them separately in all the scenarios. I am
planning to include log based reporting in the next patch. Even after
that if using the same function is not recommended, I am happy to
change.
I don't think that checkpoint_progress_update_param(int, uint64) fits
well with the construction of progress log messages, requiring
special-casing / matching the offset numbers to actual fields inside
that single function, which adds unnecessary overhead when compared
against normal and direct calls to the related infrastructure.
I think that, instead of looking to what might at some point be added,
it is better to use the currently available functions instead, and
move to new functions if and when the log-based reporting requires it.
- Matthias
On Wed, 23 Feb 2022 at 14:28, Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
Sharing the v2 patch. Kindly have a look and share your comments.
Thanks for updating.
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
With the new pg_stat_progress_checkpoint, you should also add a
backreference to this progress reporting in the CHECKPOINT sql command
documentation located in checkpoint.sgml, and maybe in wal.sgml and/or
backup.sgml too. See e.g. cluster.sgml around line 195 for an example.
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c

+ImmediateCheckpointRequested(int flags)

    if (cps->ckpt_flags & CHECKPOINT_IMMEDIATE)
+   {
+       updated_flags |= CHECKPOINT_IMMEDIATE;

I don't think that these changes are expected behaviour. Under this
condition, the currently running checkpoint is still not 'immediate',
but it is going to hurry up for a new, actually immediate checkpoint.
Those are different kinds of checkpoint handling; and I don't think
you should modify the reported flags to show that we're going to do
stuff faster than usual. Maybe maintain a separate 'upcoming
checkpoint flags' field instead?
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql

+        ( SELECT '0/0'::pg_lsn +
+                 ((CASE
+                     WHEN stat.lsn_int64 < 0 THEN pow(2::numeric, 64::numeric)::numeric
+                     ELSE 0::numeric
+                   END) +
+                  stat.lsn_int64::numeric)
+            FROM (SELECT s.param3::bigint) AS stat(lsn_int64)
+        ) AS start_lsn,
My LSN select statement was an example that could be run directly in
psql, so you didn't have to embed the SELECT into the view query.
The following should be sufficient (and save the planner a few cycles
otherwise spent in inlining):
+ ('0/0'::pg_lsn +
+ ((CASE
+ WHEN s.param3 < 0 THEN pow(2::numeric,
64::numeric)::numeric
+ ELSE 0::numeric
+ END) +
+ s.param3::numeric)
+ ) AS start_lsn,
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c

+checkpoint_progress_start(int flags)
[...]
+checkpoint_progress_update_param(int index, int64 val)
[...]
+checkpoint_progress_end(void)
+{
+    /* In bootstrap mode, we don't actually record anything. */
+    if (IsBootstrapProcessingMode())
+        return;
Disabling pgstat progress reporting when in bootstrap processing mode
/ startup/end-of-recovery makes very little sense (see upthread) and
should be removed, regardless of whether separate functions stay.
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h

+#define PROGRESS_CHECKPOINT_PHASE_INIT 0
Generally, enum-like values in a stat_progress field are 1-indexed, to
differentiate between empty/uninitialized (0) and states that have
been set by the progress reporting infrastructure.
Kind regards,
Matthias van de Meent
I think the change to ImmediateCheckpointRequested() makes no sense.
Before this patch, that function merely inquires whether there's an
immediate checkpoint queued. After this patch, it ... changes a
progress-reporting flag? I think it would make more sense to make the
progress-report flag change in whatever is the place that *requests* an
immediate checkpoint rather than here.

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c

+ImmediateCheckpointRequested(int flags)

    if (cps->ckpt_flags & CHECKPOINT_IMMEDIATE)
+   {
+       updated_flags |= CHECKPOINT_IMMEDIATE;

I don't think that these changes are expected behaviour. Under this
condition, the currently running checkpoint is still not 'immediate',
but it is going to hurry up for a new, actually immediate checkpoint.
Those are different kinds of checkpoint handling; and I don't think
you should modify the reported flags to show that we're going to do
stuff faster than usual. Maybe maintain a separate 'upcoming
checkpoint flags' field instead?
Thank you Alvaro and Matthias for your views. I understand your point
of not updating the progress-report flag here as it just checks
whether the CHECKPOINT_IMMEDIATE is set or not and takes an action
based on that but it doesn't change the checkpoint flags. I will
modify the code but I am a bit confused here. As per Alvaro, we need
to make the progress-report flag change in whatever is the place that
*requests* an immediate checkpoint. I feel this gives information
about the upcoming checkpoint not the current one. So updating here
provides wrong details in the view. The flags available during
CreateCheckPoint() will remain same for the entire checkpoint
operation and we should show the same information in the view till it
completes. So just removing the above piece of code (modified in
ImmediateCheckpointRequested()) in the patch will make it correct. My
opinion about maintaining a separate field to show upcoming checkpoint
flags is it makes the view complex. Please share your thoughts.
Thanks & Regards,
On Thu, Feb 24, 2022 at 10:45 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
On Wed, 23 Feb 2022 at 14:28, Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:

Sharing the v2 patch. Kindly have a look and share your comments.
Thanks for updating.
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgmlWith the new pg_stat_progress_checkpoint, you should also add a
backreference to this progress reporting in the CHECKPOINT sql command
documentation located in checkpoint.sgml, and maybe in wal.sgml and/or
backup.sgml too. See e.g. cluster.sgml around line 195 for an example.

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c

+ImmediateCheckpointRequested(int flags)

    if (cps->ckpt_flags & CHECKPOINT_IMMEDIATE)
+   {
+       updated_flags |= CHECKPOINT_IMMEDIATE;

I don't think that these changes are expected behaviour. Under this
condition, the currently running checkpoint is still not 'immediate',
but it is going to hurry up for a new, actually immediate checkpoint.
Those are different kinds of checkpoint handling; and I don't think
you should modify the reported flags to show that we're going to do
stuff faster than usual. Maybe maintain a separate 'upcoming
checkpoint flags' field instead?

diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql

+        ( SELECT '0/0'::pg_lsn +
+                 ((CASE
+                     WHEN stat.lsn_int64 < 0 THEN pow(2::numeric, 64::numeric)::numeric
+                     ELSE 0::numeric
+                   END) +
+                  stat.lsn_int64::numeric)
+            FROM (SELECT s.param3::bigint) AS stat(lsn_int64)
+        ) AS start_lsn,

My LSN select statement was an example that could be run directly in
psql, so you didn't have to embed the SELECT into the view query.
The following should be sufficient (and save the planner a few cycles
otherwise spent in inlining):

+    ('0/0'::pg_lsn +
+     ((CASE
+         WHEN s.param3 < 0 THEN pow(2::numeric, 64::numeric)::numeric
+         ELSE 0::numeric
+       END) +
+      s.param3::numeric)
+    ) AS start_lsn,

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c

+checkpoint_progress_start(int flags)
[...]
+checkpoint_progress_update_param(int index, int64 val)
[...]
+checkpoint_progress_end(void)
+{
+    /* In bootstrap mode, we don't actually record anything. */
+    if (IsBootstrapProcessingMode())
+        return;

Disabling pgstat progress reporting when in bootstrap processing mode
/ startup/end-of-recovery makes very little sense (see upthread) and
should be removed, regardless of whether separate functions stay.

diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h

+#define PROGRESS_CHECKPOINT_PHASE_INIT 0

Generally, enum-like values in a stat_progress field are 1-indexed, to
differentiate between empty/uninitialized (0) and states that have
been set by the progress reporting infrastructure.Kind regards,
Matthias van de Meent
Hi,
On Fri, Feb 25, 2022 at 12:23:27AM +0530, Nitin Jadhav wrote:
I think the change to ImmediateCheckpointRequested() makes no sense.
Before this patch, that function merely inquires whether there's an
immediate checkpoint queued. After this patch, it ... changes a
progress-reporting flag? I think it would make more sense to make the
progress-report flag change in whatever is the place that *requests* an
immediate checkpoint rather than here.

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
+ImmediateCheckpointRequested(int flags)
         if (cps->ckpt_flags & CHECKPOINT_IMMEDIATE)
+        {
+            updated_flags |= CHECKPOINT_IMMEDIATE;

I don't think that these changes are expected behaviour. Under this
condition, the currently running checkpoint is still not 'immediate',
but it is going to hurry up for a new, actually immediate checkpoint.
Those are different kinds of checkpoint handling, and I don't think
you should modify the reported flags to show that we're going to do
stuff faster than usual. Maybe maintain a separate 'upcoming
checkpoint flags' field instead?

Thank you Alvaro and Matthias for your views. I understand your point
about not updating the progress-report flag here, as this function just
checks whether CHECKPOINT_IMMEDIATE is set and takes an action based
on that, but it doesn't change the checkpoint flags. I will modify the
code, but I am a bit confused here. As per Alvaro, we need to make the
progress-report flag change in whatever place *requests* an immediate
checkpoint. I feel this gives information about the upcoming
checkpoint, not the current one, so updating there would show wrong
details in the view. The flags available during CreateCheckPoint()
will remain the same for the entire checkpoint operation, and we
should show the same information in the view until it completes.
I'm not sure what Matthias meant, but as far as I know there's no fundamental
difference between checkpoints with and without the CHECKPOINT_IMMEDIATE flag,
and there's also no scheduling for multiple checkpoints.
Yes, the flags will remain the same but checkpoint.c will test both the passed
flags and the shmem flags to see whether a delay should be added or not, which
is the only difference in checkpoint processing for this flag. See the call to
ImmediateCheckpointRequested() which will look at the value in shmem:
/*
* Perform the usual duties and take a nap, unless we're behind schedule,
* in which case we just try to catch up as quickly as possible.
*/
if (!(flags & CHECKPOINT_IMMEDIATE) &&
!ShutdownRequestPending &&
!ImmediateCheckpointRequested() &&
IsCheckpointOnSchedule(progress))
[...]
Thank you Alvaro and Matthias for your views. I understand your point
about not updating the progress-report flag here, as this function just
checks whether CHECKPOINT_IMMEDIATE is set and takes an action based
on that, but it doesn't change the checkpoint flags. I will modify the
code, but I am a bit confused here. As per Alvaro, we need to make the
progress-report flag change in whatever place *requests* an immediate
checkpoint. I feel this gives information about the upcoming
checkpoint, not the current one, so updating there would show wrong
details in the view. The flags available during CreateCheckPoint()
will remain the same for the entire checkpoint operation, and we
should show the same information in the view until it completes. So
just removing the above piece of code (modified in
ImmediateCheckpointRequested()) from the patch will make it correct.
My opinion on maintaining a separate field to show upcoming checkpoint
flags is that it makes the view complex. Please share your thoughts.
I have modified the code accordingly.
---
I think the use of capitals in CHECKPOINT and CHECKPOINTER in the
documentation is excessive.
Fixed. Here the word CHECKPOINT refers to the command/checkpoint
operation. If we treat it as the checkpoint operation, I agree with
using lowercase, but if we treat it as the command, then uppercase is
recommended (refer to
https://www.postgresql.org/docs/14/sql-checkpoint.html). Is it ok to
always use lowercase here?
---
(Same for terms such as MULTIXACT and
others in those docs; we typically use those in lowercase when
user-facing; and do we really use term CLOG anymore? Don't we call it
"commit log" nowadays?)
I have observed the CLOG term in the existing documentation. Anyways I
have changed MULTIXACT to multixact, SUBTRANS to subtransaction and
CLOG to commit log.
---
+     Whenever the checkpoint operation is running, the
+     <structname>pg_stat_progress_checkpoint</structname> view will contain a
+     single row indicating the progress of the checkpoint. The tables below

Maybe it should show a single row, unless the checkpointer isn't running at
all (like in single user mode).

Nice thought. Can we add an additional checkpoint phase like 'Idle'?
It would be set whenever the checkpointer process is running but there
is no ongoing checkpoint. Thoughts?
---
+ Process ID of a CHECKPOINTER process.
It's *the* checkpointer process.
Fixed.
---
pgstatfuncs.c has a whitespace issue (tab-space).
I have verified with 'git diff --check' and also manually. I did not
find any issue. Kindly mention the specific code which has an issue.
---
I suppose the functions should set provolatile.
Fixed.
---
I am storing the checkpoint start timestamp in the st_progress_param[]
and this gets set only once during the checkpoint (at the start of the
checkpoint). I have added function
pg_stat_get_progress_checkpoint_elapsed() which calculates the elapsed
time and returns a string. This function gets called whenever
pg_stat_progress_checkpoint view is queried. Kindly refer to the v2
patch and share your thoughts.

I dislike the lack of access to the actual value of the checkpoint
start / checkpoint elapsed field.

As a user, if I query the pg_stat_progress_* views, my terminal or
application can easily interpret an `interval` value and cast it to
string, but the opposite is not true: the current implementation of
pg_stat_get_progress_checkpoint_elapsed loses precision. This is why
we use typed numeric fields in effectively all other places instead of
stringified versions of the values: oid fields, counters, etc. are all
rendered as bigint in the view, so that no information is lost and
interpretation is trivial.
Displaying start time of the checkpoint.
---
I understand that the log based reporting is very costly and very
frequent updates are not advisable. I am planning to use the existing
infrastructure of 'log_startup_progress_interval' which provides an
option for the user to configure the interval between each progress
update. Hence it avoids frequent updates to server logs. This approach
is used only during shutdown and end-of-recovery cases because we
cannot access pg_stat_progress_checkpoint view during those scenarios.

I see; but log_startup_progress_interval seems to be exclusively
consumed through the ereport_startup_progress macro. Why put
startup/shutdown logging on the same path as the happy flow of normal
checkpoints?
Do you mean that while updating the progress of the checkpoint we
should call pgstat_progress_update_param() and then call
ereport_startup_progress()?
I think that, instead of looking to what might at some point be added,
it is better to use the currently available functions instead, and
move to new functions if and when the log-based reporting requires it.
Makes sense. Removing checkpoint_progress_update_param() and
checkpoint_progress_end(). I would like to concentrate on the
pg_stat_progress_checkpoint view as of now and will consider log-based
reporting later.
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml

With the new pg_stat_progress_checkpoint, you should also add a
backreference to this progress reporting in the CHECKPOINT sql command
documentation located in checkpoint.sgml, and maybe in wal.sgml and/or
backup.sgml too. See e.g. cluster.sgml around line 195 for an example.

I have updated checkpoint.sgml and wal.sgml.
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
+    ( SELECT '0/0'::pg_lsn +
+      ((CASE
+        WHEN stat.lsn_int64 < 0 THEN pow(2::numeric, 64::numeric)::numeric
+        ELSE 0::numeric
+       END) +
+      stat.lsn_int64::numeric)
+      FROM (SELECT s.param3::bigint) AS stat(lsn_int64)
+    ) AS start_lsn,

My LSN select statement was an example that could be run directly in
psql, so you didn't have to embed the SELECT into the view query.
The following should be sufficient (and save the planner a few cycles
otherwise spent in inlining):

+    ('0/0'::pg_lsn +
+      ((CASE
+        WHEN s.param3 < 0 THEN pow(2::numeric, 64::numeric)::numeric
+        ELSE 0::numeric
+       END) +
+      s.param3::numeric)
+    ) AS start_lsn,
Thanks for the suggestion. Fixed.
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
+checkpoint_progress_start(int flags)
[...]
+checkpoint_progress_update_param(int index, int64 val)
[...]
+checkpoint_progress_end(void)
+{
+    /* In bootstrap mode, we don't actually record anything. */
+    if (IsBootstrapProcessingMode())
+        return;

Disabling pgstat progress reporting when in bootstrap processing mode
/ startup/end-of-recovery makes very little sense (see upthread) and
should be removed, regardless of whether separate functions stay.
Removed since log based reporting is not part of the current patch.
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
+#define PROGRESS_CHECKPOINT_PHASE_INIT 0

Generally, enum-like values in a stat_progress field are 1-indexed, to
differentiate between empty/uninitialized (0) and states that have
been set by the progress reporting infrastructure.
Fixed.
Please find the v3 patch attached and share your thoughts.
Thanks & Regards,
Nitin Jadhav
Attachments:
v3-0001-pg_stat_progress_checkpoint-view.patch (application/octet-stream)
From d2e0358e4d3c498afcc7ddaa7a84a7d79e921008 Mon Sep 17 00:00:00 2001
From: Nitin Jadhav <nitinjadhav@microsoft.com>
Date: Fri, 25 Feb 2022 14:47:59 +0000
Subject: [PATCH] pg_stat_progress_checkpoint view
---
doc/src/sgml/monitoring.sgml | 357 +++++++++++++++++++++++++++
doc/src/sgml/ref/checkpoint.sgml | 6 +
doc/src/sgml/wal.sgml | 5 +-
src/backend/access/transam/xlog.c | 79 ++++++
src/backend/catalog/system_views.sql | 35 +++
src/backend/storage/buffer/bufmgr.c | 7 +
src/backend/storage/sync/sync.c | 6 +
src/backend/utils/adt/pgstatfuncs.c | 49 ++++
src/include/catalog/pg_proc.dat | 12 +
src/include/commands/progress.h | 28 +++
src/include/utils/backend_progress.h | 3 +-
src/test/regress/expected/rules.out | 32 +++
12 files changed, 617 insertions(+), 2 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index bf7625d988..6dfc9bc392 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -401,6 +401,13 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
See <xref linkend='copy-progress-reporting'/>.
</entry>
</row>
+
+ <row>
+ <entry><structname>pg_stat_progress_checkpoint</structname><indexterm><primary>pg_stat_progress_checkpoint</primary></indexterm></entry>
+ <entry>One row only, showing the progress of the checkpoint.
+ See <xref linkend='checkpoint-progress-reporting'/>.
+ </entry>
+ </row>
</tbody>
</tgroup>
</table>
@@ -6895,6 +6902,356 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
</table>
</sect2>
+ <sect2 id="checkpoint-progress-reporting">
+ <title>Checkpoint Progress Reporting</title>
+
+ <indexterm>
+ <primary>pg_stat_progress_checkpoint</primary>
+ </indexterm>
+
+ <para>
+ Whenever the checkpoint operation is running, the
+ <structname>pg_stat_progress_checkpoint</structname> view will contain a
+ single row indicating the progress of the checkpoint. The tables below
+ describe the information that will be reported and provide information about
+ how to interpret it.
+ </para>
+
+ <table id="pg-stat-progress-checkpoint-view" xreflabel="pg_stat_progress_checkpoint">
+ <title><structname>pg_stat_progress_checkpoint</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>pid</structfield> <type>integer</type>
+ </para>
+ <para>
+ Process ID of the checkpointer process.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of checkpoint. See <xref linkend="checkpoint-types"/>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>kind</structfield> <type>text</type>
+ </para>
+ <para>
+ Kind of checkpoint. See <xref linkend="checkpoint-kinds"/>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>start_lsn</structfield> <type>text</type>
+ </para>
+ <para>
+ The checkpoint start location.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>start_time</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+       Start time of the checkpoint.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>phase</structfield> <type>text</type>
+ </para>
+ <para>
+ Current processing phase. See <xref linkend="checkpoint-phases"/>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>buffers_total</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Total number of buffers to be written. This is estimated and reported
+ as of the beginning of buffer write operation.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>buffers_processed</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers processed. This counter increases when the targeted
+ buffer is processed. This number will eventually become equal to
+        <literal>buffers_total</literal> when the checkpoint is
+ complete.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>buffers_written</structfield> <type>bigint</type>
+ </para>
+ <para>
+        Number of buffers written. This counter only advances when the targeted
+        buffer is written. Note that some of the buffers are processed but may
+        not need to be written, so this count will always be less than or
+        equal to <literal>buffers_total</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>files_total</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Total number of files to be synced. This is estimated and reported as of
+ the beginning of sync operation.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>files_synced</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of files synced. This counter advances when the targeted file is
+ synced. This number will eventually become equal to
+        <literal>files_total</literal> when the checkpoint is complete.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <table id="checkpoint-types">
+ <title>Checkpoint types</title>
+ <tgroup cols="2">
+ <colspec colname="col1" colwidth="1*"/>
+ <colspec colname="col2" colwidth="2*"/>
+ <thead>
+ <row>
+ <entry>Types</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry><literal>checkpoint</literal></entry>
+ <entry>
+        The current operation is a checkpoint.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>restartpoint</literal></entry>
+ <entry>
+        The current operation is a restartpoint.
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <table id="checkpoint-kinds">
+ <title>Checkpoint kinds</title>
+ <tgroup cols="2">
+ <colspec colname="col1" colwidth="1*"/>
+ <colspec colname="col2" colwidth="2*"/>
+ <thead>
+ <row>
+ <entry>Kinds</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry><literal>shutdown</literal></entry>
+ <entry>
+ The checkpoint is for shutdown.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>end-of-recovery</literal></entry>
+ <entry>
+ The checkpoint is for end-of-recovery.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>immediate</literal></entry>
+ <entry>
+        The checkpoint happens without delays.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>force</literal></entry>
+ <entry>
+ The checkpoint is started because some operation forced a checkpoint.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>flush all</literal></entry>
+ <entry>
+ The checkpoint flushes all pages, including those belonging to unlogged
+ tables.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>wait</literal></entry>
+ <entry>
+ Wait for completion before returning.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>requested</literal></entry>
+ <entry>
+ The checkpoint request has been made.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>wal</literal></entry>
+ <entry>
+ The checkpoint is started because <literal>max_wal_size</literal> is
+ reached.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>time</literal></entry>
+ <entry>
+ The checkpoint is started because <literal>checkpoint_timeout</literal>
+ expired.
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <table id="checkpoint-phases">
+ <title>Checkpoint phases</title>
+ <tgroup cols="2">
+ <colspec colname="col1" colwidth="1*"/>
+ <colspec colname="col2" colwidth="2*"/>
+ <thead>
+ <row>
+ <entry>Phase</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry><literal>initializing</literal></entry>
+ <entry>
+ The checkpointer process is preparing to begin the checkpoint operation.
+ This phase is expected to be very brief.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing replication slots</literal></entry>
+ <entry>
+ The checkpointer process is currently flushing all the replication slots
+ to disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing snapshots</literal></entry>
+ <entry>
+ The checkpointer process is currently removing all the serialized
+ snapshots that are not required anymore.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing logical rewrite mappings</literal></entry>
+ <entry>
+ The checkpointer process is currently removing/flushing the logical
+ rewrite mappings.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing commit log pages</literal></entry>
+ <entry>
+ The checkpointer process is currently writing commit log pages to disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing commit time stamp pages</literal></entry>
+ <entry>
+ The checkpointer process is currently writing commit time stamp pages to
+ disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing subtransaction pages</literal></entry>
+ <entry>
+ The checkpointer process is currently writing subtransaction pages to disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing multixact pages</literal></entry>
+ <entry>
+ The checkpointer process is currently writing multixact pages to disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing SLRU pages</literal></entry>
+ <entry>
+ The checkpointer process is currently writing SLRU pages to disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing buffers</literal></entry>
+ <entry>
+ The checkpointer process is currently writing buffers to disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>performing sync requests</literal></entry>
+ <entry>
+ The checkpointer process is currently performing sync requests.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>performing two phase checkpoint</literal></entry>
+ <entry>
+ The checkpointer process is currently performing two phase checkpoint.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>recycling old XLOG files</literal></entry>
+ <entry>
+ The checkpointer process is currently recycling old XLOG files.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>finalizing</literal></entry>
+ <entry>
+ The checkpointer process is finalizing the checkpoint operation.
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ </sect2>
+
</sect1>
<sect1 id="dynamic-trace">
diff --git a/doc/src/sgml/ref/checkpoint.sgml b/doc/src/sgml/ref/checkpoint.sgml
index 1cebc03d15..a88c76533a 100644
--- a/doc/src/sgml/ref/checkpoint.sgml
+++ b/doc/src/sgml/ref/checkpoint.sgml
@@ -56,6 +56,12 @@ CHECKPOINT
the <link linkend="predefined-roles-table"><literal>pg_checkpointer</literal></link>
role can call <command>CHECKPOINT</command>.
</para>
+
+ <para>
+ The checkpointer process running the checkpoint will report its progress
+ in the <structname>pg_stat_progress_checkpoint</structname> view. See
+ <xref linkend="checkpoint-progress-reporting"/> for details.
+ </para>
</refsect1>
<refsect1>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index 2bb27a8468..a75d1d63d0 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -530,7 +530,10 @@
adjust the <xref linkend="guc-archive-timeout"/> parameter rather than the
checkpoint parameters.)
It is also possible to force a checkpoint by using the SQL
- command <command>CHECKPOINT</command>.
+ command <command>CHECKPOINT</command>. The checkpointer process running the
+ checkpoint will report its progress in the
+ <structname>pg_stat_progress_checkpoint</structname> view. See
+ <xref linkend="checkpoint-progress-reporting"/> for details.
</para>
<para>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0d2bd7a357..af6e64c836 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -65,6 +65,7 @@
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
#include "catalog/pg_database.h"
+#include "commands/progress.h"
#include "common/controldata_utils.h"
#include "executor/instrument.h"
#include "miscadmin.h"
@@ -719,6 +720,8 @@ static void WALInsertLockAcquireExclusive(void);
static void WALInsertLockRelease(void);
static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
+static void checkpoint_progress_start(int flags);
+
/*
* Insert an XLOG record represented by an already-constructed chain of data
* chunks. This is a low-level routine; to construct the WAL record header
@@ -6296,6 +6299,9 @@ CreateCheckPoint(int flags)
MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
+ /* Prepare to report progress of the checkpoint. */
+ checkpoint_progress_start(flags);
+
/*
* Use a critical section to force system panic if we have trouble.
*/
@@ -6394,6 +6400,7 @@ CreateCheckPoint(int flags)
curInsert += SizeOfXLogShortPHD;
}
checkPoint.redo = curInsert;
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_LSN, checkPoint.redo);
/*
* Here we update the shared RedoRecPtr for future XLogInsert calls; this
@@ -6629,8 +6636,12 @@ CreateCheckPoint(int flags)
KeepLogSeg(recptr, &_logSegNo);
}
_logSegNo--;
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_OLD_XLOG_RECYCLE);
RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr,
checkPoint.ThisTimeLineID);
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_FINALIZE);
/*
* Make more log segments if needed. (Do this after recycling old log
@@ -6652,6 +6663,9 @@ CreateCheckPoint(int flags)
/* Real work is done; log and update stats. */
LogCheckpointEnd(false);
+ /* Stop reporting progress of the checkpoint. */
+ pgstat_progress_end_command();
+
/* Reset the process title */
update_checkpoint_display(flags, false, true);
@@ -6808,29 +6822,60 @@ static void
CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
CheckPointRelationMap();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_REPLI_SLOTS);
CheckPointReplicationSlots();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_SNAPSHOTS);
CheckPointSnapBuild();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_LOGICAL_REWRITE_MAPPINGS);
CheckPointLogicalRewriteHeap();
CheckPointReplicationOrigin();
/* Write out all dirty data in SLRUs and the main buffer pool */
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_CLOG_PAGES);
CheckPointCLOG();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_COMMITTS_PAGES);
CheckPointCommitTs();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_SUBTRANS_PAGES);
CheckPointSUBTRANS();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_MULTIXACT_PAGES);
CheckPointMultiXact();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_SLRU_PAGES);
CheckPointPredicate();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_BUFFERS);
CheckPointBuffers(flags);
/* Perform all queued up fsyncs */
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_FILE_SYNC);
ProcessSyncRequests();
CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();
/* We deliberately delay 2PC checkpointing as long as possible */
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_TWO_PHASE);
CheckPointTwoPhase(checkPointRedo);
}
@@ -6977,6 +7022,9 @@ CreateRestartPoint(int flags)
MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
+ /* Prepare to report progress of the restartpoint. */
+ checkpoint_progress_start(flags);
+
if (log_checkpoints)
LogCheckpointStart(flags, true);
@@ -7077,7 +7125,11 @@ CreateRestartPoint(int flags)
if (!RecoveryInProgress())
replayTLI = XLogCtl->InsertTimeLineID;
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_OLD_XLOG_RECYCLE);
RemoveOldXlogFiles(_logSegNo, RedoRecPtr, endptr, replayTLI);
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_FINALIZE);
/*
* Make more log segments if needed. (Do this after recycling old log
@@ -7098,6 +7150,9 @@ CreateRestartPoint(int flags)
/* Real work is done; log and update stats. */
LogCheckpointEnd(true);
+ /* Stop reporting progress of the restartpoint. */
+ pgstat_progress_end_command();
+
/* Reset the process title */
update_checkpoint_display(flags, true, true);
@@ -9197,3 +9252,27 @@ SetWalWriterSleeping(bool sleeping)
XLogCtl->WalWriterSleeping = sleeping;
SpinLockRelease(&XLogCtl->info_lck);
}
+
+/*
+ * Start reporting progress of the checkpoint.
+ */
+static void
+checkpoint_progress_start(int flags)
+{
+ const int index[] = {
+ PROGRESS_CHECKPOINT_TIMELINE,
+ PROGRESS_CHECKPOINT_KIND,
+ PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_START_TIMESTAMP
+ };
+ int64 val[4];
+
+ pgstat_progress_start_command(PROGRESS_COMMAND_CHECKPOINT, InvalidOid);
+
+ val[0] = XLogCtl->InsertTimeLineID;
+ val[1] = flags;
+ val[2] = PROGRESS_CHECKPOINT_PHASE_INIT;
+ val[3] = CheckpointStats.ckpt_start_t;
+
+ pgstat_progress_update_multi_param(4, index, val);
+}
\ No newline at end of file
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 3cb69b1f87..1c4c8ead9c 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1286,3 +1286,38 @@ CREATE VIEW pg_stat_subscription_workers AS
FROM pg_subscription_rel) sr,
LATERAL pg_stat_get_subscription_worker(sr.subid, sr.relid) w
JOIN pg_subscription s ON (w.subid = s.oid);
+
+CREATE VIEW pg_stat_progress_checkpoint AS
+ SELECT
+ S.pid AS pid,
+ pg_stat_get_progress_checkpoint_type(S.param1) AS type,
+ pg_stat_get_progress_checkpoint_kind(S.param2) AS kind,
+ ( '0/0'::pg_lsn +
+ ((CASE
+ WHEN S.param3 < 0 THEN pow(2::numeric, 64::numeric)::numeric
+ ELSE 0::numeric
+ END) +
+ S.param3::numeric)
+ ) AS start_lsn,
+ pg_stat_get_progress_checkpoint_start_time(S.param4) AS start_time,
+ CASE S.param5 WHEN 1 THEN 'initializing'
+ WHEN 2 THEN 'checkpointing replication slots'
+ WHEN 3 THEN 'checkpointing snapshots'
+ WHEN 4 THEN 'checkpointing logical rewrite mappings'
+ WHEN 5 THEN 'checkpointing commit log pages'
+ WHEN 6 THEN 'checkpointing commit time stamp pages'
+ WHEN 7 THEN 'checkpointing subtransaction pages'
+ WHEN 8 THEN 'checkpointing multixact pages'
+ WHEN 9 THEN 'checkpointing SLRU pages'
+ WHEN 10 THEN 'checkpointing buffers'
+ WHEN 11 THEN 'performing sync requests'
+ WHEN 12 THEN 'performing two phase checkpoint'
+ WHEN 13 THEN 'recycling old XLOG files'
WHEN 14 THEN 'finalizing'
+ END AS phase,
+ S.param6 AS total_buffer_writes,
+ S.param7 AS buffers_processed,
+ S.param8 AS buffers_written,
+ S.param9 AS total_file_syncs,
+ S.param10 AS files_synced
+ FROM pg_stat_get_progress_info('CHECKPOINT') AS S;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f5459c68f8..9663035d7a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -38,6 +38,7 @@
#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/storage.h"
+#include "commands/progress.h"
#include "executor/instrument.h"
#include "lib/binaryheap.h"
#include "miscadmin.h"
@@ -2012,6 +2013,8 @@ BufferSync(int flags)
WritebackContextInit(&wb_context, &checkpoint_flush_after);
TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_scan);
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_BUFFERS_TOTAL,
+ num_to_scan);
/*
* Sort buffers that need to be written to reduce the likelihood of random
@@ -2129,6 +2132,8 @@ BufferSync(int flags)
bufHdr = GetBufferDescriptor(buf_id);
num_processed++;
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_BUFFERS_PROCESSED,
+ num_processed);
/*
* We don't need to acquire the lock here, because we're only looking
@@ -2149,6 +2154,8 @@ BufferSync(int flags)
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
PendingCheckpointerStats.m_buf_written_checkpoints++;
num_written++;
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_BUFFERS_WRITTEN,
+ num_written);
}
}
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index e161d57761..638d3eb781 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -23,6 +23,7 @@
#include "access/multixact.h"
#include "access/xlog.h"
#include "access/xlogutils.h"
+#include "commands/progress.h"
#include "commands/tablespace.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -356,6 +357,9 @@ ProcessSyncRequests(void)
/* Now scan the hashtable for fsync requests to process */
absorb_counter = FSYNCS_PER_ABSORB;
hash_seq_init(&hstat, pendingOps);
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_FILES_TOTAL,
+ hash_get_num_entries(pendingOps));
+
while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
{
int failures;
@@ -419,6 +423,8 @@ ProcessSyncRequests(void)
longest = elapsed;
total_elapsed += elapsed;
processed++;
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_FILES_SYNCED,
+ processed);
if (log_checkpoints)
elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f ms",
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 30e8dfa7c1..2ea390a2b6 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -494,6 +494,8 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
cmdtype = PROGRESS_COMMAND_BASEBACKUP;
else if (pg_strcasecmp(cmd, "COPY") == 0)
cmdtype = PROGRESS_COMMAND_COPY;
+ else if (pg_strcasecmp(cmd, "CHECKPOINT") == 0)
+ cmdtype = PROGRESS_COMMAND_CHECKPOINT;
else
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
@@ -2495,3 +2497,50 @@ pg_stat_get_subscription_worker(PG_FUNCTION_ARGS)
/* Returns the record as Datum */
PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
}
+
+/*
+ * Return checkpoint type (either checkpoint or restartpoint).
+ */
+Datum
+pg_stat_get_progress_checkpoint_type(PG_FUNCTION_ARGS)
+{
+ TimeLineID cur_timeline = (TimeLineID) PG_GETARG_INT64(0);
+
+ if (RecoveryInProgress() || (GetWALInsertionTimeLine() != cur_timeline))
+ PG_RETURN_TEXT_P(CStringGetTextDatum("restartpoint"));
+ else
+ PG_RETURN_TEXT_P(CStringGetTextDatum("checkpoint"));
+}
+
+/*
+ * Return checkpoint kind based on the flags set.
+ */
+Datum
+pg_stat_get_progress_checkpoint_kind(PG_FUNCTION_ARGS)
+{
+ int64 flags = PG_GETARG_INT64(0);
+ char ckpt_kind[MAXPGPATH];
+
+ MemSet(ckpt_kind, 0, MAXPGPATH);
+ snprintf(ckpt_kind, MAXPGPATH, "%s%s%s%s%s%s%s%s%s",
+ (flags == 0) ? "unknown" : "",
+ (flags & CHECKPOINT_IS_SHUTDOWN) ? "shutdown " : "",
+ (flags & CHECKPOINT_END_OF_RECOVERY) ? "end-of-recovery " : "",
+ (flags & CHECKPOINT_IMMEDIATE) ? "immediate " : "",
+ (flags & CHECKPOINT_FORCE) ? "force " : "",
+ (flags & CHECKPOINT_WAIT) ? "wait " : "",
+ (flags & CHECKPOINT_CAUSE_XLOG) ? "wal " : "",
+ (flags & CHECKPOINT_CAUSE_TIME) ? "time " : "",
+ (flags & CHECKPOINT_FLUSH_ALL) ? "flush-all" : "");
+
+ PG_RETURN_TEXT_P(CStringGetTextDatum(ckpt_kind));
+}
+
+/*
+ * Return start time of the checkpoint.
+ */
+Datum
+pg_stat_get_progress_checkpoint_start_time(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_TIMESTAMPTZ(PG_GETARG_INT64(0));
+}
\ No newline at end of file
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 7f1ee97f55..06a84e2488 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5353,6 +5353,18 @@
proargmodes => '{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
proargnames => '{cmdtype,pid,datid,relid,param1,param2,param3,param4,param5,param6,param7,param8,param9,param10,param11,param12,param13,param14,param15,param16,param17,param18,param19,param20}',
prosrc => 'pg_stat_get_progress_info' },
+{ oid => '560', descr => 'return checkpoint type',
+ proname => 'pg_stat_get_progress_checkpoint_type', provolatile => 'i',
+ prorettype => 'text', proargtypes => 'int8',
+ prosrc => 'pg_stat_get_progress_checkpoint_type' },
+{ oid => '561', descr => 'return checkpoint kind',
+ proname => 'pg_stat_get_progress_checkpoint_kind', provolatile => 'i',
+ prorettype => 'text', proargtypes => 'int8',
+ prosrc => 'pg_stat_get_progress_checkpoint_kind' },
{ oid => '562', descr => 'return start time of the checkpoint',
+ proname => 'pg_stat_get_progress_checkpoint_start_time', provolatile => 'i',
+ prorettype => 'timestamptz', proargtypes => 'int8',
+ prosrc => 'pg_stat_get_progress_checkpoint_start_time' },
{ oid => '3099',
descr => 'statistics: information about currently active replication',
proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index a28938caf4..19e9e93072 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -151,4 +151,32 @@
#define PROGRESS_COPY_TYPE_PIPE 3
#define PROGRESS_COPY_TYPE_CALLBACK 4
+/* Progress parameters for checkpoint */
+#define PROGRESS_CHECKPOINT_TIMELINE 0
+#define PROGRESS_CHECKPOINT_KIND 1
+#define PROGRESS_CHECKPOINT_LSN 2
+#define PROGRESS_CHECKPOINT_START_TIMESTAMP 3
+#define PROGRESS_CHECKPOINT_PHASE 4
+#define PROGRESS_CHECKPOINT_BUFFERS_TOTAL 5
+#define PROGRESS_CHECKPOINT_BUFFERS_PROCESSED 6
+#define PROGRESS_CHECKPOINT_BUFFERS_WRITTEN 7
+#define PROGRESS_CHECKPOINT_FILES_TOTAL 8
+#define PROGRESS_CHECKPOINT_FILES_SYNCED 9
+
+/* Phases of checkpoint (as advertised via PROGRESS_CHECKPOINT_PHASE) */
+#define PROGRESS_CHECKPOINT_PHASE_INIT 1
+#define PROGRESS_CHECKPOINT_PHASE_REPLI_SLOTS 2
+#define PROGRESS_CHECKPOINT_PHASE_SNAPSHOTS 3
+#define PROGRESS_CHECKPOINT_PHASE_LOGICAL_REWRITE_MAPPINGS 4
+#define PROGRESS_CHECKPOINT_PHASE_CLOG_PAGES 5
+#define PROGRESS_CHECKPOINT_PHASE_COMMITTS_PAGES 6
+#define PROGRESS_CHECKPOINT_PHASE_SUBTRANS_PAGES 7
+#define PROGRESS_CHECKPOINT_PHASE_MULTIXACT_PAGES 8
+#define PROGRESS_CHECKPOINT_PHASE_SLRU_PAGES 9
+#define PROGRESS_CHECKPOINT_PHASE_BUFFERS 10
+#define PROGRESS_CHECKPOINT_PHASE_FILE_SYNC 11
+#define PROGRESS_CHECKPOINT_PHASE_TWO_PHASE 12
+#define PROGRESS_CHECKPOINT_PHASE_OLD_XLOG_RECYCLE 13
+#define PROGRESS_CHECKPOINT_PHASE_FINALIZE 14
+
#endif
diff --git a/src/include/utils/backend_progress.h b/src/include/utils/backend_progress.h
index 47bf8029b0..02d51fb948 100644
--- a/src/include/utils/backend_progress.h
+++ b/src/include/utils/backend_progress.h
@@ -27,7 +27,8 @@ typedef enum ProgressCommandType
PROGRESS_COMMAND_CLUSTER,
PROGRESS_COMMAND_CREATE_INDEX,
PROGRESS_COMMAND_BASEBACKUP,
- PROGRESS_COMMAND_COPY
+ PROGRESS_COMMAND_COPY,
+ PROGRESS_COMMAND_CHECKPOINT
} ProgressCommandType;
#define PGSTAT_NUM_PROGRESS_PARAM 20
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 1420288d67..e3024975d9 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1897,6 +1897,38 @@ pg_stat_progress_basebackup| SELECT s.pid,
s.param4 AS tablespaces_total,
s.param5 AS tablespaces_streamed
FROM pg_stat_get_progress_info('BASEBACKUP'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20);
+pg_stat_progress_checkpoint| SELECT s.pid,
+ pg_stat_get_progress_checkpoint_type(s.param1) AS type,
+ pg_stat_get_progress_checkpoint_kind(s.param2) AS kind,
+ ('0/0'::pg_lsn + (
+ CASE
+ WHEN (s.param3 < 0) THEN pow((2)::numeric, (64)::numeric)
+ ELSE (0)::numeric
+ END + (s.param3)::numeric)) AS start_lsn,
+ pg_stat_get_progress_checkpoint_start_time(s.param4) AS start_time,
+ CASE s.param5
+ WHEN 1 THEN 'initializing'::text
+ WHEN 2 THEN 'checkpointing replication slots'::text
+ WHEN 3 THEN 'checkpointing snapshots'::text
+ WHEN 4 THEN 'checkpointing logical rewrite mappings'::text
+ WHEN 5 THEN 'checkpointing commit log pages'::text
+ WHEN 6 THEN 'checkpointing commit time stamp pages'::text
+ WHEN 7 THEN 'checkpointing subtransaction pages'::text
+ WHEN 8 THEN 'checkpointing multixact pages'::text
+ WHEN 9 THEN 'checkpointing SLRU pages'::text
+ WHEN 10 THEN 'checkpointing buffers'::text
+ WHEN 11 THEN 'performing sync requests'::text
+ WHEN 12 THEN 'performing two phase checkpoint'::text
+ WHEN 13 THEN 'recycling old XLOG files'::text
WHEN 14 THEN 'finalizing'::text
+ ELSE NULL::text
+ END AS phase,
+ s.param6 AS total_buffer_writes,
+ s.param7 AS buffers_processed,
+ s.param8 AS buffers_written,
+ s.param9 AS total_file_syncs,
+ s.param10 AS files_synced
+ FROM pg_stat_get_progress_info('CHECKPOINT'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20);
pg_stat_progress_cluster| SELECT s.pid,
s.datid,
d.datname,
--
2.25.1
> + if ((ckpt_flags &
> +     (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY)) == 0)
> + {
>
> This code (present at multiple places) looks a little ugly to me, what
> we can do instead is add a macro probably named IsShutdownCheckpoint()
> which does the above check and use it in all the functions that have
> this check. See below:
>
> #define IsShutdownCheckpoint(flags) \
>     (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY) != 0)
>
> And then you may use this macro like:
>
> if (IsBootstrapProcessingMode() || IsShutdownCheckpoint(flags))
>     return;
Good suggestion. In the v3 patch, I have removed the corresponding
code as these checks are not required. Hence this suggestion is not
applicable now.
---
> + pgstat_progress_start_command(PROGRESS_COMMAND_CHECKPOINT, InvalidOid);
> +
> + val[0] = XLogCtl->InsertTimeLineID;
> + val[1] = flags;
> + val[2] = PROGRESS_CHECKPOINT_PHASE_INIT;
> + val[3] = CheckpointStats.ckpt_start_t;
> +
> + pgstat_progress_update_multi_param(4, index, val);
> + }
>
> Any specific reason for recording the timelineID in checkpoint stats
> table? Will this ever change in our case?
The timelineID is used to decide whether the current operation is
checkpoint or restartpoint. There is a field in the view to display
this information.
Thanks & Regards,
Nitin Jadhav
On Wed, Feb 23, 2022 at 9:46 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> + if ((ckpt_flags &
> +     (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY)) == 0)
> + {
>
> This code (present at multiple places) looks a little ugly to me, what
> we can do instead is add a macro probably named IsShutdownCheckpoint()
> which does the above check and use it in all the functions that have
> this check. See below:
>
> #define IsShutdownCheckpoint(flags) \
>     (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY) != 0)
>
> And then you may use this macro like:
>
> if (IsBootstrapProcessingMode() || IsShutdownCheckpoint(flags))
>     return;
>
> This change can be done in all these functions:
>
> +void
> +checkpoint_progress_start(int flags)
>
> --
>
> + */
> +void
> +checkpoint_progress_update_param(int index, int64 val)
>
> --
>
> + * Stop reporting progress of the checkpoint.
> + */
> +void
> +checkpoint_progress_end(void)
>
> ==
>
> + pgstat_progress_start_command(PROGRESS_COMMAND_CHECKPOINT, InvalidOid);
> +
> + val[0] = XLogCtl->InsertTimeLineID;
> + val[1] = flags;
> + val[2] = PROGRESS_CHECKPOINT_PHASE_INIT;
> + val[3] = CheckpointStats.ckpt_start_t;
> +
> + pgstat_progress_update_multi_param(4, index, val);
> + }
>
> Any specific reason for recording the timelineID in checkpoint stats
> table? Will this ever change in our case?
>
> --
> With Regards,
> Ashutosh Sharma.

On Wed, Feb 23, 2022 at 6:59 PM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:

> I will make use of pgstat_progress_update_multi_param() in the next
> patch to replace multiple calls to checkpoint_progress_update_param().

Fixed.
---

The other progress tables use [type]_total as column names for counter
targets (e.g. backup_total for backup_streamed, heap_blks_total for
heap_blks_scanned, etc.). I think that `buffers_total` and
`files_total` would be better column names.

I agree and I will update this in the next patch.
Fixed.
---

How about this "The checkpoint is started because max_wal_size is reached".
"The checkpoint is started because checkpoint_timeout expired".
"The checkpoint is started because some operation forced a checkpoint".
I have used the above description. Kindly let me know if any changes
are required.
---

+ <entry><literal>checkpointing CommitTs pages</literal></entry>
CommitTs -> Commit time stamp
I will handle this in the next patch.
Fixed.
---

There are more scenarios where you can have a backend requesting a checkpoint
and waiting for its completion, and there may be more than one backend
concerned, so I don't think that storing only one / the first backend pid is
ok.

Thanks for this information. I am not considering backend_pid.
---

I think all the information should be exposed. Only knowing why the current
checkpoint has been triggered without any further information seems a bit
useless. Think for instance for cases like [1].

I have supported all possible checkpoint kinds. Added
pg_stat_get_progress_checkpoint_kind() to convert the flags (int) to a
string representing a combination of flags and also checking for the
flag update in ImmediateCheckpointRequested() which checks whether
CHECKPOINT_IMMEDIATE flag is set or not. I did not find any other
cases where the flags get changed (which changes the current
checkpoint behaviour) during the checkpoint. Kindly let me know if I
am missing something.
---

I feel 'processes_wiating' aligns more with the naming conventions of
the fields of the existing progress views.

There's at least pg_stat_progress_vacuum.num_dead_tuples. Anyway I don't have
a strong opinion on it, just make sure to correct the typo.

More analysis is required to support this. I am planning to take care
in the next patch.
---

If pg_is_in_recovery() is true, then it's a restartpoint, otherwise it's a
restartpoint if the checkpoint's timeline is different from the current
timeline?

Fixed.
Sharing the v2 patch. Kindly have a look and share your comments.
Thanks & Regards,
Nitin Jadhav

On Tue, Feb 22, 2022 at 12:08 PM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:

Thank you for sharing the information. 'triggering backend PID' (int)
- can be stored without any problem. 'checkpoint or restartpoint?'
(boolean) - can be stored as an integer value like
PROGRESS_CHECKPOINT_TYPE_CHECKPOINT(0) and
PROGRESS_CHECKPOINT_TYPE_RESTARTPOINT(1). 'elapsed time' (store as
start time in stat_progress, timestamp fits in 64 bits) - As
Timestamptz is of type int64 internally, so we can store the timestamp
value in the progress parameter and then expose a function like
'pg_stat_get_progress_checkpoint_elapsed' which takes int64 (not
Timestamptz) as argument and then returns string representing the
elapsed time.

No need to use a string there; I think exposing the checkpoint start
time is good enough. The conversion of int64 to timestamp[tz] can be
done in SQL (although I'm not sure that exposing the internal bitwise
representation of Interval should be exposed to that extent) [0].
Users can then extract the duration interval using now() - start_time,
which also allows the user to use their own preferred formatting.

The reason for showing the
timestamp directly is in case of checkpoint during shutdown and
end-of-recovery, I am planning to log a message in server logs using
'log_startup_progress_interval' infrastructure which displays elapsed
time. So just to match both of the behaviour I am displaying elapsed
time here. I feel that elapsed time gives a quicker feel of the
progress. Kindly let me know if you still feel just exposing the
timestamp is better than showing the elapsed time.

'checkpoint start location' (lsn = uint64) - I feel we
cannot use progress parameters for this case. As assigning uint64 to
int64 type would be an issue for larger values and can lead to hidden
bugs.

Not necessarily - we can (without much trouble) do a bitwise cast from
uint64 to int64, and then (in SQL) cast it back to a pg_lsn [1]. Not
very elegant, but it works quite well.

[1] SELECT '0/0'::pg_lsn + ((CASE WHEN stat.my_int64 < 0 THEN
pow(2::numeric, 64::numeric)::numeric ELSE 0::numeric END) +
stat.my_int64::numeric) FROM (SELECT -2::bigint /* 0xFFFFFFFF/FFFFFFFE
*/ AS my_bigint_lsn) AS stat(my_int64);

Thanks for sharing. It works. I will include this in the next patch.
On Sat, Feb 19, 2022 at 11:02 AM Julien Rouhaud <rjuju123@gmail.com> wrote:

Hi,
On Fri, Feb 18, 2022 at 08:07:05PM +0530, Nitin Jadhav wrote:
The backend_pid contains a valid value only during
the CHECKPOINT command issued by the backend explicitly, otherwise the
value will be 0. We may have to add an additional field to
'CheckpointerShmemStruct' to hold the backend pid. The backend
requesting the checkpoint will update its pid to this structure.
Kindly let me know if you still feel the backend_pid field is not
necessary.

There are more scenarios where you can have a backend requesting a checkpoint
and waiting for its completion, and there may be more than one backend
concerned, so I don't think that storing only one / the first backend pid is
ok.

And also while looking at the patch I see there's the same problem that I
mentioned in the previous thread, which is that the effective flags can be
updated once the checkpoint started, and as-is the view won't reflect that. It
also means that you can't simply display one of wal, time or force but a
possible combination of the flags (including the one not handled in v1).

If I understand the above comment properly, it has 2 points. First is
to display the combination of flags rather than just displaying wal,
time or force - The idea behind this is to just let the user know the
reason for checkpointing. That is, the checkpoint is started because
max_wal_size is reached or checkpoint_timeout expired or explicitly
issued CHECKPOINT command. The other flags like CHECKPOINT_IMMEDIATE,
CHECKPOINT_WAIT or CHECKPOINT_FLUSH_ALL indicate how the checkpoint
has to be performed. Hence I have not included those in the view. If
it is really required, I would like to modify the code to include
other flags and display the combination.

I think all the information should be exposed. Only knowing why the current
checkpoint has been triggered without any further information seems a bit
useless. Think for instance for cases like [1].

Second point is to reflect
the updated flags in the view. AFAIK, there is a possibility that the
flags get updated during the on-going checkpoint but the reason for
checkpoint (wal, time or force) will remain same for the current
checkpoint. There might be a change in how checkpoint has to be
performed if CHECKPOINT_IMMEDIATE flag is set. If we go with
displaying the combination of flags in the view, then probably we may
have to reflect this in the view.

You can only "upgrade" a checkpoint, but not "downgrade" it. So if for
instance you find both CHECKPOINT_CAUSE_TIME and CHECKPOINT_FORCE (which is
possible) you can easily know which one was the one that triggered the
checkpoint and which one was added later.

Probably a new field named 'processes_wiating' or 'events_waiting' can be
added for this purpose.

Maybe num_process_waiting?
I feel 'processes_wiating' aligns more with the naming conventions of
the fields of the existing progress views.

There's at least pg_stat_progress_vacuum.num_dead_tuples. Anyway I don't have
a strong opinion on it, just make sure to correct the typo.

Probably writing of buffers or syncing files may complete before
pg_is_in_recovery() returns false. But there are some cleanup
operations that happen as part of the checkpoint. During this scenario, we
may get false value for pg_is_in_recovery(). Please refer following
piece of code which is present in CreateRestartpoint().

if (!RecoveryInProgress())
    replayTLI = XLogCtl->InsertTimeLineID;

Then maybe we could store the timeline rather than the kind of checkpoint?
You should still be able to compute the information while giving a bit more
information for the same memory usage.

Can you please describe more about how checkpoint/restartpoint can be
confirmed using the timeline id.

If pg_is_in_recovery() is true, then it's a restartpoint, otherwise it's a
restartpoint if the checkpoint's timeline is different from the current
timeline?
Thank you Alvaro and Matthias for your views. I understand your point
of not updating the progress-report flag here as it just checks
whether the CHECKPOINT_IMMEDIATE is set or not and takes an action
based on that but it doesn't change the checkpoint flags. I will
modify the code but I am a bit confused here. As per Alvaro, we need
to make the progress-report flag change in whatever is the place that
*requests* an immediate checkpoint. I feel this gives information
about the upcoming checkpoint not the current one. So updating here
provides wrong details in the view. The flags available during
CreateCheckPoint() will remain same for the entire checkpoint
operation and we should show the same information in the view till it
completes.

I'm not sure what Matthias meant, but as far as I know there's no fundamental
difference between checkpoint with and without the CHECKPOINT_IMMEDIATE flag,
and there's also no scheduling for multiple checkpoints.

Yes, the flags will remain the same but checkpoint.c will test both the passed
flags and the shmem flags to see whether a delay should be added or not, which
is the only difference in checkpoint processing for this flag. See the call to
ImmediateCheckpointRequested() which will look at the value in shmem:

/*
* Perform the usual duties and take a nap, unless we're behind schedule,
* in which case we just try to catch up as quickly as possible.
*/
if (!(flags & CHECKPOINT_IMMEDIATE) &&
!ShutdownRequestPending &&
!ImmediateCheckpointRequested() &&
IsCheckpointOnSchedule(progress))
I understand that the checkpointer considers flags as well as the
shmem flags and if CHECKPOINT_IMMEDIATE flag is set, it affects the
current checkpoint operation (No further delay) but does not change
the current flag value. Should we display this change in the kind
field of the view or not? Please share your thoughts.
Thanks & Regards,
Nitin Jadhav
On Fri, Feb 25, 2022 at 12:33 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
Hi,
On Fri, Feb 25, 2022 at 12:23:27AM +0530, Nitin Jadhav wrote:
I think the change to ImmediateCheckpointRequested() makes no sense.
Before this patch, that function merely inquires whether there's an
immediate checkpoint queued. After this patch, it ... changes a
progress-reporting flag? I think it would make more sense to make the
progress-report flag change in whatever is the place that *requests* an
immediate checkpoint rather than here.

> diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
> +ImmediateCheckpointRequested(int flags)
>      if (cps->ckpt_flags & CHECKPOINT_IMMEDIATE)
> +     {
> +         updated_flags |= CHECKPOINT_IMMEDIATE;

I don't think that these changes are expected behaviour. Under this
condition; the currently running checkpoint is still not 'immediate',
but it is going to hurry up for a new, actually immediate checkpoint.
Those are different kinds of checkpoint handling; and I don't think
you should modify the reported flags to show that we're going to do
stuff faster than usual. Maybe maintain a separate 'upcoming
checkpoint flags' field instead?

Thank you Alvaro and Matthias for your views. I understand your point
of not updating the progress-report flag here as it just checks
whether the CHECKPOINT_IMMEDIATE is set or not and takes an action
based on that but it doesn't change the checkpoint flags. I will
modify the code but I am a bit confused here. As per Alvaro, we need
to make the progress-report flag change in whatever is the place that
*requests* an immediate checkpoint. I feel this gives information
about the upcoming checkpoint not the current one. So updating here
provides wrong details in the view. The flags available during
CreateCheckPoint() will remain same for the entire checkpoint
operation and we should show the same information in the view till it
completes.I'm not sure what Matthias meant, but as far as I know there's no fundamental
difference between checkpoint with and without the CHECKPOINT_IMMEDIATE flag,
and there's also no scheduling for multiple checkpoints.Yes, the flags will remain the same but checkpoint.c will test both the passed
flags and the shmem flags to see whether a delay should be added or not, which
is the only difference in checkpoint processing for this flag. See the call to
ImmediateCheckpointRequested() which will look at the value in shmem:/*
* Perform the usual duties and take a nap, unless we're behind schedule,
* in which case we just try to catch up as quickly as possible.
*/
if (!(flags & CHECKPOINT_IMMEDIATE) &&
!ShutdownRequestPending &&
!ImmediateCheckpointRequested() &&
IsCheckpointOnSchedule(progress))
[...]
On Fri, Feb 25, 2022 at 08:53:50PM +0530, Nitin Jadhav wrote:
I'm not sure what Matthias meant, but as far as I know there's no fundamental
difference between checkpoint with and without the CHECKPOINT_IMMEDIATE flag,
and there's also no scheduling for multiple checkpoints.Yes, the flags will remain the same but checkpoint.c will test both the passed
flags and the shmem flags to see whether a delay should be added or not, which
is the only difference in checkpoint processing for this flag. See the call to
ImmediateCheckpointRequested() which will look at the value in shmem:/*
* Perform the usual duties and take a nap, unless we're behind schedule,
* in which case we just try to catch up as quickly as possible.
*/
if (!(flags & CHECKPOINT_IMMEDIATE) &&
!ShutdownRequestPending &&
!ImmediateCheckpointRequested() &&
IsCheckpointOnSchedule(progress))

I understand that the checkpointer considers flags as well as the
shmem flags and if CHECKPOINT_IMMEDIATE flag is set, it affects the
current checkpoint operation (No further delay) but does not change
the current flag value. Should we display this change in the kind
field of the view or not? Please share your thoughts.
I think the fields should be added. It's good to know that a checkpoint was
triggered due to normal activity and should be spread, and then something
upgraded it to an immediate checkpoint. If you're desperately waiting for the
end of a checkpoint for some reason and ask for an immediate checkpoint, you'll
certainly be happy to see that the checkpointer is aware of it.
But maybe I missed something in the code, so let's wait for Matthias input
about it.
On Fri, 25 Feb 2022 at 17:35, Julien Rouhaud <rjuju123@gmail.com> wrote:
On Fri, Feb 25, 2022 at 08:53:50PM +0530, Nitin Jadhav wrote:
I'm not sure what Matthias meant, but as far as I know there's no fundamental
difference between checkpoint with and without the CHECKPOINT_IMMEDIATE flag,
and there's also no scheduling for multiple checkpoints.

Yes, the flags will remain the same but checkpoint.c will test both the passed
flags and the shmem flags to see whether a delay should be added or not, which
is the only difference in checkpoint processing for this flag. See the call to
ImmediateCheckpointRequested() which will look at the value in shmem:

/*
* Perform the usual duties and take a nap, unless we're behind schedule,
* in which case we just try to catch up as quickly as possible.
*/
if (!(flags & CHECKPOINT_IMMEDIATE) &&
!ShutdownRequestPending &&
!ImmediateCheckpointRequested() &&
IsCheckpointOnSchedule(progress))

I understand that the checkpointer considers flags as well as the
shmem flags and if CHECKPOINT_IMMEDIATE flag is set, it affects the
current checkpoint operation (No further delay) but does not change
the current flag value. Should we display this change in the kind
field of the view or not? Please share your thoughts.

I think the fields should be added. It's good to know that a checkpoint was
triggered due to normal activity and should be spread, and then something
upgraded it to an immediate checkpoint. If you're desperately waiting for the
end of a checkpoint for some reason and ask for an immediate checkpoint, you'll
certainly be happy to see that the checkpointer is aware of it.But maybe I missed something in the code, so let's wait for Matthias input
about it.
The point I was trying to make was "If cps->ckpt_flags is
CHECKPOINT_IMMEDIATE, we hurry up to start the new checkpoint that is
actually immediate". That doesn't mean that this checkpoint was
created with IMMEDIATE or running using IMMEDIATE, only that optional
delays are now being skipped instead.
To let the user detect _why_ the optional delays are now being
skipped, I propose not to report this currently running checkpoint's
"flags | CHECKPOINT_IMMEDIATE", but to add reporting of the next
checkpoint's flags; which would allow the detection and display of the
CHECKPOINT_IMMEDIATE we're actually hurrying for (plus some more
interesting information flags).
-Matthias
PS. I just noticed that the checkpoint flags are also being parsed and
stringified twice in LogCheckpointStart; and adding another duplicate
in the current code would put that at 3 copies of effectively the same
code. Do we maybe want to deduplicate that into macros, similar to
LSN_FORMAT_ARGS?
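To illustrate the kind of deduplication being suggested, here's a minimal
standalone sketch in the style of LSN_FORMAT_ARGS. All names below
(CKPT_FLAGS_FORMAT, CKPT_FLAGS_FORMAT_ARGS, ckpt_flags_to_str) are
hypothetical, not existing PostgreSQL macros, and only a subset of the
flag bits is shown:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical flag bits, standing in for the CHECKPOINT_* flags in xlog.h */
#define CHECKPOINT_IS_SHUTDOWN     0x0001
#define CHECKPOINT_END_OF_RECOVERY 0x0002
#define CHECKPOINT_IMMEDIATE       0x0004
#define CHECKPOINT_FORCE           0x0008

/*
 * One possible shape for the deduplication: a format string plus an
 * argument-expanding macro, so every call site reduces to a single
 * snprintf/ereport line instead of its own copy of the ternary chain.
 */
#define CKPT_FLAGS_FORMAT "%s%s%s%s"
#define CKPT_FLAGS_FORMAT_ARGS(flags) \
	(((flags) & CHECKPOINT_IS_SHUTDOWN) ? "shutdown " : ""), \
	(((flags) & CHECKPOINT_END_OF_RECOVERY) ? "end-of-recovery " : ""), \
	(((flags) & CHECKPOINT_IMMEDIATE) ? "immediate " : ""), \
	(((flags) & CHECKPOINT_FORCE) ? "force " : "")

/* Helper used by the tests; each caller could equally inline the macro pair. */
static void
ckpt_flags_to_str(int flags, char *buf, size_t buflen)
{
	snprintf(buf, buflen, CKPT_FLAGS_FORMAT, CKPT_FLAGS_FORMAT_ARGS(flags));
}
```

With this shape, LogCheckpointStart and the progress view would share one
definition of the flag-to-text mapping.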
On Fri, Feb 25, 2022 at 06:49:42PM +0100, Matthias van de Meent wrote:
The point I was trying to make was "If cps->ckpt_flags is
CHECKPOINT_IMMEDIATE, we hurry up to start the new checkpoint that is
actually immediate". That doesn't mean that this checkpoint was
created with IMMEDIATE or running using IMMEDIATE, only that optional
delays are now being skipped instead.
Ah, I now see what you mean.
To let the user detect _why_ the optional delays are now being
skipped, I propose not to report this currently running checkpoint's
"flags | CHECKPOINT_IMMEDIATE", but to add reporting of the next
checkpoint's flags; which would allow the detection and display of the
CHECKPOINT_IMMEDIATE we're actually hurrying for (plus some more
interesting information flags).
I'm still not convinced that's a sensible approach. The next checkpoint will
be displayed in the view as CHECKPOINT_IMMEDIATE, so you will then know about
it. I'm not sure that having that specific information in the view is
going to help, especially if users have to understand "a slow checkpoint is
actually fast even if it's displayed as slow if the next checkpoint is going to
be fast". Saying "it's timed" (which implies slow) and "it's fast" is maybe
still counter-intuitive, but at least has a better chance of making users see
there's something going on and refer to the docs if they don't get it.
On Sat, Feb 26, 2022 at 02:30:36AM +0800, Julien Rouhaud wrote:
Just to be clear, I do think that it's worthwhile to add some information that
some backends are waiting for that next checkpoint. As discussed before, an
int for the number of backends looks like enough information to me.
On Fri, Feb 25, 2022 at 8:38 PM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
Had a quick look over the v3 patch. I'm not sure if it's the best way
to have pg_stat_get_progress_checkpoint_type,
pg_stat_get_progress_checkpoint_kind and
pg_stat_get_progress_checkpoint_start_time just for printing info in
readable format in pg_stat_progress_checkpoint. I don't think these
functions will ever be useful for the users.
1) Can't we use pg_is_in_recovery to determine if it's a restartpoint
or checkpoint instead of having a new function
pg_stat_get_progress_checkpoint_type?
2) Can't we just have these checks inside CASE-WHEN-THEN-ELSE blocks
directly instead of new function pg_stat_get_progress_checkpoint_kind?
+ snprintf(ckpt_kind, MAXPGPATH, "%s%s%s%s%s%s%s%s%s",
+ (flags == 0) ? "unknown" : "",
+ (flags & CHECKPOINT_IS_SHUTDOWN) ? "shutdown " : "",
+ (flags & CHECKPOINT_END_OF_RECOVERY) ? "end-of-recovery " : "",
+ (flags & CHECKPOINT_IMMEDIATE) ? "immediate " : "",
+ (flags & CHECKPOINT_FORCE) ? "force " : "",
+ (flags & CHECKPOINT_WAIT) ? "wait " : "",
+ (flags & CHECKPOINT_CAUSE_XLOG) ? "wal " : "",
+ (flags & CHECKPOINT_CAUSE_TIME) ? "time " : "",
+ (flags & CHECKPOINT_FLUSH_ALL) ? "flush-all" : "");
3) Why do we need this extra calculation for start_lsn? Do you ever
see a negative LSN or something here?
+ ('0/0'::pg_lsn + (
+ CASE
+ WHEN (s.param3 < 0) THEN pow((2)::numeric, (64)::numeric)
+ ELSE (0)::numeric
+ END + (s.param3)::numeric)) AS start_lsn,
4) Can't you use timestamptz_in(to_char(s.param4)) instead of
pg_stat_get_progress_checkpoint_start_time? I don't quite understand
the reasoning for having this function, and why it's named *checkpoint*
when it doesn't do anything specific to the checkpoint at all?
Having 3 unnecessary functions that aren't useful to the users at all
in proc.dat will simply eat up the function oids IMO. Hence, I suggest
let's try to do without extra functions.
Regards,
Bharath Rupireddy.
On Sun, Feb 27, 2022 at 8:44 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
Another thought for my review comment:
1) Can't we use pg_is_in_recovery to determine if it's a restartpoint
or checkpoint instead of having a new function
pg_stat_get_progress_checkpoint_type?
I don't think using pg_is_in_recovery works here, as it is evaluated after
the checkpoint has started. So, I think the right way here is to send
1 in CreateCheckPoint and 2 in CreateRestartPoint and use
CASE-WHEN-ELSE-END to show 1 as "checkpoint" and 2 as "restartpoint".
Continuing my review:
5) Do we need a special phase for this checkpoint operation? I'm not
sure in which cases it will take a long time, but it looks like
there's a wait loop here.
vxids = GetVirtualXIDsDelayingChkpt(&nvxids);
if (nvxids > 0)
{
    do
    {
        pg_usleep(10000L);    /* wait for 10 msec */
    } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids));
}
Also, how about special phases for SyncPostCheckpoint(),
SyncPreCheckpoint(), InvalidateObsoleteReplicationSlots(),
PreallocXlogFiles() (it currently pre-allocates only 1 WAL file, but
that might be increased in the future (?)), TruncateSUBTRANS()?
6) SLRU (Simple LRU) isn't a phase here, you can just say
PROGRESS_CHECKPOINT_PHASE_PREDICATE_LOCK_PAGES.
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+                              PROGRESS_CHECKPOINT_PHASE_SLRU_PAGES);
  CheckPointPredicate();
And :s/checkpointing SLRU pages/checkpointing predicate lock pages
+ WHEN 9 THEN 'checkpointing SLRU pages'
7) :s/PROGRESS_CHECKPOINT_PHASE_FILE_SYNC/PROGRESS_CHECKPOINT_PHASE_PROCESS_FILE_SYNC_REQUESTS
And :s/WHEN 11 THEN 'performing sync requests'/WHEN 11 THEN
'processing file sync requests'
8) :s/Finalizing/finalizing
+ WHEN 14 THEN 'Finalizing'
9) :s/checkpointing snapshots/checkpointing logical replication snapshot files
+ WHEN 3 THEN 'checkpointing snapshots'
:s/checkpointing logical rewrite mappings/checkpointing logical
replication rewrite mapping files
+ WHEN 4 THEN 'checkpointing logical rewrite mappings'
10) I'm not sure if it's been discussed, but how about adding the number of
snapshot/mapping files the checkpoint has processed so far in the file
processing while-loops of
CheckPointSnapBuild/CheckPointLogicalRewriteHeap? Sometimes there can
be many logical snapshot or mapping files and users may be interested
in knowing the so-far-processed-file-count.
11) I think it was discussed already: are we going to add the pid of the
checkpoint requestor?
Regards,
Bharath Rupireddy.
Hi,
On Mon, Feb 28, 2022 at 10:21:23AM +0530, Bharath Rupireddy wrote:
Another thought for my review comment:
1) Can't we use pg_is_in_recovery to determine if it's a restartpoint
or checkpoint instead of having a new function
pg_stat_get_progress_checkpoint_type?

I don't think using pg_is_in_recovery works here, as it is evaluated after
the checkpoint has started. So, I think the right way here is to send
1 in CreateCheckPoint and 2 in CreateRestartPoint and use
CASE-WHEN-ELSE-END to show 1 as "checkpoint" and 2 as "restartpoint".
I suggested upthread to store the starting timeline instead. This way you can
deduce whether it's a restartpoint or a checkpoint, but you can also deduce
other information, like what was the starting WAL.
11) I think it's discussed, are we going to add the pid of the
checkpoint requestor?
As mentioned upthread, there can be multiple backends that request a
checkpoint, so unless we want to store an array of pids we should store the
number of backends that are waiting for a new checkpoint.
On Mon, Feb 28, 2022 at 12:02 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
Hi,
On Mon, Feb 28, 2022 at 10:21:23AM +0530, Bharath Rupireddy wrote:
Another thought for my review comment:
1) Can't we use pg_is_in_recovery to determine if it's a restartpoint
or checkpoint instead of having a new function
pg_stat_get_progress_checkpoint_type?

I don't think using pg_is_in_recovery works here, as it is evaluated after
the checkpoint has started. So, I think the right way here is to send
1 in CreateCheckPoint and 2 in CreateRestartPoint and use
CASE-WHEN-ELSE-END to show 1 as "checkpoint" and 2 as "restartpoint".

I suggested upthread to store the starting timeline instead. This way you can
deduce whether it's a restartpoint or a checkpoint, but you can also deduce
other information, like what was the starting WAL.
I don't understand why we need the timeline here to just determine
whether it's a restartpoint or checkpoint. I know that the
InsertTimeLineID is 0 during recovery. IMO, emitting 1 for checkpoint
and 2 for restartpoint in CreateCheckPoint and CreateRestartPoint
respectively and using CASE-WHEN-ELSE-END to show it in readable
format is the easiest way.
Can't the checkpoint start LSN be deduced from
PROGRESS_CHECKPOINT_LSN, checkPoint.redo?
I'm completely against these pg_stat_get_progress_checkpoint_{type,
kind, start_time} functions unless there's a strong case. IMO, we can
achieve what we want without these functions as well.
11) I think it's discussed, are we going to add the pid of the
checkpoint requestor?

As mentioned upthread, there can be multiple backends that request a
checkpoint, so unless we want to store an array of pids we should store the
number of backends that are waiting for a new checkpoint.
Yeah, you are right. Let's not go down the path of storing an array of
pids. I don't see a strong use-case with the pid of the process
requesting checkpoint. If required, we can add it later once the
pg_stat_progress_checkpoint view gets in.
Regards,
Bharath Rupireddy.
On Mon, Feb 28, 2022 at 06:03:54PM +0530, Bharath Rupireddy wrote:
On Mon, Feb 28, 2022 at 12:02 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
I suggested upthread to store the starting timeline instead. This way you can
deduce whether it's a restartpoint or a checkpoint, but you can also deduce
other information, like what was the starting WAL.

I don't understand why we need the timeline here to just determine
whether it's a restartpoint or checkpoint.
I'm not saying it's necessary, I'm saying that for the same space usage we can
store something a bit more useful. If no one cares about having the starting
timeline available for no extra cost then sure, let's just store the kind
directly.
Can't the checkpoint start LSN be deduced from
PROGRESS_CHECKPOINT_LSN, checkPoint.redo?
I'm not sure I'm following, isn't checkPoint.redo the checkpoint start LSN?
As mentioned upthread, there can be multiple backends that request a
checkpoint, so unless we want to store an array of pids we should store the
number of backends that are waiting for a new checkpoint.

Yeah, you are right. Let's not go down the path of storing an array of
pids. I don't see a strong use-case with the pid of the process
requesting checkpoint. If required, we can add it later once the
pg_stat_progress_checkpoint view gets in.
I don't think it's really necessary to give the pid list.
If you requested a new checkpoint, it doesn't matter if it's only your backend
that triggered it, another backend or a few other dozen, the result will be the
same and you have the information that the request has been seen. We could
store just a bool for that but having a number instead also gives a bit more
information and may allow you to detect some broken logic on your client code
if it keeps increasing.
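To make the "counter instead of pid list" idea concrete, here's a minimal
standalone sketch. The struct and function names are hypothetical and this
is not the actual checkpointer shmem layout; real code would also hold the
appropriate lock around these updates:

```c
#include <assert.h>

/*
 * Hypothetical sketch: instead of tracking *which* backends requested a
 * checkpoint, keep a plain counter that requesters bump and that the
 * checkpointer reports and resets when the next checkpoint starts.
 */
typedef struct CheckpointerShmemSketch
{
	int			num_requesters;	/* backends waiting for the next checkpoint */
} CheckpointerShmemSketch;

static void
request_checkpoint(CheckpointerShmemSketch *shmem)
{
	/* real code would protect this with a spinlock */
	shmem->num_requesters++;
}

static int
begin_checkpoint(CheckpointerShmemSketch *shmem)
{
	int			waiting = shmem->num_requesters;

	/* the checkpoint now starting will satisfy all current requesters */
	shmem->num_requesters = 0;
	return waiting;
}
```

The returned count is what the view could expose; a steadily increasing
value between checkpoints is the "broken client logic" signal mentioned
above.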
On Sun, 27 Feb 2022 at 16:14, Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
3) Why do we need this extra calculation for start_lsn? Do you ever
see a negative LSN or something here?

+ ('0/0'::pg_lsn + (
+     CASE
+         WHEN (s.param3 < 0) THEN pow((2)::numeric, (64)::numeric)
+         ELSE (0)::numeric
+     END + (s.param3)::numeric)) AS start_lsn,
Yes: an LSN can take up all of a uint64, whereas the pgstat column is a
bigint type; thus the signed int64. This cast is OK as it wraps
around, but that means we have to take care to correctly display the
LSN when it is > 0x7FFF_FFFF_FFFF_FFFF; which is what we do here using
the special-casing for negative values.
As to whether it is reasonable: Generating 16GB of wal every second
(2^34 bytes /sec) is probably not impossible (cpu <> memory bandwidth
has been > 20GB/sec for a while); and that leaves you 2^29 seconds of
database runtime; or about 17 years. Seeing that a cluster can be
`pg_upgrade`d (which doesn't reset the cluster LSN) since PG 9.0 from at
least version PG 8.4.0 (2009) (and, through pg_migrator, from 8.3.0),
we can assume that clusters hitting LSN=2^63 will be a reasonable
possibility within the next few years. As the lifespan of a PG release
is about 5 years, it doesn't seem impossible that there will be actual
clusters that are going to hit this naturally in the lifespan of PG15.
It is also possible that someone fat-fingers pg_resetwal; and creates
a cluster with LSN >= 2^63; resulting in negative values in the
s.param3 field. Not likely, but we can force such situations; and as
such we should handle that gracefully.
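The round trip being described can be shown in a few lines of standalone C.
The function names are illustrative only (not PostgreSQL functions): storing
the uint64 LSN in a signed bigint slot preserves the bit pattern, and reading
it back as unsigned undoes the wraparound, which is exactly what the SQL does
by adding 2^64 whenever the stored value is negative. (Strictly, the
uint64-to-int64 conversion is implementation-defined before C23, but all
mainstream compilers use two's complement.)

```c
#include <stdint.h>

/* Store an LSN (uint64) into a signed bigint progress slot. */
static int64_t
store_lsn_as_bigint(uint64_t lsn)
{
	return (int64_t) lsn;		/* sign may flip, bits are preserved */
}

/* Recover the original LSN; adds 2^64 for negative stored values. */
static uint64_t
bigint_to_lsn(int64_t stored)
{
	return (uint64_t) stored;
}
```

Any LSN above 0x7FFF_FFFF_FFFF_FFFF shows up negative in the bigint slot, so
the view's CASE expression is needed for a correct display.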
4) Can't you use timestamptz_in(to_char(s.param4)) instead of
pg_stat_get_progress_checkpoint_start_time? I don't quite understand
the reasoning for having this function and it's named as *checkpoint*
when it doesn't do anything specific to the checkpoint at all?
I hadn't thought of using the types' inout functions, but it looks
like timestamp IO functions use a formatted timestring, which won't
work with the epoch-based timestamp stored in the view.
Having 3 unnecessary functions that aren't useful to the users at all
in proc.dat will simply eat up the function oids IMO. Hence, I suggest
let's try to do without extra functions.
I agree that (1) could be simplified, or at least fully expressed in
SQL without exposing too many internals. If we're fine with exposing
internals like flags and type layouts, then (2), and arguably (4), can
be expressed in SQL as well.
-Matthias
3) Why do we need this extra calculation for start_lsn? Do you ever
see a negative LSN or something here?

+ ('0/0'::pg_lsn + (
+     CASE
+         WHEN (s.param3 < 0) THEN pow((2)::numeric, (64)::numeric)
+         ELSE (0)::numeric
+     END + (s.param3)::numeric)) AS start_lsn,

Yes: an LSN can take up all of a uint64, whereas the pgstat column is a
bigint type; thus the signed int64. This cast is OK as it wraps
around, but that means we have to take care to correctly display the
LSN when it is > 0x7FFF_FFFF_FFFF_FFFF; which is what we do here using
the special-casing for negative values.
Yes. The extra calculation is required here as we are storing a uint64
value in a variable of type int64. When we convert uint64 to int64
then the bit pattern is preserved (so no data is lost). The high-order
bit becomes the sign bit and if the sign bit is set, both the sign and
magnitude of the value changes. To safely get the actual uint64 value
whatever was assigned, we need the above calculations.
4) Can't you use timestamptz_in(to_char(s.param4)) instead of
pg_stat_get_progress_checkpoint_start_time? I don't quite understand
the reasoning for having this function and it's named as *checkpoint*
when it doesn't do anything specific to the checkpoint at all?

I hadn't thought of using the types' inout functions, but it looks
like timestamp IO functions use a formatted timestring, which won't
work with the epoch-based timestamp stored in the view.
There is a variation of to_timestamp() which takes UNIX epoch (float8)
as an argument and converts it to timestamptz but we cannot directly
call this function with S.param4.
TimestampTz
GetCurrentTimestamp(void)
{
    TimestampTz result;
    struct timeval tp;

    gettimeofday(&tp, NULL);

    result = (TimestampTz) tp.tv_sec -
        ((POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) * SECS_PER_DAY);
    result = (result * USECS_PER_SEC) + tp.tv_usec;

    return result;
}
S.param4 contains the output of the above function
(GetCurrentTimestamp()), which is based on the Postgres epoch, but
to_timestamp() expects a UNIX epoch as input. So some calculation is
required here. I feel the SQL 'to_timestamp(946684800 +
(S.param4::float / 1000000)) AS start_time' works fine. The value
'946684800' is equal to ((POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) *
SECS_PER_DAY). I am not sure whether it is good practice to use this
way. Kindly share your thoughts.
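The constant can be checked with a small standalone sketch. The macros below
mirror PostgreSQL's datetime headers, but the helper function itself is
illustrative, not a server function: a Postgres timestamp counts microseconds
since 2000-01-01, so converting to a Unix epoch value just shifts by the
1970-to-2000 offset, which is the 946684800 in the SQL above.

```c
#include <stdint.h>

/* Julian-date constants, as in PostgreSQL's datetime headers */
#define UNIX_EPOCH_JDATE     2440588	/* Julian date of 1970-01-01 */
#define POSTGRES_EPOCH_JDATE 2451545	/* Julian date of 2000-01-01 */
#define SECS_PER_DAY         86400
#define USECS_PER_SEC        INT64_C(1000000)

/*
 * Convert a Postgres timestamp (microseconds since 2000-01-01) to Unix
 * epoch seconds, matching the view's to_timestamp(946684800 + usecs/1e6).
 */
static double
pg_timestamp_to_unix_epoch(int64_t pg_usecs)
{
	return (double) ((int64_t) (POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) * SECS_PER_DAY)
		+ (double) pg_usecs / (double) USECS_PER_SEC;
}
```

So the hardcoded 946684800 is exactly (2451545 - 2440588) * 86400; whether to
spell it out or derive it in the view is a style question.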
Thanks & Regards,
Nitin Jadhav
On Mon, Feb 28, 2022 at 6:40 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
Thanks for reviewing.
I suggested upthread to store the starting timeline instead. This way you can
deduce whether it's a restartpoint or a checkpoint, but you can also deduce
other information, like what was the starting WAL.

I don't understand why we need the timeline here to just determine
whether it's a restartpoint or checkpoint.

I'm not saying it's necessary, I'm saying that for the same space usage we can
store something a bit more useful. If no one cares about having the starting
timeline available for no extra cost then sure, let's just store the kind
directly.
Fixed.
2) Can't we just have these checks inside CASE-WHEN-THEN-ELSE blocks
directly instead of new function pg_stat_get_progress_checkpoint_kind?

+ snprintf(ckpt_kind, MAXPGPATH, "%s%s%s%s%s%s%s%s%s",
+          (flags == 0) ? "unknown" : "",
+          (flags & CHECKPOINT_IS_SHUTDOWN) ? "shutdown " : "",
+          (flags & CHECKPOINT_END_OF_RECOVERY) ? "end-of-recovery " : "",
+          (flags & CHECKPOINT_IMMEDIATE) ? "immediate " : "",
+          (flags & CHECKPOINT_FORCE) ? "force " : "",
+          (flags & CHECKPOINT_WAIT) ? "wait " : "",
+          (flags & CHECKPOINT_CAUSE_XLOG) ? "wal " : "",
+          (flags & CHECKPOINT_CAUSE_TIME) ? "time " : "",
+          (flags & CHECKPOINT_FLUSH_ALL) ? "flush-all" : "");
Fixed.
---
5) Do we need a special phase for this checkpoint operation? I'm not
sure in which cases it will take a long time, but it looks like
there's a wait loop here.
vxids = GetVirtualXIDsDelayingChkpt(&nvxids);
if (nvxids > 0)
{
    do
    {
        pg_usleep(10000L);    /* wait for 10 msec */
    } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids));
}
Yes. It is better to add a separate phase here.
---
Also, how about special phases for SyncPostCheckpoint(),
SyncPreCheckpoint(), InvalidateObsoleteReplicationSlots(),
PreallocXlogFiles() (it currently pre-allocates only 1 WAL file, but
that might be increased in the future (?)), TruncateSUBTRANS()?
SyncPreCheckpoint() is just incrementing a counter and
PreallocXlogFiles() currently pre-allocates only 1 WAL file. I feel
there is no need to add any phases for these as of now. We can add in
the future if necessary. Added phases for SyncPostCheckpoint(),
InvalidateObsoleteReplicationSlots() and TruncateSUBTRANS().
---
6) SLRU (Simple LRU) isn't a phase here, you can just say
PROGRESS_CHECKPOINT_PHASE_PREDICATE_LOCK_PAGES.

+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+                              PROGRESS_CHECKPOINT_PHASE_SLRU_PAGES);
  CheckPointPredicate();

And :s/checkpointing SLRU pages/checkpointing predicate lock pages

+ WHEN 9 THEN 'checkpointing SLRU pages'
Fixed.
---
7) :s/PROGRESS_CHECKPOINT_PHASE_FILE_SYNC/PROGRESS_CHECKPOINT_PHASE_PROCESS_FILE_SYNC_REQUESTS
I feel PROGRESS_CHECKPOINT_PHASE_FILE_SYNC is a better option here as
it describes the purpose in fewer words.
And :s/WHEN 11 THEN 'performing sync requests'/WHEN 11 THEN
'processing file sync requests'
Fixed.
---
8) :s/Finalizing/finalizing
+ WHEN 14 THEN 'Finalizing'
Fixed.
---
9) :s/checkpointing snapshots/checkpointing logical replication snapshot files

+ WHEN 3 THEN 'checkpointing snapshots'

:s/checkpointing logical rewrite mappings/checkpointing logical
replication rewrite mapping files

+ WHEN 4 THEN 'checkpointing logical rewrite mappings'
Fixed.
---
10) I'm not sure if it's discussed, how about adding the number of
snapshot/mapping files so far the checkpoint has processed in file
processing while loops of
CheckPointSnapBuild/CheckPointLogicalRewriteHeap? Sometimes, there can
be many logical snapshot or mapping files and users may be interested
in knowing the so-far-processed-file-count.
I had thought about this while sharing the v1 patch and mentioned my
views upthread. I feel it won't give meaningful progress information
(It can be treated as statistics). Hence not included. Thoughts?
As mentioned upthread, there can be multiple backends that request a
checkpoint, so unless we want to store an array of pids we should store the
number of backends that are waiting for a new checkpoint.

Yeah, you are right. Let's not go down the path of storing an array of
pids. I don't see a strong use-case with the pid of the process
requesting checkpoint. If required, we can add it later once the
pg_stat_progress_checkpoint view gets in.

I don't think it's really necessary to give the pid list.
If you requested a new checkpoint, it doesn't matter if it's only your backend
that triggered it, another backend or a few other dozen, the result will be the
same and you have the information that the request has been seen. We could
store just a bool for that but having a number instead also gives a bit more
information and may allow you to detect some broken logic on your client code
if it keeps increasing.
It's a good metric to show in the view but the information is not
readily available. Additional code is required to calculate the number
of requests. Is it worth doing that? I feel this can be added later if
required.
Please find the v4 patch attached and share your thoughts.
Thanks & Regards,
Nitin Jadhav
On Tue, Mar 1, 2022 at 2:27 PM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
Show quoted text
3) Why do we need this extra calculation for start_lsn? Do you ever see a negative LSN or something here? + ('0/0'::pg_lsn + ( + CASE + WHEN (s.param3 < 0) THEN pow((2)::numeric, (64)::numeric) + ELSE (0)::numeric + END + (s.param3)::numeric)) AS start_lsn,Yes: LSN can take up all of an uint64; whereas the pgstat column is a
bigint type; thus the signed int64. This cast is OK as it wraps
around, but that means we have to take care to correctly display the
LSN when it is > 0x7FFF_FFFF_FFFF_FFFF; which is what we do here using
the special-casing for negative values.Yes. The extra calculation is required here as we are storing unit64
value in the variable of type int64. When we convert uint64 to int64
then the bit pattern is preserved (so no data is lost). The high-order
bit becomes the sign bit and if the sign bit is set, both the sign and
magnitude of the value changes. To safely get the actual uint64 value
whatever was assigned, we need the above calculations.4) Can't you use timestamptz_in(to_char(s.param4)) instead of
pg_stat_get_progress_checkpoint_start_time? I don't quite understand
the reasoning for having this function and it's named as *checkpoint*
when it doesn't do anything specific to the checkpoint at all?I hadn't thought of using the types' inout functions, but it looks
like timestamp IO functions use a formatted timestring, which won't
work with the epoch-based timestamp stored in the view.There is a variation of to_timestamp() which takes UNIX epoch (float8)
as an argument and converts it to timestamptz but we cannot directly
call this function with S.param4.TimestampTz
GetCurrentTimestamp(void)
{
TimestampTz result;
struct timeval tp;gettimeofday(&tp, NULL);
result = (TimestampTz) tp.tv_sec -
((POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) * SECS_PER_DAY);
result = (result * USECS_PER_SEC) + tp.tv_usec;return result;
}S.param4 contains the output of the above function
(GetCurrentTimestamp()) which returns Postgres epoch but the
to_timestamp() expects UNIX epoch as input. So some calculation is
required here. I feel the SQL 'to_timestamp(946684800 +
(S.param4::float / 1000000)) AS start_time' works fine. The value
'946684800' is equal to ((POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) *
SECS_PER_DAY). I am not sure whether it is good practice to use this
way. Kindly share your thoughts.Thanks & Regards,
Nitin JadhavOn Mon, Feb 28, 2022 at 6:40 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:On Sun, 27 Feb 2022 at 16:14, Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:3) Why do we need this extra calculation for start_lsn? Do you ever see a negative LSN or something here? + ('0/0'::pg_lsn + ( + CASE + WHEN (s.param3 < 0) THEN pow((2)::numeric, (64)::numeric) + ELSE (0)::numeric + END + (s.param3)::numeric)) AS start_lsn,Yes: LSN can take up all of an uint64; whereas the pgstat column is a
bigint type; thus the signed int64. This cast is OK as it wraps
around, but that means we have to take care to correctly display the
LSN when it is > 0x7FFF_FFFF_FFFF_FFFF; which is what we do here using
the special-casing for negative values.As to whether it is reasonable: Generating 16GB of wal every second
(2^34 bytes /sec) is probably not impossible (cpu <> memory bandwidth
has been > 20GB/sec for a while); and that leaves you 2^29 seconds of
database runtime; or about 17 years. Seeing that a cluster can be
`pg_upgrade`d (which doesn't reset cluster LSN) since PG 9.0 from at
least version PG 8.4.0 (2009) (and through pg_migrator, from 8.3.0)),
we can assume that clusters hitting LSN=2^63 will be a reasonable
possibility within the next few years. As the lifespan of a PG release
is about 5 years, it doesn't seem impossible that there will be actual
clusters that are going to hit this naturally in the lifespan of PG15.It is also possible that someone fat-fingers pg_resetwal; and creates
a cluster with LSN >= 2^63; resulting in negative values in the
s.param3 field. Not likely, but we can force such situations; and as
such we should handle that gracefully.4) Can't you use timestamptz_in(to_char(s.param4)) instead of
pg_stat_get_progress_checkpoint_start_time? I don't quite understand
the reasoning for having this function and it's named as *checkpoint*
when it doesn't do anything specific to the checkpoint at all?I hadn't thought of using the types' inout functions, but it looks
like timestamp IO functions use a formatted timestring, which won't
work with the epoch-based timestamp stored in the view.

Having 3 unnecessary functions that aren't useful to the users at all
in proc.dat will simply eat up the function OIDs IMO. Hence, I suggest
we try to do without the extra functions.

I agree that (1) could be simplified, or at least fully expressed in
SQL without exposing too many internals. If we're fine with exposing
internals like flags and type layouts, then (2), and arguably (4), can
be expressed in SQL as well.

-Matthias
Attachments:
v4-0001-pg_stat_progress_checkpoint-view.patch (application/octet-stream)
From a3ff5c7b194d1128ab0272e546913baeafc04a0c Mon Sep 17 00:00:00 2001
From: Nitin Jadhav <nitinjadhav@microsoft.com>
Date: Wed, 2 Mar 2022 11:06:57 +0000
Subject: [PATCH] pg_stat_progress_checkpoint view
---
doc/src/sgml/monitoring.sgml | 384 +++++++++++++++++++++++++++
doc/src/sgml/ref/checkpoint.sgml | 6 +
doc/src/sgml/wal.sgml | 5 +-
src/backend/access/transam/xlog.c | 98 +++++++
src/backend/catalog/system_views.sql | 50 ++++
src/backend/storage/buffer/bufmgr.c | 7 +
src/backend/storage/sync/sync.c | 6 +
src/backend/utils/adt/pgstatfuncs.c | 2 +
src/include/commands/progress.h | 36 +++
src/include/utils/backend_progress.h | 3 +-
src/test/regress/expected/rules.out | 76 ++++++
11 files changed, 671 insertions(+), 2 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 9fb62fec8e..8d7c8ffc92 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -401,6 +401,13 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
See <xref linkend='copy-progress-reporting'/>.
</entry>
</row>
+
+ <row>
+ <entry><structname>pg_stat_progress_checkpoint</structname><indexterm><primary>pg_stat_progress_checkpoint</primary></indexterm></entry>
+ <entry>One row only, showing the progress of the checkpoint.
+ See <xref linkend='checkpoint-progress-reporting'/>.
+ </entry>
+ </row>
</tbody>
</tgroup>
</table>
@@ -6844,6 +6851,383 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
</table>
</sect2>
+ <sect2 id="checkpoint-progress-reporting">
+ <title>Checkpoint Progress Reporting</title>
+
+ <indexterm>
+ <primary>pg_stat_progress_checkpoint</primary>
+ </indexterm>
+
+ <para>
+ Whenever the checkpoint operation is running, the
+ <structname>pg_stat_progress_checkpoint</structname> view will contain a
+ single row indicating the progress of the checkpoint. The tables below
+ describe the information that will be reported and provide information about
+ how to interpret it.
+ </para>
+
+ <table id="pg-stat-progress-checkpoint-view" xreflabel="pg_stat_progress_checkpoint">
+ <title><structname>pg_stat_progress_checkpoint</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>pid</structfield> <type>integer</type>
+ </para>
+ <para>
+ Process ID of the checkpointer process.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of checkpoint. See <xref linkend="checkpoint-types"/>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>kind</structfield> <type>text</type>
+ </para>
+ <para>
+ Kind of checkpoint. See <xref linkend="checkpoint-kinds"/>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>start_lsn</structfield> <type>text</type>
+ </para>
+ <para>
+ The checkpoint start location.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>start_time</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Start time of the checkpoint.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>phase</structfield> <type>text</type>
+ </para>
+ <para>
+ Current processing phase. See <xref linkend="checkpoint-phases"/>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>buffers_total</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Total number of buffers to be written. This is estimated and reported
+ as of the beginning of the buffer-writing phase.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>buffers_processed</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers processed. This counter increases when the targeted
+ buffer is processed. This number will eventually become equal to
+ <literal>buffers_total</literal> when the checkpoint is
+ complete.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>buffers_written</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers written. This counter only advances when the targeted
+ buffer is written. Note that some buffers may be processed but not
+ need to be written, so this count will always be less than or
+ equal to <literal>buffers_total</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>files_total</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Total number of files to be synced. This is estimated and reported as
+ of the beginning of the file-sync phase.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>files_synced</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of files synced. This counter advances when the targeted file is
+ synced. This number will eventually become equal to
+ <literal>files_total</literal> when the checkpoint is complete.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <table id="checkpoint-types">
+ <title>Checkpoint types</title>
+ <tgroup cols="2">
+ <colspec colname="col1" colwidth="1*"/>
+ <colspec colname="col2" colwidth="2*"/>
+ <thead>
+ <row>
+ <entry>Types</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry><literal>checkpoint</literal></entry>
+ <entry>
+ The current operation is a checkpoint.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>restartpoint</literal></entry>
+ <entry>
+ The current operation is a restartpoint.
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <table id="checkpoint-kinds">
+ <title>Checkpoint kinds</title>
+ <tgroup cols="2">
+ <colspec colname="col1" colwidth="1*"/>
+ <colspec colname="col2" colwidth="2*"/>
+ <thead>
+ <row>
+ <entry>Kinds</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry><literal>shutdown</literal></entry>
+ <entry>
+ The checkpoint is for shutdown.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>end-of-recovery</literal></entry>
+ <entry>
+ The checkpoint is for end-of-recovery.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>immediate</literal></entry>
+ <entry>
+ The checkpoint happens without delay.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>force</literal></entry>
+ <entry>
+ The checkpoint is started because some operation forced a checkpoint.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>flush-all</literal></entry>
+ <entry>
+ The checkpoint flushes all pages, including those belonging to unlogged
+ tables.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>wait</literal></entry>
+ <entry>
+ The requesting backend waits for the checkpoint to complete before returning.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>requested</literal></entry>
+ <entry>
+ The checkpoint request has been made.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>wal</literal></entry>
+ <entry>
+ The checkpoint is started because <literal>max_wal_size</literal> is
+ reached.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>time</literal></entry>
+ <entry>
+ The checkpoint is started because <literal>checkpoint_timeout</literal>
+ expired.
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <table id="checkpoint-phases">
+ <title>Checkpoint phases</title>
+ <tgroup cols="2">
+ <colspec colname="col1" colwidth="1*"/>
+ <colspec colname="col2" colwidth="2*"/>
+ <thead>
+ <row>
+ <entry>Phase</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry><literal>initializing</literal></entry>
+ <entry>
+ The checkpointer process is preparing to begin the checkpoint operation.
+ This phase is expected to be very brief.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>getting virtual transaction IDs</literal></entry>
+ <entry>
+ The checkpointer process is getting the virtual transaction IDs that
+ are delaying the checkpoint.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing replication slots</literal></entry>
+ <entry>
+ The checkpointer process is currently flushing all the replication slots
+ to disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing logical replication snapshot files</literal></entry>
+ <entry>
+ The checkpointer process is currently removing all the serialized
+ snapshot files that are not required anymore.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing logical rewrite mapping files</literal></entry>
+ <entry>
+ The checkpointer process is currently removing/flushing the logical
+ rewrite mapping files.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing commit log pages</literal></entry>
+ <entry>
+ The checkpointer process is currently writing commit log pages to disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing commit time stamp pages</literal></entry>
+ <entry>
+ The checkpointer process is currently writing commit time stamp pages to
+ disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing subtransaction pages</literal></entry>
+ <entry>
+ The checkpointer process is currently writing subtransaction pages to disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing multixact pages</literal></entry>
+ <entry>
+ The checkpointer process is currently writing multixact pages to disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing predicate lock pages</literal></entry>
+ <entry>
+ The checkpointer process is currently writing predicate lock pages to disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing buffers</literal></entry>
+ <entry>
+ The checkpointer process is currently writing buffers to disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>processing file sync requests</literal></entry>
+ <entry>
+ The checkpointer process is currently processing file sync requests.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>performing two phase checkpoint</literal></entry>
+ <entry>
+ The checkpointer process is currently performing the two-phase checkpoint.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>performing post checkpoint cleanup</literal></entry>
+ <entry>
+ The checkpointer process is currently performing post checkpoint cleanup.
+ It removes any lingering files that can be safely removed.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>invalidating replication slots</literal></entry>
+ <entry>
+ The checkpointer process is currently invalidating replication slots.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>recycling old XLOG files</literal></entry>
+ <entry>
+ The checkpointer process is currently recycling old XLOG files.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>truncating subtransactions</literal></entry>
+ <entry>
+ The checkpointer process is currently removing the subtransaction
+ segments.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>finalizing</literal></entry>
+ <entry>
+ The checkpointer process is finalizing the checkpoint operation.
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ </sect2>
+
</sect1>
<sect1 id="dynamic-trace">
diff --git a/doc/src/sgml/ref/checkpoint.sgml b/doc/src/sgml/ref/checkpoint.sgml
index 1cebc03d15..a88c76533a 100644
--- a/doc/src/sgml/ref/checkpoint.sgml
+++ b/doc/src/sgml/ref/checkpoint.sgml
@@ -56,6 +56,12 @@ CHECKPOINT
the <link linkend="predefined-roles-table"><literal>pg_checkpointer</literal></link>
role can call <command>CHECKPOINT</command>.
</para>
+
+ <para>
+ The checkpointer process running the checkpoint will report its progress
+ in the <structname>pg_stat_progress_checkpoint</structname> view. See
+ <xref linkend="checkpoint-progress-reporting"/> for details.
+ </para>
</refsect1>
<refsect1>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index 2bb27a8468..a75d1d63d0 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -530,7 +530,10 @@
adjust the <xref linkend="guc-archive-timeout"/> parameter rather than the
checkpoint parameters.)
It is also possible to force a checkpoint by using the SQL
- command <command>CHECKPOINT</command>.
+ command <command>CHECKPOINT</command>. The checkpointer process running the
+ checkpoint will report its progress in the
+ <structname>pg_stat_progress_checkpoint</structname> view. See
+ <xref linkend="checkpoint-progress-reporting"/> for details.
</para>
<para>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0d2bd7a357..9591f2cecc 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -65,6 +65,7 @@
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
#include "catalog/pg_database.h"
+#include "commands/progress.h"
#include "common/controldata_utils.h"
#include "executor/instrument.h"
#include "miscadmin.h"
@@ -719,6 +720,8 @@ static void WALInsertLockAcquireExclusive(void);
static void WALInsertLockRelease(void);
static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
+static void checkpoint_progress_start(int flags, int type);
+
/*
* Insert an XLOG record represented by an already-constructed chain of data
* chunks. This is a low-level routine; to construct the WAL record header
@@ -6296,6 +6299,9 @@ CreateCheckPoint(int flags)
MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
+ /* Prepare to report progress of the checkpoint. */
+ checkpoint_progress_start(flags, PROGRESS_CHECKPOINT_TYPE_CHECKPOINT);
+
/*
* Use a critical section to force system panic if we have trouble.
*/
@@ -6394,6 +6400,7 @@ CreateCheckPoint(int flags)
curInsert += SizeOfXLogShortPHD;
}
checkPoint.redo = curInsert;
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_LSN, checkPoint.redo);
/*
* Here we update the shared RedoRecPtr for future XLogInsert calls; this
@@ -6501,6 +6508,8 @@ CreateCheckPoint(int flags)
* and we will correctly flush the update below. So we cannot miss any
* xacts we need to wait for.
*/
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_GET_VIRTUAL_TRANSACTION_IDS);
vxids = GetVirtualXIDsDelayingChkpt(&nvxids);
if (nvxids > 0)
{
@@ -6604,6 +6613,8 @@ CreateCheckPoint(int flags)
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_POST_CHECKPOINT_CLEANUP);
SyncPostCheckpoint();
/*
@@ -6619,6 +6630,9 @@ CreateCheckPoint(int flags)
*/
XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
KeepLogSeg(recptr, &_logSegNo);
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_INVALIDATE_REPLI_SLOTS);
+
if (InvalidateObsoleteReplicationSlots(_logSegNo))
{
/*
@@ -6629,6 +6643,8 @@ CreateCheckPoint(int flags)
KeepLogSeg(recptr, &_logSegNo);
}
_logSegNo--;
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_OLD_XLOG_RECYCLE);
RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr,
checkPoint.ThisTimeLineID);
@@ -6647,11 +6663,21 @@ CreateCheckPoint(int flags)
* StartupSUBTRANS hasn't been called yet.
*/
if (!RecoveryInProgress())
+ {
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_TRUNCATE_SUBTRANS);
TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());
+ }
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_FINALIZE);
/* Real work is done; log and update stats. */
LogCheckpointEnd(false);
+ /* Stop reporting progress of the checkpoint. */
+ pgstat_progress_end_command();
+
/* Reset the process title */
update_checkpoint_display(flags, false, true);
@@ -6808,29 +6834,60 @@ static void
CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
CheckPointRelationMap();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_REPLI_SLOTS);
CheckPointReplicationSlots();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_SNAPSHOTS);
CheckPointSnapBuild();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_LOGICAL_REWRITE_MAPPINGS);
CheckPointLogicalRewriteHeap();
CheckPointReplicationOrigin();
/* Write out all dirty data in SLRUs and the main buffer pool */
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_CLOG_PAGES);
CheckPointCLOG();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_COMMITTS_PAGES);
CheckPointCommitTs();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_SUBTRANS_PAGES);
CheckPointSUBTRANS();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_MULTIXACT_PAGES);
CheckPointMultiXact();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_PREDICATE_LOCK_PAGES);
CheckPointPredicate();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_BUFFERS);
CheckPointBuffers(flags);
/* Perform all queued up fsyncs */
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_FILE_SYNC);
ProcessSyncRequests();
CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();
/* We deliberately delay 2PC checkpointing as long as possible */
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_TWO_PHASE);
CheckPointTwoPhase(checkPointRedo);
}
@@ -6977,6 +7034,9 @@ CreateRestartPoint(int flags)
MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
+ /* Prepare to report progress of the restartpoint. */
+ checkpoint_progress_start(flags, PROGRESS_CHECKPOINT_TYPE_RESTARTPOINT);
+
if (log_checkpoints)
LogCheckpointStart(flags, true);
@@ -7051,6 +7111,9 @@ CreateRestartPoint(int flags)
replayPtr = GetXLogReplayRecPtr(&replayTLI);
endptr = (receivePtr < replayPtr) ? replayPtr : receivePtr;
KeepLogSeg(endptr, &_logSegNo);
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_INVALIDATE_REPLI_SLOTS);
+
if (InvalidateObsoleteReplicationSlots(_logSegNo))
{
/*
@@ -7077,6 +7140,8 @@ CreateRestartPoint(int flags)
if (!RecoveryInProgress())
replayTLI = XLogCtl->InsertTimeLineID;
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_OLD_XLOG_RECYCLE);
RemoveOldXlogFiles(_logSegNo, RedoRecPtr, endptr, replayTLI);
/*
@@ -7093,11 +7158,20 @@ CreateRestartPoint(int flags)
* this because StartupSUBTRANS hasn't been called yet.
*/
if (EnableHotStandby)
+ {
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_TRUNCATE_SUBTRANS);
TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());
+ }
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_FINALIZE);
/* Real work is done; log and update stats. */
LogCheckpointEnd(true);
+ /* Stop reporting progress of the restartpoint. */
+ pgstat_progress_end_command();
+
/* Reset the process title */
update_checkpoint_display(flags, true, true);
@@ -9197,3 +9271,27 @@ SetWalWriterSleeping(bool sleeping)
XLogCtl->WalWriterSleeping = sleeping;
SpinLockRelease(&XLogCtl->info_lck);
}
+
+/*
+ * Start reporting progress of the checkpoint.
+ */
+static void
+checkpoint_progress_start(int flags, int type)
+{
+ const int index[] = {
+ PROGRESS_CHECKPOINT_TYPE,
+ PROGRESS_CHECKPOINT_KIND,
+ PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_START_TIMESTAMP
+ };
+ int64 val[4];
+
+ pgstat_progress_start_command(PROGRESS_COMMAND_CHECKPOINT, InvalidOid);
+
+ val[0] = type;
+ val[1] = flags;
+ val[2] = PROGRESS_CHECKPOINT_PHASE_INIT;
+ val[3] = CheckpointStats.ckpt_start_t;
+
+ pgstat_progress_update_multi_param(4, index, val);
+}
\ No newline at end of file
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 40b7bca5a9..1e5b6e995f 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1273,3 +1273,53 @@ CREATE VIEW pg_stat_subscription_stats AS
ss.stats_reset
FROM pg_subscription as s,
pg_stat_get_subscription_stats(s.oid) as ss;
+
+CREATE VIEW pg_stat_progress_checkpoint AS
+ SELECT
+ S.pid AS pid,
+ CASE S.param1 WHEN 1 THEN 'checkpoint'
+ WHEN 2 THEN 'restartpoint'
+ END AS type,
+ ( CASE WHEN (S.param2 & 1) > 0 THEN 'shutdown ' ELSE '' END ||
+ CASE WHEN (S.param2 & 2) > 0 THEN 'end-of-recovery ' ELSE '' END ||
+ CASE WHEN (S.param2 & 4) > 0 THEN 'immediate ' ELSE '' END ||
+ CASE WHEN (S.param2 & 8) > 0 THEN 'force ' ELSE '' END ||
+ CASE WHEN (S.param2 & 16) > 0 THEN 'flush-all ' ELSE '' END ||
+ CASE WHEN (S.param2 & 32) > 0 THEN 'wait ' ELSE '' END ||
+ CASE WHEN (S.param2 & 64) > 0 THEN 'requested ' ELSE '' END ||
+ CASE WHEN (S.param2 & 128) > 0 THEN 'wal ' ELSE '' END ||
+ CASE WHEN (S.param2 & 256) > 0 THEN 'time ' ELSE '' END
+ ) AS kind,
+ ( '0/0'::pg_lsn +
+ ((CASE
+ WHEN S.param3 < 0 THEN pow(2::numeric, 64::numeric)::numeric
+ ELSE 0::numeric
+ END) +
+ S.param3::numeric)
+ ) AS start_lsn,
+ to_timestamp(946684800 + (S.param4::float8 / 1000000)) AS start_time,
+ CASE S.param5 WHEN 1 THEN 'initializing'
+ WHEN 2 THEN 'getting virtual transaction IDs'
+ WHEN 3 THEN 'checkpointing replication slots'
+ WHEN 4 THEN 'checkpointing logical replication snapshot files'
+ WHEN 5 THEN 'checkpointing logical rewrite mapping files'
+ WHEN 6 THEN 'checkpointing commit log pages'
+ WHEN 7 THEN 'checkpointing commit time stamp pages'
+ WHEN 8 THEN 'checkpointing subtransaction pages'
+ WHEN 9 THEN 'checkpointing multixact pages'
+ WHEN 10 THEN 'checkpointing predicate lock pages'
+ WHEN 11 THEN 'checkpointing buffers'
+ WHEN 12 THEN 'processing file sync requests'
+ WHEN 13 THEN 'performing two phase checkpoint'
+ WHEN 14 THEN 'performing post checkpoint cleanup'
+ WHEN 15 THEN 'invalidating replication slots'
+ WHEN 16 THEN 'recycling old XLOG files'
+ WHEN 17 THEN 'truncating subtransactions'
+ WHEN 18 THEN 'finalizing'
+ END AS phase,
+ S.param6 AS buffers_total,
+ S.param7 AS buffers_processed,
+ S.param8 AS buffers_written,
+ S.param9 AS files_total,
+ S.param10 AS files_synced
+ FROM pg_stat_get_progress_info('CHECKPOINT') AS S;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f5459c68f8..9663035d7a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -38,6 +38,7 @@
#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/storage.h"
+#include "commands/progress.h"
#include "executor/instrument.h"
#include "lib/binaryheap.h"
#include "miscadmin.h"
@@ -2012,6 +2013,8 @@ BufferSync(int flags)
WritebackContextInit(&wb_context, &checkpoint_flush_after);
TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_scan);
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_BUFFERS_TOTAL,
+ num_to_scan);
/*
* Sort buffers that need to be written to reduce the likelihood of random
@@ -2129,6 +2132,8 @@ BufferSync(int flags)
bufHdr = GetBufferDescriptor(buf_id);
num_processed++;
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_BUFFERS_PROCESSED,
+ num_processed);
/*
* We don't need to acquire the lock here, because we're only looking
@@ -2149,6 +2154,8 @@ BufferSync(int flags)
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
PendingCheckpointerStats.m_buf_written_checkpoints++;
num_written++;
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_BUFFERS_WRITTEN,
+ num_written);
}
}
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index e161d57761..638d3eb781 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -23,6 +23,7 @@
#include "access/multixact.h"
#include "access/xlog.h"
#include "access/xlogutils.h"
+#include "commands/progress.h"
#include "commands/tablespace.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -356,6 +357,9 @@ ProcessSyncRequests(void)
/* Now scan the hashtable for fsync requests to process */
absorb_counter = FSYNCS_PER_ABSORB;
hash_seq_init(&hstat, pendingOps);
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_FILES_TOTAL,
+ hash_get_num_entries(pendingOps));
+
while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
{
int failures;
@@ -419,6 +423,8 @@ ProcessSyncRequests(void)
longest = elapsed;
total_elapsed += elapsed;
processed++;
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_FILES_SYNCED,
+ processed);
if (log_checkpoints)
elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f ms",
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index fd993d0d5f..95df730415 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -494,6 +494,8 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
cmdtype = PROGRESS_COMMAND_BASEBACKUP;
else if (pg_strcasecmp(cmd, "COPY") == 0)
cmdtype = PROGRESS_COMMAND_COPY;
+ else if (pg_strcasecmp(cmd, "CHECKPOINT") == 0)
+ cmdtype = PROGRESS_COMMAND_CHECKPOINT;
else
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index a28938caf4..7064026bf1 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -151,4 +151,40 @@
#define PROGRESS_COPY_TYPE_PIPE 3
#define PROGRESS_COPY_TYPE_CALLBACK 4
+/* Progress parameters for checkpoint */
+#define PROGRESS_CHECKPOINT_TYPE 0
+#define PROGRESS_CHECKPOINT_KIND 1
+#define PROGRESS_CHECKPOINT_LSN 2
+#define PROGRESS_CHECKPOINT_START_TIMESTAMP 3
+#define PROGRESS_CHECKPOINT_PHASE 4
+#define PROGRESS_CHECKPOINT_BUFFERS_TOTAL 5
+#define PROGRESS_CHECKPOINT_BUFFERS_PROCESSED 6
+#define PROGRESS_CHECKPOINT_BUFFERS_WRITTEN 7
+#define PROGRESS_CHECKPOINT_FILES_TOTAL 8
+#define PROGRESS_CHECKPOINT_FILES_SYNCED 9
+
+/* Types of checkpoint (as advertised via PROGRESS_CHECKPOINT_TYPE) */
+#define PROGRESS_CHECKPOINT_TYPE_CHECKPOINT 1
+#define PROGRESS_CHECKPOINT_TYPE_RESTARTPOINT 2
+
+/* Phases of checkpoint (as advertised via PROGRESS_CHECKPOINT_PHASE) */
+#define PROGRESS_CHECKPOINT_PHASE_INIT 1
+#define PROGRESS_CHECKPOINT_PHASE_GET_VIRTUAL_TRANSACTION_IDS 2
+#define PROGRESS_CHECKPOINT_PHASE_REPLI_SLOTS 3
+#define PROGRESS_CHECKPOINT_PHASE_SNAPSHOTS 4
+#define PROGRESS_CHECKPOINT_PHASE_LOGICAL_REWRITE_MAPPINGS 5
+#define PROGRESS_CHECKPOINT_PHASE_CLOG_PAGES 6
+#define PROGRESS_CHECKPOINT_PHASE_COMMITTS_PAGES 7
+#define PROGRESS_CHECKPOINT_PHASE_SUBTRANS_PAGES 8
+#define PROGRESS_CHECKPOINT_PHASE_MULTIXACT_PAGES 9
+#define PROGRESS_CHECKPOINT_PHASE_PREDICATE_LOCK_PAGES 10
+#define PROGRESS_CHECKPOINT_PHASE_BUFFERS 11
+#define PROGRESS_CHECKPOINT_PHASE_FILE_SYNC 12
+#define PROGRESS_CHECKPOINT_PHASE_TWO_PHASE 13
+#define PROGRESS_CHECKPOINT_PHASE_POST_CHECKPOINT_CLEANUP 14
+#define PROGRESS_CHECKPOINT_PHASE_INVALIDATE_REPLI_SLOTS 15
+#define PROGRESS_CHECKPOINT_PHASE_OLD_XLOG_RECYCLE 16
+#define PROGRESS_CHECKPOINT_PHASE_TRUNCATE_SUBTRANS 17
+#define PROGRESS_CHECKPOINT_PHASE_FINALIZE 18
+
#endif
diff --git a/src/include/utils/backend_progress.h b/src/include/utils/backend_progress.h
index 47bf8029b0..02d51fb948 100644
--- a/src/include/utils/backend_progress.h
+++ b/src/include/utils/backend_progress.h
@@ -27,7 +27,8 @@ typedef enum ProgressCommandType
PROGRESS_COMMAND_CLUSTER,
PROGRESS_COMMAND_CREATE_INDEX,
PROGRESS_COMMAND_BASEBACKUP,
- PROGRESS_COMMAND_COPY
+ PROGRESS_COMMAND_COPY,
+ PROGRESS_COMMAND_CHECKPOINT
} ProgressCommandType;
#define PGSTAT_NUM_PROGRESS_PARAM 20
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index ac468568a1..11626a4200 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1897,6 +1897,82 @@ pg_stat_progress_basebackup| SELECT s.pid,
s.param4 AS tablespaces_total,
s.param5 AS tablespaces_streamed
FROM pg_stat_get_progress_info('BASEBACKUP'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20);
+pg_stat_progress_checkpoint| SELECT s.pid,
+ CASE s.param1
+ WHEN 1 THEN 'checkpoint'::text
+ WHEN 2 THEN 'restartpoint'::text
+ ELSE NULL::text
+ END AS type,
+ ((((((((
+ CASE
+ WHEN ((s.param2 & (1)::bigint) > 0) THEN 'shutdown '::text
+ ELSE ''::text
+ END ||
+ CASE
+ WHEN ((s.param2 & (2)::bigint) > 0) THEN 'end-of-recovery '::text
+ ELSE ''::text
+ END) ||
+ CASE
+ WHEN ((s.param2 & (4)::bigint) > 0) THEN 'immediate '::text
+ ELSE ''::text
+ END) ||
+ CASE
+ WHEN ((s.param2 & (8)::bigint) > 0) THEN 'force '::text
+ ELSE ''::text
+ END) ||
+ CASE
+ WHEN ((s.param2 & (16)::bigint) > 0) THEN 'flush-all '::text
+ ELSE ''::text
+ END) ||
+ CASE
+ WHEN ((s.param2 & (32)::bigint) > 0) THEN 'wait '::text
+ ELSE ''::text
+ END) ||
+ CASE
+ WHEN ((s.param2 & (64)::bigint) > 0) THEN 'requested '::text
+ ELSE ''::text
+ END) ||
+ CASE
+ WHEN ((s.param2 & (128)::bigint) > 0) THEN 'wal '::text
+ ELSE ''::text
+ END) ||
+ CASE
+ WHEN ((s.param2 & (256)::bigint) > 0) THEN 'time '::text
+ ELSE ''::text
+ END) AS kind,
+ ('0/0'::pg_lsn + (
+ CASE
+ WHEN (s.param3 < 0) THEN pow((2)::numeric, (64)::numeric)
+ ELSE (0)::numeric
+ END + (s.param3)::numeric)) AS start_lsn,
+ to_timestamp(((946684800)::double precision + ((s.param4)::double precision / (1000000)::double precision))) AS start_time,
+ CASE s.param5
+ WHEN 1 THEN 'initializing'::text
+ WHEN 2 THEN 'getting virtual transaction IDs'::text
+ WHEN 3 THEN 'checkpointing replication slots'::text
+ WHEN 4 THEN 'checkpointing logical replication snapshot files'::text
+ WHEN 5 THEN 'checkpointing logical rewrite mapping files'::text
+ WHEN 6 THEN 'checkpointing commit log pages'::text
+ WHEN 7 THEN 'checkpointing commit time stamp pages'::text
+ WHEN 8 THEN 'checkpointing subtransaction pages'::text
+ WHEN 9 THEN 'checkpointing multixact pages'::text
+ WHEN 10 THEN 'checkpointing predicate lock pages'::text
+ WHEN 11 THEN 'checkpointing buffers'::text
+ WHEN 12 THEN 'processing file sync requests'::text
+ WHEN 13 THEN 'performing two phase checkpoint'::text
+ WHEN 14 THEN 'performing post checkpoint cleanup'::text
+ WHEN 15 THEN 'invalidating replication slots'::text
+ WHEN 16 THEN 'recycling old XLOG files'::text
+ WHEN 17 THEN 'truncating subtransactions'::text
+ WHEN 18 THEN 'finalizing'::text
+ ELSE NULL::text
+ END AS phase,
+ s.param6 AS buffers_total,
+ s.param7 AS buffers_processed,
+ s.param8 AS buffers_written,
+ s.param9 AS files_total,
+ s.param10 AS files_synced
+ FROM pg_stat_get_progress_info('CHECKPOINT'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20);
pg_stat_progress_cluster| SELECT s.pid,
s.datid,
d.datname,
On Wed, Mar 2, 2022 at 4:45 PM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
Also, how about special phases for SyncPostCheckpoint(),
SyncPreCheckpoint(), InvalidateObsoleteReplicationSlots(),
PreallocXlogFiles() (it currently pre-allocates only 1 WAL file, but
it might increase in future (?)), TruncateSUBTRANS()?

SyncPreCheckpoint() is just incrementing a counter and
PreallocXlogFiles() currently pre-allocates only 1 WAL file. I feel
there is no need to add any phases for these as of now. We can add in
the future if necessary. Added phases for SyncPostCheckpoint(),
InvalidateObsoleteReplicationSlots() and TruncateSUBTRANS().
Okay.
10) I'm not sure if it's discussed, but how about adding the number of
snapshot/mapping files the checkpoint has processed so far in the
file-processing while loops of
CheckPointSnapBuild/CheckPointLogicalRewriteHeap? Sometimes there can
be many logical snapshot or mapping files and users may be interested
in knowing the so-far-processed-file-count.

I had thought about this while sharing the v1 patch and mentioned my
views upthread. I feel it won't give meaningful progress information
(it can be treated as statistics). Hence not included. Thoughts?
Okay. If there are any complaints about it we can always add them later.
As mentioned upthread, there can be multiple backends that request a
checkpoint, so unless we want to store an array of pids we should store the
number of backends that are waiting for a new checkpoint.

Yeah, you are right. Let's not go down that path of storing an array of
pids. I don't see a strong use-case for the pid of the process
requesting the checkpoint. If required, we can add it later once the
pg_stat_progress_checkpoint view gets in.

I don't think that's really necessary to give the pid list.
If you requested a new checkpoint, it doesn't matter if it's only your backend
that triggered it, another backend or a few dozen others; the result will be the
same and you have the information that the request has been seen. We could
store just a bool for that, but having a number instead also gives a bit more
information and may allow you to detect some broken logic in your client code
if it keeps increasing.

It's a good metric to show in the view but the information is not
readily available. Additional code is required to calculate the number
of requests. Is it worth doing that? I feel this can be added later if
required.
Yes, we can always add it later if required.
Please find the v4 patch attached and share your thoughts.
I reviewed v4 patch, here are my comments:
1) Can we convert below into pgstat_progress_update_multi_param, just
to avoid function calls?
pgstat_progress_update_param(PROGRESS_CHECKPOINT_LSN, checkPoint.redo);
pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
2) Why are we not having a special phase for CheckPointReplicationOrigin
as it does a good bunch of work (writing to disk, XLogFlush,
durable_rename), especially when max_replication_slots is large?
3) I don't think "requested" is necessary here as it doesn't add any
value or it's not a checkpoint kind or such, you can remove it.
4) s:/'recycling old XLOG files'/'recycling old WAL files'
+ WHEN 16 THEN 'recycling old XLOG files'
5) Can we place the CREATE VIEW pg_stat_progress_checkpoint AS definition
next to pg_stat_progress_copy in system_views.sql? It looks like all
the progress reporting views are next to each other.
6) How about shutdown and end-of-recovery checkpoint? Are you planning
to have an ereport_startup_progress mechanism as 0002?
7) I think you don't need to call checkpoint_progress_start and
pgstat_progress_update_param, any other progress reporting function
for shutdown and end-of-recovery checkpoint right?
8) Not for all kinds of checkpoints right? pg_stat_progress_checkpoint
can't show progress report for shutdown and end-of-recovery
checkpoint, I think you need to specify that here in wal.sgml and
checkpoint.sgml.
+ command <command>CHECKPOINT</command>. The checkpointer process running the
+ checkpoint will report its progress in the
+ <structname>pg_stat_progress_checkpoint</structname> view. See
+ <xref linkend="checkpoint-progress-reporting"/> for details.
9) Can you add a test case for pg_stat_progress_checkpoint view? I
think it's good to add one. See, below for reference:
-- Add a trigger to catch and print the contents of the catalog view
-- pg_stat_progress_copy during data insertion. This allows to test
-- the validation of some progress reports for COPY FROM where the trigger
-- would fire.
create function notice_after_tab_progress_reporting() returns trigger AS
$$
declare report record;
10) Typo: it's not "is happens"
+ The checkpoint is happens without delays.
11) Can you be specific what are those "some operations" that forced a
checkpoint? May be like, basebackup, createdb or something?
+ The checkpoint is started because some operation forced a checkpoint.
12) Can you be a bit more elaborate here about who waits? Something like:
the backend that requested the checkpoint will wait until its completion ....
+ Wait for completion before returning.
13) "removing unneeded or flushing needed logical rewrite mapping files"
+ The checkpointer process is currently removing/flushing the logical
14) "old WAL files"
+ The checkpointer process is currently recycling old XLOG files.
Regards,
Bharath Rupireddy.
Here are some of my review comments on the latest patch:
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of checkpoint. See <xref linkend="checkpoint-types"/>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>kind</structfield> <type>text</type>
+ </para>
+ <para>
+ Kind of checkpoint. See <xref linkend="checkpoint-kinds"/>.
+ </para></entry>
+ </row>
This looks a bit confusing. Two columns, one with the name "checkpoint
types" and another "checkpoint kinds". You can probably rename
checkpoint-kinds to checkpoint-flags and let the checkpoint-types be
as-it-is.
==
+ <entry><structname>pg_stat_progress_checkpoint</structname><indexterm><primary>pg_stat_progress_checkpoint</primary></indexterm></entry>
+ <entry>One row only, showing the progress of the checkpoint.
Let's make this message consistent with the already existing message
for pg_stat_wal_receiver. See description for pg_stat_wal_receiver
view in "Dynamic Statistics Views" table.
==
[local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
-[ RECORD 1 ]-----+-------------------------------------
pid | 22043
type | checkpoint
kind | immediate force wait requested time
I think the output in the kind column can be displayed as {immediate,
force, wait, requested, time}. By the way these are all checkpoint
flags so it is better to display it as checkpoint flags instead of
checkpoint kind as mentioned in one of my previous comments.
==
[local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
-[ RECORD 1 ]-----+-------------------------------------
pid | 22043
type | checkpoint
kind | immediate force wait requested time
start_lsn | 0/14C60F8
start_time | 2022-03-03 18:59:56.018662+05:30
phase | performing two phase checkpoint
This is the output I see when the checkpointer process has come out of
the two phase checkpoint and is currently writing checkpoint xlog
records and doing other stuff like updating control files etc. Is this
okay?
==
The output of log_checkpoint shows the number of buffers written is 3
whereas the output of pg_stat_progress_checkpoint shows it as 0. See
below:
2022-03-03 20:04:45.643 IST [22043] LOG: checkpoint complete: wrote 3
buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled;
write=24.652 s, sync=104.256 s, total=3889.625 s; sync files=2,
longest=0.011 s, average=0.008 s; distance=0 kB, estimate=0 kB
--
[local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
-[ RECORD 1 ]-----+-------------------------------------
pid | 22043
type | checkpoint
kind | immediate force wait requested time
start_lsn | 0/14C60F8
start_time | 2022-03-03 18:59:56.018662+05:30
phase | finalizing
buffers_total | 0
buffers_processed | 0
buffers_written | 0
Any idea why this mismatch?
==
I think we can add a couple of more information to this view -
start_time for buffer write operation and start_time for buffer sync
operation. These are two very time consuming tasks in a checkpoint and
people would find it useful to know how much time is being taken by
the checkpoint in I/O operation phase. thoughts?
--
With Regards,
Ashutosh Sharma.
On Wed, Mar 2, 2022 at 4:45 PM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
Thanks for reviewing.
I suggested upthread to store the starting timeline instead. This way you can
deduce whether it's a restartpoint or a checkpoint, but you can also deduce
other information, like what was the starting WAL.

I don't understand why we need the timeline here to just determine
whether it's a restartpoint or checkpoint.

I'm not saying it's necessary, I'm saying that for the same space usage we can
store something a bit more useful. If no one cares about having the starting
timeline available for no extra cost then sure, let's just store the kind
directly.

Fixed.

2) Can't we just have these checks inside CASE-WHEN-THEN-ELSE blocks
directly instead of the new function pg_stat_get_progress_checkpoint_kind?

+ snprintf(ckpt_kind, MAXPGPATH, "%s%s%s%s%s%s%s%s%s",
+          (flags == 0) ? "unknown" : "",
+          (flags & CHECKPOINT_IS_SHUTDOWN) ? "shutdown " : "",
+          (flags & CHECKPOINT_END_OF_RECOVERY) ? "end-of-recovery " : "",
+          (flags & CHECKPOINT_IMMEDIATE) ? "immediate " : "",
+          (flags & CHECKPOINT_FORCE) ? "force " : "",
+          (flags & CHECKPOINT_WAIT) ? "wait " : "",
+          (flags & CHECKPOINT_CAUSE_XLOG) ? "wal " : "",
+          (flags & CHECKPOINT_CAUSE_TIME) ? "time " : "",
+          (flags & CHECKPOINT_FLUSH_ALL) ? "flush-all" : "");

Fixed.
---

5) Do we need a special phase for this checkpoint operation? I'm not
sure in which cases it will take a long time, but it looks like
there's a wait loop here.

    vxids = GetVirtualXIDsDelayingChkpt(&nvxids);
    if (nvxids > 0)
    {
        do
        {
            pg_usleep(10000L);  /* wait for 10 msec */
        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids));
    }

Yes. It is better to add a separate phase here.
---

Also, how about special phases for SyncPostCheckpoint(),
SyncPreCheckpoint(), InvalidateObsoleteReplicationSlots(),
PreallocXlogFiles() (it currently pre-allocates only 1 WAL file, but
it might increase in future (?)), TruncateSUBTRANS()?

SyncPreCheckpoint() is just incrementing a counter and
PreallocXlogFiles() currently pre-allocates only 1 WAL file. I feel
there is no need to add any phases for these as of now. We can add in
the future if necessary. Added phases for SyncPostCheckpoint(),
InvalidateObsoleteReplicationSlots() and TruncateSUBTRANS().

---

6) SLRU (Simple LRU) isn't a phase here, you can just say
PROGRESS_CHECKPOINT_PHASE_PREDICATE_LOCK_PAGES.

+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+                              PROGRESS_CHECKPOINT_PHASE_SLRU_PAGES);
  CheckPointPredicate();

And :s/checkpointing SLRU pages/checkpointing predicate lock pages
+ WHEN 9 THEN 'checkpointing SLRU pages'

Fixed.

---

7) :s/PROGRESS_CHECKPOINT_PHASE_FILE_SYNC/PROGRESS_CHECKPOINT_PHASE_PROCESS_FILE_SYNC_REQUESTS

I feel PROGRESS_CHECKPOINT_PHASE_FILE_SYNC is a better option here as
it describes the purpose in fewer words.

And :s/WHEN 11 THEN 'performing sync requests'/WHEN 11 THEN 'processing file sync requests'

Fixed.

---

8) :s/Finalizing/finalizing
+ WHEN 14 THEN 'Finalizing'

Fixed.

---

9) :s/checkpointing snapshots/checkpointing logical replication snapshot files
+ WHEN 3 THEN 'checkpointing snapshots'
:s/checkpointing logical rewrite mappings/checkpointing logical replication rewrite mapping files
+ WHEN 4 THEN 'checkpointing logical rewrite mappings'

Fixed.
---

10) I'm not sure if it's discussed, but how about adding the number of
snapshot/mapping files the checkpoint has processed so far in the
file-processing while loops of
CheckPointSnapBuild/CheckPointLogicalRewriteHeap? Sometimes there can
be many logical snapshot or mapping files and users may be interested
in knowing the so-far-processed-file-count.

I had thought about this while sharing the v1 patch and mentioned my
views upthread. I feel it won't give meaningful progress information
(it can be treated as statistics). Hence not included. Thoughts?

As mentioned upthread, there can be multiple backends that request a
checkpoint, so unless we want to store an array of pids we should store the
number of backends that are waiting for a new checkpoint.

Yeah, you are right. Let's not go down that path of storing an array of
pids. I don't see a strong use-case for the pid of the process
requesting the checkpoint. If required, we can add it later once the
pg_stat_progress_checkpoint view gets in.

I don't think that's really necessary to give the pid list.
If you requested a new checkpoint, it doesn't matter if it's only your backend
that triggered it, another backend or a few dozen others; the result will be the
same and you have the information that the request has been seen. We could
store just a bool for that, but having a number instead also gives a bit more
information and may allow you to detect some broken logic in your client code
if it keeps increasing.

It's a good metric to show in the view but the information is not
readily available. Additional code is required to calculate the number
of requests. Is it worth doing that? I feel this can be added later if
required.

Please find the v4 patch attached and share your thoughts.
Thanks & Regards,
Nitin Jadhav

On Tue, Mar 1, 2022 at 2:27 PM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:

3) Why do we need this extra calculation for start_lsn? Do you ever
see a negative LSN or something here?

+ ('0/0'::pg_lsn + (
+     CASE
+         WHEN (s.param3 < 0) THEN pow((2)::numeric, (64)::numeric)
+         ELSE (0)::numeric
+     END + (s.param3)::numeric)) AS start_lsn,

Yes: an LSN can take up all of a uint64, whereas the pgstat column is a
bigint type; thus the signed int64. This cast is OK as it wraps
around, but that means we have to take care to correctly display the
LSN when it is > 0x7FFF_FFFF_FFFF_FFFF, which is what we do here using
the special-casing for negative values.

Yes. The extra calculation is required here as we are storing a uint64
value in a variable of type int64. When we convert uint64 to int64
the bit pattern is preserved (so no data is lost). The high-order
bit becomes the sign bit, and if the sign bit is set, both the sign and
magnitude of the value change. To safely get back the actual uint64 value
that was assigned, we need the above calculation.

4) Can't you use timestamptz_in(to_char(s.param4)) instead of
pg_stat_get_progress_checkpoint_start_time? I don't quite understand
the reasoning for having this function, and it's named as *checkpoint*
when it doesn't do anything specific to the checkpoint at all.

I hadn't thought of using the types' in/out functions, but it looks
like the timestamp IO functions use a formatted time string, which won't
work with the epoch-based timestamp stored in the view.

There is a variation of to_timestamp() which takes a UNIX epoch (float8)
as an argument and converts it to timestamptz, but we cannot directly
call this function with s.param4.

    TimestampTz
    GetCurrentTimestamp(void)
    {
        TimestampTz result;
        struct timeval tp;

        gettimeofday(&tp, NULL);

        result = (TimestampTz) tp.tv_sec -
            ((POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) * SECS_PER_DAY);
        result = (result * USECS_PER_SEC) + tp.tv_usec;

        return result;
    }

s.param4 contains the output of the above function
(GetCurrentTimestamp()), which returns a Postgres epoch, but
to_timestamp() expects a UNIX epoch as input. So some calculation is
required here. I feel the SQL 'to_timestamp(946684800 +
(s.param4::float / 1000000)) AS start_time' works fine. The value
'946684800' is equal to ((POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) *
SECS_PER_DAY). I am not sure whether it is good practice to use it this
way. Kindly share your thoughts.

Thanks & Regards,
Nitin Jadhav

On Mon, Feb 28, 2022 at 6:40 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

On Sun, 27 Feb 2022 at 16:14, Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

3) Why do we need this extra calculation for start_lsn? Do you ever
see a negative LSN or something here?

+ ('0/0'::pg_lsn + (
+     CASE
+         WHEN (s.param3 < 0) THEN pow((2)::numeric, (64)::numeric)
+         ELSE (0)::numeric
+     END + (s.param3)::numeric)) AS start_lsn,

Yes: an LSN can take up all of a uint64, whereas the pgstat column is a
bigint type; thus the signed int64. This cast is OK as it wraps
around, but that means we have to take care to correctly display the
LSN when it is > 0x7FFF_FFFF_FFFF_FFFF, which is what we do here using
the special-casing for negative values.

As to whether it is reasonable: generating 16GB of WAL every second
(2^34 bytes/sec) is probably not impossible (cpu <> memory bandwidth
has been > 20GB/sec for a while), and that leaves you 2^29 seconds of
database runtime, or about 17 years. Seeing that a cluster can be
`pg_upgrade`d (which doesn't reset the cluster LSN) since PG 9.0 from at
least version PG 8.4.0 (2009) (and through pg_migrator, from 8.3.0),
we can assume that clusters hitting LSN=2^63 will be a reasonable
possibility within the next few years. As the lifespan of a PG release
is about 5 years, it doesn't seem impossible that there will be actual
clusters that are going to hit this naturally in the lifespan of PG15.

It is also possible that someone fat-fingers pg_resetwal and creates
a cluster with LSN >= 2^63, resulting in negative values in the
s.param3 field. Not likely, but we can force such situations, and as
such we should handle them gracefully.

4) Can't you use timestamptz_in(to_char(s.param4)) instead of
pg_stat_get_progress_checkpoint_start_time? I don't quite understand
the reasoning for having this function, and it's named as *checkpoint*
when it doesn't do anything specific to the checkpoint at all.

I hadn't thought of using the types' in/out functions, but it looks
like the timestamp IO functions use a formatted time string, which won't
work with the epoch-based timestamp stored in the view.

Having 3 unnecessary functions that aren't useful to the users at all
in proc.dat will simply eat up the function OIDs IMO. Hence, I suggest
let's try to do without extra functions.

I agree that (1) could be simplified, or at least fully expressed in
SQL without exposing too many internals. If we're fine with exposing
internals like flags and type layouts, then (2), and arguably (4), can
be expressed in SQL as well.

-Matthias
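Matthias's back-of-envelope numbers above can be checked directly; a tiny illustrative calculation (not patch code, and the sustained 16GB/s WAL rate is his hypothetical worst case):

```python
# Check the arithmetic: at 2^34 bytes of WAL per second, how long until
# the LSN crosses 2^63 and starts showing up negative in a signed bigint?
WAL_RATE = 2**34          # hypothetical 16 GB of WAL per second
SIGNED_BOUNDARY = 2**63   # first LSN that wraps negative in an int64

seconds = SIGNED_BOUNDARY // WAL_RATE    # 2^29 seconds of runtime
years = seconds / (365.25 * 24 * 3600)   # roughly 17 years
```

So the ~17-year figure in the mail holds for that rate; lower WAL rates push the boundary out proportionally.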
On Wed, Mar 2, 2022 at 7:15 PM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
As mentioned upthread, there can be multiple backends that request a
checkpoint, so unless we want to store an array of pids we should store the
number of backends that are waiting for a new checkpoint.

It's a good metric to show in the view but the information is not
readily available. Additional code is required to calculate the number
of requests. Is it worth doing that? I feel this can be added later if
required.
Is it that hard or costly to do? Just sending a message to increment
the stat counter in RequestCheckpoint() would be enough.
Also, unless I'm missing something it's still only showing the initial
checkpoint flags, so it's *not* showing what the checkpoint is really
doing, only what the checkpoint may be doing if nothing else happens.
It just feels wrong. You could even use that ckpt_flags info to know
that at least one backend has requested a new checkpoint, if you don't
want to have a number of backends.
Thanks for reviewing.
6) How about shutdown and end-of-recovery checkpoint? Are you planning
to have an ereport_startup_progress mechanism as 0002?
I thought of including it earlier, then I felt let's first make the
current patch stable. Once all the fields are properly decided and the
patch gets in, then we can easily extend the functionality to the
shutdown and end-of-recovery cases. I have also observed that the timer
functionality won't work properly in the case of shutdown as we are doing
an immediate checkpoint. So this needs a lot of discussion and I would
like to handle it on a separate thread.
---
7) I think you don't need to call checkpoint_progress_start and
pgstat_progress_update_param, any other progress reporting function
for shutdown and end-of-recovery checkpoint right?
I had included the guards earlier and then removed later based on the
discussion upthread.
---
[local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
-[ RECORD 1 ]-----+-------------------------------------
pid | 22043
type | checkpoint
kind | immediate force wait requested time
start_lsn | 0/14C60F8
start_time | 2022-03-03 18:59:56.018662+05:30
phase | performing two phase checkpoint

This is the output I see when the checkpointer process has come out of
the two phase checkpoint and is currently writing checkpoint xlog
records and doing other stuff like updating control files etc. Is this
okay?
The idea behind choosing the phases is based on the functionality
which takes a longer time to execute. Since the work from the two phase
checkpoint until the post checkpoint cleanup won't take much time to
execute, I have not added any additional phase for that. But I also
agree that this gives wrong information to the user. How about
mentioning the phase information at the end of each phase, like
"initializing", "initialization done", ..., "two phase checkpoint done",
"post checkpoint cleanup done", ..., "finalizing"? Except for the first
phase ("initializing") and the last phase ("finalizing"), all the other
phases describe the end of a certain operation. I feel this gives correct
information even though the phase name/description does not represent
the entire code block between two phases. For example, if the current
phase is "two phase checkpoint done", then the user can infer that
the checkpointer has finished everything up to the two phase checkpoint
and is doing the other stuff that comes after it. Thoughts?
The output of log_checkpoint shows the number of buffers written is 3
whereas the output of pg_stat_progress_checkpoint shows it as 0. See
below:

2022-03-03 20:04:45.643 IST [22043] LOG: checkpoint complete: wrote 3
buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled;
write=24.652 s, sync=104.256 s, total=3889.625 s; sync files=2,
longest=0.011 s, average=0.008 s; distance=0 kB, estimate=0 kB

--

[local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
-[ RECORD 1 ]-----+-------------------------------------
pid | 22043
type | checkpoint
kind | immediate force wait requested time
start_lsn | 0/14C60F8
start_time | 2022-03-03 18:59:56.018662+05:30
phase | finalizing
buffers_total | 0
buffers_processed | 0
buffers_written | 0

Any idea why this mismatch?
Good catch. In BufferSync() we have 'num_to_scan' (buffers_total)
which indicates the total number of buffers to be processed. Based on
that, the 'buffers_processed' and 'buffers_written' counters get
incremented, i.e. these values may reach up to 'buffers_total'. The
current pg_stat_progress_checkpoint view supports the above information.
There is another place where 'ckpt_bufs_written' gets incremented (in
SlruInternalWritePage()). This increment is over and above the
'buffers_total' value and it is included in the server log message
(checkpoint end) but not included in the view. I am a bit confused here.
If we include this increment in the view then we cannot calculate the
exact 'buffers_total' beforehand. Can we increment 'buffers_total' as
well when 'ckpt_bufs_written' gets incremented, so that we can match the
behaviour with the checkpoint-end message? Please share your thoughts.
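The mismatch described above can be modelled with a toy sketch (this is not PostgreSQL code; the function and its parameters are invented for illustration): the view's buffers_written is bounded by buffers_total as computed in BufferSync(), while the checkpoint-complete log line also counts SLRU page writes made via SlruInternalWritePage().

```python
# Toy model (not actual PostgreSQL code) of the counter mismatch: the view
# only tracks shared-buffer writes against a precomputed total, while the
# log message counts every increment of ckpt_bufs_written, including SLRU
# pages written outside BufferSync().
def run_checkpoint(dirty_shared_buffers: int, slru_pages_written: int):
    buffers_total = dirty_shared_buffers         # num_to_scan in BufferSync()
    view_buffers_written = dirty_shared_buffers  # capped at buffers_total
    log_ckpt_bufs_written = dirty_shared_buffers + slru_pages_written
    return buffers_total, view_buffers_written, log_ckpt_bufs_written

# 0 dirty shared buffers but 3 SLRU pages written: the view shows 0/0 while
# the checkpoint-complete log line says "wrote 3 buffers", as in the report.
total, view_written, log_written = run_checkpoint(0, 3)
```

This matches the observed report: the log's "wrote 3 buffers" and the view's buffers_written = 0 are counting different sets of writes.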
---
I think we can add a couple of more information to this view -
start_time for buffer write operation and start_time for buffer sync
operation. These are two very time consuming tasks in a checkpoint and
people would find it useful to know how much time is being taken by
the checkpoint in I/O operation phase. thoughts?
I felt that detailed progress is already shown for these 2 phases of
the checkpoint via 'buffers_processed', 'buffers_written' and
'files_synced'. Hence I did not think about adding start times; if they
are really required, then I can add them.
Is it that hard or costly to do? Just sending a message to increment
the stat counter in RequestCheckpoint() would be enough.

Also, unless I'm missing something it's still only showing the initial
checkpoint flags, so it's *not* showing what the checkpoint is really
doing, only what the checkpoint may be doing if nothing else happens.
It just feels wrong. You could even use that ckpt_flags info to know
that at least one backend has requested a new checkpoint, if you don't
want to have a number of backends.
I think using ckpt_flags to display whether any new requests have been
made or not is a good idea. I will include it in the next patch.
Thanks & Regards,
Nitin Jadhav
On Thu, Mar 3, 2022 at 11:58 PM Julien Rouhaud <rjuju123@gmail.com> wrote:

On Wed, Mar 2, 2022 at 7:15 PM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:

As mentioned upthread, there can be multiple backends that request a
checkpoint, so unless we want to store an array of pids we should store the
number of backends that are waiting for a new checkpoint.

It's a good metric to show in the view but the information is not
readily available. Additional code is required to calculate the number
of requests. Is it worth doing that? I feel this can be added later if
required.

Is it that hard or costly to do? Just sending a message to increment
the stat counter in RequestCheckpoint() would be enough.

Also, unless I'm missing something it's still only showing the initial
checkpoint flags, so it's *not* showing what the checkpoint is really
doing, only what the checkpoint may be doing if nothing else happens.
It just feels wrong. You could even use that ckpt_flags info to know
that at least one backend has requested a new checkpoint, if you don't
want to have a number of backends.
Please don't mix comments from multiple reviewers into one thread.
It's hard to understand which comments are mine or Julien's or from
others. Can you please respond to the email from each of us separately
with an inline response. That will be helpful to understand your
thoughts on our review comments.
--
With Regards,
Ashutosh Sharma.
On Fri, Mar 4, 2022 at 4:59 PM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
1) Can we convert the below into pgstat_progress_update_multi_param, just
to avoid function calls?
pgstat_progress_update_param(PROGRESS_CHECKPOINT_LSN, checkPoint.redo);
pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,

2) Why are we not having a special phase for CheckPointReplicationOrigin
as it does a good bunch of work (writing to disk, XLogFlush,
durable_rename), especially when max_replication_slots is large?

3) I don't think "requested" is necessary here as it doesn't add any
value or it's not a checkpoint kind or such, you can remove it.

4) s:/'recycling old XLOG files'/'recycling old WAL files'
+ WHEN 16 THEN 'recycling old XLOG files'

5) Can we place the CREATE VIEW pg_stat_progress_checkpoint AS definition
next to pg_stat_progress_copy in system_views.sql? It looks like all
the progress reporting views are next to each other.
I will take care in the next patch.
---
6) How about shutdown and end-of-recovery checkpoint? Are you planning
to have an ereport_startup_progress mechanism as 0002?
I thought of including it earlier but then I felt let's first make the
current patch stable. Once all the fields are properly decided and the
patch gets in, then we can easily extend the functionality to the
shutdown and end-of-recovery cases. I have also observed that the timer
functionality won't work properly in case of shutdown as we are doing
an immediate checkpoint. So this needs a lot of discussion and I would
like to handle it on a separate thread.
---
7) I think you don't need to call checkpoint_progress_start and
pgstat_progress_update_param, any other progress reporting function
for shutdown and end-of-recovery checkpoint right?
I had included the guards earlier and then removed them later based on
the discussion upthread.
---
8) Not for all kinds of checkpoints, right? pg_stat_progress_checkpoint
can't show a progress report for shutdown and end-of-recovery
checkpoints; I think you need to specify that here in wal.sgml and
checkpoint.sgml.
+ command <command>CHECKPOINT</command>. The checkpointer process running the
+ checkpoint will report its progress in the
+ <structname>pg_stat_progress_checkpoint</structname> view. See
+ <xref linkend="checkpoint-progress-reporting"/> for details.

9) Can you add a test case for the pg_stat_progress_checkpoint view? I
think it's good to add one. See below for reference:
-- Add a trigger to catch and print the contents of the catalog view
-- pg_stat_progress_copy during data insertion. This allows to test
-- the validation of some progress reports for COPY FROM where the trigger
-- would fire.
create function notice_after_tab_progress_reporting() returns trigger AS
$$
declare report record;

10) Typo: it's not "is happens"
+ The checkpoint is happens without delays.

11) Can you be specific about what those "some operations" that forced a
checkpoint are? Maybe like basebackup, createdb or something?
+ The checkpoint is started because some operation forced a checkpoint.

12) Can you be a bit more elaborate here about who waits? Something like
"the backend that requested the checkpoint will wait until its completion" ...
+ Wait for completion before returning.

13) "removing unneeded or flushing needed logical rewrite mapping files"
+ The checkpointer process is currently removing/flushing the logical

14) "old WAL files"
+ The checkpointer process is currently recycling old XLOG files.
I will take care in the next patch.
Thanks & Regards,
Nitin Jadhav
On Wed, Mar 2, 2022 at 11:52 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
On Wed, Mar 2, 2022 at 4:45 PM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:

Also, how about special phases for SyncPostCheckpoint(),
SyncPreCheckpoint(), InvalidateObsoleteReplicationSlots(),
PreallocXlogFiles() (it currently pre-allocates only 1 WAL file, but
that might increase in future (?)), and TruncateSUBTRANS()?

SyncPreCheckpoint() is just incrementing a counter and
PreallocXlogFiles() currently pre-allocates only 1 WAL file. I feel
there is no need to add any phases for these as of now. We can add them
in the future if necessary. Added phases for SyncPostCheckpoint(),
InvalidateObsoleteReplicationSlots() and TruncateSUBTRANS().

Okay.
10) I'm not sure if it's been discussed, but how about adding the number
of snapshot/mapping files the checkpoint has processed so far in the
file processing while loops of
CheckPointSnapBuild/CheckPointLogicalRewriteHeap? Sometimes, there can
be many logical snapshot or mapping files and users may be interested
in knowing the so-far-processed-file-count.

I had thought about this while sharing the v1 patch and mentioned my
views upthread. I feel it won't give meaningful progress information
(it can be treated as statistics). Hence not included. Thoughts?

Okay. If there are any complaints about it we can always add them later.
As mentioned upthread, there can be multiple backends that request a
checkpoint, so unless we want to store an array of pids we should store
the number of backends that are waiting for a new checkpoint.

Yeah, you are right. Let's not go down that path of storing an array of
pids. I don't see a strong use-case for the pid of the process
requesting a checkpoint. If required, we can add it later once the
pg_stat_progress_checkpoint view gets in.

I don't think that's really necessary to give the pid list.
If you requested a new checkpoint, it doesn't matter if it's only your
backend that triggered it, another backend or a few dozen others, the
result will be the same and you have the information that the request
has been seen. We could store just a bool for that but having a number
instead also gives a bit more information and may allow you to detect
some broken logic in your client code if it keeps increasing.

It's a good metric to show in the view but the information is not
readily available. Additional code is required to calculate the number
of requests. Is it worth doing that? I feel this can be added later if
required.

Yes, we can always add it later if required.
Please find the v4 patch attached and share your thoughts.
I reviewed v4 patch, here are my comments:
1) Can we convert the below into pgstat_progress_update_multi_param, just
to avoid function calls?
pgstat_progress_update_param(PROGRESS_CHECKPOINT_LSN, checkPoint.redo);
pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,

2) Why are we not having a special phase for CheckPointReplicationOrigin
as it does a good bunch of work (writing to disk, XLogFlush,
durable_rename), especially when max_replication_slots is large?

3) I don't think "requested" is necessary here as it doesn't add any
value or it's not a checkpoint kind or such, you can remove it.

4) s:/'recycling old XLOG files'/'recycling old WAL files'
+ WHEN 16 THEN 'recycling old XLOG files'

5) Can we place the CREATE VIEW pg_stat_progress_checkpoint AS definition
next to pg_stat_progress_copy in system_views.sql? It looks like all
the progress reporting views are next to each other.

6) How about shutdown and end-of-recovery checkpoints? Are you planning
to have an ereport_startup_progress mechanism as in 0002?

7) I think you don't need to call checkpoint_progress_start,
pgstat_progress_update_param, or any other progress reporting function
for shutdown and end-of-recovery checkpoints, right?

8) Not for all kinds of checkpoints, right? pg_stat_progress_checkpoint
can't show a progress report for shutdown and end-of-recovery
checkpoints; I think you need to specify that here in wal.sgml and
checkpoint.sgml.
+ command <command>CHECKPOINT</command>. The checkpointer process running the
+ checkpoint will report its progress in the
+ <structname>pg_stat_progress_checkpoint</structname> view. See
+ <xref linkend="checkpoint-progress-reporting"/> for details.

9) Can you add a test case for the pg_stat_progress_checkpoint view? I
think it's good to add one. See below for reference:
-- Add a trigger to catch and print the contents of the catalog view
-- pg_stat_progress_copy during data insertion. This allows to test
-- the validation of some progress reports for COPY FROM where the trigger
-- would fire.
create function notice_after_tab_progress_reporting() returns trigger AS
$$
declare report record;

10) Typo: it's not "is happens"
+ The checkpoint is happens without delays.

11) Can you be specific about what those "some operations" that forced a
checkpoint are? Maybe like basebackup, createdb or something?
+ The checkpoint is started because some operation forced a checkpoint.

12) Can you be a bit more elaborate here about who waits? Something like
"the backend that requested the checkpoint will wait until its completion" ...
+ Wait for completion before returning.

13) "removing unneeded or flushing needed logical rewrite mapping files"
+ The checkpointer process is currently removing/flushing the logical

14) "old WAL files"
+ The checkpointer process is currently recycling old XLOG files.

Regards,
Bharath Rupireddy.
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of checkpoint. See <xref linkend="checkpoint-types"/>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>kind</structfield> <type>text</type>
+ </para>
+ <para>
+ Kind of checkpoint. See <xref linkend="checkpoint-kinds"/>.
+ </para></entry>
+ </row>

This looks a bit confusing. Two columns, one with the name "checkpoint
types" and another "checkpoint kinds". You can probably rename
checkpoint-kinds to checkpoint-flags and let the checkpoint-types be
as-it-is.
Makes sense. I will change in the next patch.
---
+ <entry><structname>pg_stat_progress_checkpoint</structname><indexterm><primary>pg_stat_progress_checkpoint</primary></indexterm></entry>
+ <entry>One row only, showing the progress of the checkpoint.

Let's make this message consistent with the already existing message
for pg_stat_wal_receiver. See the description for the
pg_stat_wal_receiver view in the "Dynamic Statistics Views" table.
You want me to change "One row only" to "Only one row" ? If that is
the case then for other views in the "Collected Statistics Views"
table, it is referred as "One row only". Let me know if you are
pointing out something else.
---
[local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
-[ RECORD 1 ]-----+-------------------------------------
pid | 22043
type | checkpoint
kind | immediate force wait requested time

I think the output in the kind column can be displayed as {immediate,
force, wait, requested, time}. By the way, these are all checkpoint
flags so it is better to display it as checkpoint flags instead of
checkpoint kind, as mentioned in one of my previous comments.
I will update in the next patch.
---
[local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
-[ RECORD 1 ]-----+-------------------------------------
pid | 22043
type | checkpoint
kind | immediate force wait requested time
start_lsn | 0/14C60F8
start_time | 2022-03-03 18:59:56.018662+05:30
phase | performing two phase checkpoint

This is the output I see when the checkpointer process has come out of
the two phase checkpoint and is currently writing checkpoint xlog
records and doing other stuff like updating control files etc. Is this
okay?
The idea behind choosing the phases is based on the functionality
which takes longer time to execute. Since after two phase checkpoint
till post checkpoint cleanup won't take much time to execute, I have
not added any additional phase for that. But I also agree that this
gives wrong information to the user. How about mentioning the phase
information at the end of each phase like "Initializing",
"Initialization done", ..., "two phase checkpoint done", "post
checkpoint cleanup done", .., "finalizing". Except for the first phase
("initializing") and last phase ("finalizing"), all the other phases
describe the end of a certain operation. I feel this gives correct
information even though the phase name/description does not represent
the entire code block between two phases. For example if the current
phase is "two phase checkpoint done". Then the user can infer that
the checkpointer has done till two phase checkpoint and it is doing
other stuff that are after that. Thoughts?
---
The output of log_checkpoint shows the number of buffers written is 3
whereas the output of pg_stat_progress_checkpoint shows it as 0. See
below:

2022-03-03 20:04:45.643 IST [22043] LOG: checkpoint complete: wrote 3
buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled;
write=24.652 s, sync=104.256 s, total=3889.625 s; sync files=2,
longest=0.011 s, average=0.008 s; distance=0 kB, estimate=0 kB
--
[local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
-[ RECORD 1 ]-----+-------------------------------------
pid | 22043
type | checkpoint
kind | immediate force wait requested time
start_lsn | 0/14C60F8
start_time | 2022-03-03 18:59:56.018662+05:30
phase | finalizing
buffers_total | 0
buffers_processed | 0
buffers_written | 0

Any idea why this mismatch?
Good catch. In BufferSync() we have 'num_to_scan' (buffers_total)
which indicates the total number of buffers to be processed. Based on
that, the 'buffers_processed' and 'buffers_written' counter gets
incremented. I meant these values may reach up to 'buffers_total'. The
current pg_stat_progress_view supports the above information. There is
another place where 'ckpt_bufs_written' gets incremented (in
SlruInternalWritePage()). This increment is above the 'buffers_total'
value and it is included in the server log message (checkpoint end)
and not included in the view. I am a bit confused here. If we include
this increment in the view then we cannot calculate the exact
'buffers_total' beforehand. Can we increment the 'buffers_total' also
when 'ckpt_bufs_written' gets incremented so that we can match the
behaviour with the checkpoint end message? Please share your thoughts.
---
I think we can add a couple more pieces of information to this view -
start_time for the buffer write operation and start_time for the buffer
sync operation. These are two very time consuming tasks in a checkpoint
and people would find it useful to know how much time is being taken by
the checkpoint in the I/O operation phase. Thoughts?

The detailed progress is already shown for these 2 phases of the
checkpoint via 'buffers_processed', 'buffers_written' and
'files_synced'. Hence I did not think about adding start times. If
they are really required, then I can add them.
Thanks & Regards,
Nitin Jadhav
On Thu, Mar 3, 2022 at 8:30 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Here are some of my review comments on the latest patch:
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of checkpoint. See <xref linkend="checkpoint-types"/>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>kind</structfield> <type>text</type>
+ </para>
+ <para>
+ Kind of checkpoint. See <xref linkend="checkpoint-kinds"/>.
+ </para></entry>
+ </row>

This looks a bit confusing. Two columns, one with the name "checkpoint
types" and another "checkpoint kinds". You can probably rename
checkpoint-kinds to checkpoint-flags and let the checkpoint-types be
as-it-is.

==

+ <entry><structname>pg_stat_progress_checkpoint</structname><indexterm><primary>pg_stat_progress_checkpoint</primary></indexterm></entry>
+ <entry>One row only, showing the progress of the checkpoint.

Let's make this message consistent with the already existing message
for pg_stat_wal_receiver. See the description for the
pg_stat_wal_receiver view in the "Dynamic Statistics Views" table.

==
[local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
-[ RECORD 1 ]-----+-------------------------------------
pid | 22043
type | checkpoint
kind | immediate force wait requested time

I think the output in the kind column can be displayed as {immediate,
force, wait, requested, time}. By the way, these are all checkpoint
flags so it is better to display it as checkpoint flags instead of
checkpoint kind, as mentioned in one of my previous comments.

==
[local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
-[ RECORD 1 ]-----+-------------------------------------
pid | 22043
type | checkpoint
kind | immediate force wait requested time
start_lsn | 0/14C60F8
start_time | 2022-03-03 18:59:56.018662+05:30
phase | performing two phase checkpoint

This is the output I see when the checkpointer process has come out of
the two phase checkpoint and is currently writing checkpoint xlog
records and doing other stuff like updating control files etc. Is this
okay?

==
The output of log_checkpoint shows the number of buffers written is 3
whereas the output of pg_stat_progress_checkpoint shows it as 0. See
below:

2022-03-03 20:04:45.643 IST [22043] LOG: checkpoint complete: wrote 3
buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled;
write=24.652 s, sync=104.256 s, total=3889.625 s; sync files=2,
longest=0.011 s, average=0.008 s; distance=0 kB, estimate=0 kB
--
[local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
-[ RECORD 1 ]-----+-------------------------------------
pid | 22043
type | checkpoint
kind | immediate force wait requested time
start_lsn | 0/14C60F8
start_time | 2022-03-03 18:59:56.018662+05:30
phase | finalizing
buffers_total | 0
buffers_processed | 0
buffers_written | 0

Any idea why this mismatch?
==
I think we can add a couple of more information to this view -
start_time for buffer write operation and start_time for buffer sync
operation. These are two very time consuming tasks in a checkpoint and
people would find it useful to know how much time is being taken by
the checkpoint in the I/O operation phase. Thoughts?

--
With Regards,
Ashutosh Sharma.

On Wed, Mar 2, 2022 at 4:45 PM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:

Thanks for reviewing.
I suggested upthread to store the starting timeline instead. This way
you can deduce whether it's a restartpoint or a checkpoint, but you can
also deduce other information, like what the starting WAL was.

I don't understand why we need the timeline here just to determine
whether it's a restartpoint or a checkpoint.

I'm not saying it's necessary, I'm saying that for the same space usage
we can store something a bit more useful. If no one cares about having
the starting timeline available for no extra cost then sure, let's just
store the kind directly.

Fixed.
2) Can't we just have these checks inside CASE-WHEN-THEN-ELSE blocks
directly instead of the new function pg_stat_get_progress_checkpoint_kind?
+ snprintf(ckpt_kind, MAXPGPATH, "%s%s%s%s%s%s%s%s%s",
+ (flags == 0) ? "unknown" : "",
+ (flags & CHECKPOINT_IS_SHUTDOWN) ? "shutdown " : "",
+ (flags & CHECKPOINT_END_OF_RECOVERY) ? "end-of-recovery " : "",
+ (flags & CHECKPOINT_IMMEDIATE) ? "immediate " : "",
+ (flags & CHECKPOINT_FORCE) ? "force " : "",
+ (flags & CHECKPOINT_WAIT) ? "wait " : "",
+ (flags & CHECKPOINT_CAUSE_XLOG) ? "wal " : "",
+ (flags & CHECKPOINT_CAUSE_TIME) ? "time " : "",
+ (flags & CHECKPOINT_FLUSH_ALL) ? "flush-all" : "");

Fixed.
---

5) Do we need a special phase for this checkpoint operation? I'm not
sure in which cases it will take a long time, but it looks like
there's a wait loop here.
vxids = GetVirtualXIDsDelayingChkpt(&nvxids);
if (nvxids > 0)
{
    do
    {
        pg_usleep(10000L); /* wait for 10 msec */
    } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids));
}

Yes. It is better to add a separate phase here.
---

Also, how about special phases for SyncPostCheckpoint(),
SyncPreCheckpoint(), InvalidateObsoleteReplicationSlots(),
PreallocXlogFiles() (it currently pre-allocates only 1 WAL file, but
that might increase in future (?)), and TruncateSUBTRANS()?

SyncPreCheckpoint() is just incrementing a counter and
PreallocXlogFiles() currently pre-allocates only 1 WAL file. I feel
there is no need to add any phases for these as of now. We can add them
in the future if necessary. Added phases for SyncPostCheckpoint(),
InvalidateObsoleteReplicationSlots() and TruncateSUBTRANS().
---

6) SLRU (Simple LRU) isn't a phase here, you can just say
PROGRESS_CHECKPOINT_PHASE_PREDICATE_LOCK_PAGES.
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_SLRU_PAGES);
CheckPointPredicate();
And :s/checkpointing SLRU pages/checkpointing predicate lock pages
+ WHEN 9 THEN 'checkpointing SLRU pages'

Fixed.
---

7) :s/PROGRESS_CHECKPOINT_PHASE_FILE_SYNC/PROGRESS_CHECKPOINT_PHASE_PROCESS_FILE_SYNC_REQUESTS

I feel PROGRESS_CHECKPOINT_PHASE_FILE_SYNC is a better option here as
it describes the purpose in fewer words.

And :s/WHEN 11 THEN 'performing sync requests'/WHEN 11 THEN
'processing file sync requests'

Fixed.
---

8) :s/Finalizing/finalizing
+ WHEN 14 THEN 'Finalizing'

Fixed.
---

9) :s/checkpointing snapshots/checkpointing logical replication snapshot files
+ WHEN 3 THEN 'checkpointing snapshots'
:s/checkpointing logical rewrite mappings/checkpointing logical replication rewrite mapping files
+ WHEN 4 THEN 'checkpointing logical rewrite mappings'

Fixed.
---

10) I'm not sure if it's been discussed, but how about adding the number
of snapshot/mapping files the checkpoint has processed so far in the
file processing while loops of
CheckPointSnapBuild/CheckPointLogicalRewriteHeap? Sometimes, there can
be many logical snapshot or mapping files and users may be interested
in knowing the so-far-processed-file-count.

I had thought about this while sharing the v1 patch and mentioned my
views upthread. I feel it won't give meaningful progress information
(it can be treated as statistics). Hence not included. Thoughts?

As mentioned upthread, there can be multiple backends that request a
checkpoint, so unless we want to store an array of pids we should store
the number of backends that are waiting for a new checkpoint.

Yeah, you are right. Let's not go down that path of storing an array of
pids. I don't see a strong use-case for the pid of the process
requesting a checkpoint. If required, we can add it later once the
pg_stat_progress_checkpoint view gets in.

I don't think that's really necessary to give the pid list.
If you requested a new checkpoint, it doesn't matter if it's only your
backend that triggered it, another backend or a few dozen others, the
result will be the same and you have the information that the request
has been seen. We could store just a bool for that but having a number
instead also gives a bit more information and may allow you to detect
some broken logic in your client code if it keeps increasing.

It's a good metric to show in the view but the information is not
readily available. Additional code is required to calculate the number
of requests. Is it worth doing that? I feel this can be added later if
required.

Please find the v4 patch attached and share your thoughts.
Thanks & Regards,
Nitin Jadhav

On Tue, Mar 1, 2022 at 2:27 PM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:

3) Why do we need this extra calculation for start_lsn? Do you ever see
a negative LSN or something here?
+ ('0/0'::pg_lsn + (
+ CASE
+ WHEN (s.param3 < 0) THEN pow((2)::numeric, (64)::numeric)
+ ELSE (0)::numeric
+ END + (s.param3)::numeric)) AS start_lsn,

Yes: an LSN can take up all of a uint64, whereas the pgstat column is a
bigint type; thus the signed int64. This cast is OK as it wraps
around, but that means we have to take care to correctly display the
LSN when it is > 0x7FFF_FFFF_FFFF_FFFF, which is what we do here using
the special-casing for negative values.

Yes. The extra calculation is required here as we are storing a uint64
value in a variable of type int64. When we convert uint64 to int64
the bit pattern is preserved (so no data is lost). The high-order
bit becomes the sign bit and if the sign bit is set, both the sign and
magnitude of the value change. To safely get the actual uint64 value
whatever was assigned, we need the above calculations.

4) Can't you use timestamptz_in(to_char(s.param4)) instead of
pg_stat_get_progress_checkpoint_start_time? I don't quite understand
the reasoning for having this function and it's named as *checkpoint*
when it doesn't do anything specific to the checkpoint at all?

I hadn't thought of using the types' inout functions, but it looks
like timestamp IO functions use a formatted timestring, which won't
work with the epoch-based timestamp stored in the view.

There is a variation of to_timestamp() which takes UNIX epoch (float8)
as an argument and converts it to timestamptz but we cannot directly
call this function with S.param4.

TimestampTz
GetCurrentTimestamp(void)
{
TimestampTz result;
struct timeval tp;

gettimeofday(&tp, NULL);
result = (TimestampTz) tp.tv_sec -
((POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) * SECS_PER_DAY);
result = (result * USECS_PER_SEC) + tp.tv_usec;

return result;
}

S.param4 contains the output of the above function
(GetCurrentTimestamp()) which returns Postgres epoch but the
to_timestamp() expects UNIX epoch as input. So some calculation is
required here. I feel the SQL 'to_timestamp(946684800 +
(S.param4::float / 1000000)) AS start_time' works fine. The value
'946684800' is equal to ((POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) *
SECS_PER_DAY). I am not sure whether it is good practice to use this
way. Kindly share your thoughts.

Thanks & Regards,
Nitin Jadhav

On Mon, Feb 28, 2022 at 6:40 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

On Sun, 27 Feb 2022 at 16:14, Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

3) Why do we need this extra calculation for start_lsn? Do you ever see
a negative LSN or something here?
+ ('0/0'::pg_lsn + (
+ CASE
+ WHEN (s.param3 < 0) THEN pow((2)::numeric, (64)::numeric)
+ ELSE (0)::numeric
+ END + (s.param3)::numeric)) AS start_lsn,

Yes: an LSN can take up all of a uint64, whereas the pgstat column is a
bigint type; thus the signed int64. This cast is OK as it wraps
around, but that means we have to take care to correctly display the
LSN when it is > 0x7FFF_FFFF_FFFF_FFFF, which is what we do here using
the special-casing for negative values.

As to whether it is reasonable: Generating 16GB of WAL every second
(2^34 bytes /sec) is probably not impossible (cpu <> memory bandwidth
has been > 20GB/sec for a while); and that leaves you 2^29 seconds of
database runtime; or about 17 years. Seeing that a cluster can be
`pg_upgrade`d (which doesn't reset cluster LSN) since PG 9.0 from at
least version PG 8.4.0 (2009) (and through pg_migrator, from 8.3.0)),
we can assume that clusters hitting LSN=2^63 will be a reasonable
possibility within the next few years. As the lifespan of a PG release
is about 5 years, it doesn't seem impossible that there will be actual
clusters that are going to hit this naturally in the lifespan of PG15.

It is also possible that someone fat-fingers pg_resetwal; and creates
a cluster with LSN >= 2^63; resulting in negative values in the
s.param3 field. Not likely, but we can force such situations; and as
such we should handle that gracefully.

4) Can't you use timestamptz_in(to_char(s.param4)) instead of
pg_stat_get_progress_checkpoint_start_time? I don't quite understand
the reasoning for having this function and it's named as *checkpoint*
when it doesn't do anything specific to the checkpoint at all?

I hadn't thought of using the types' inout functions, but it looks
like timestamp IO functions use a formatted timestring, which won't
work with the epoch-based timestamp stored in the view.

Having 3 unnecessary functions that aren't useful to the users at all
in proc.dat will simply eatup the function oids IMO. Hence, I suggest
let's try to do without extra functions.

I agree that (1) could be simplified, or at least fully expressed in
SQL without exposing too many internals. If we're fine with exposing
internals like flags and type layouts, then (2), and arguably (4), can
be expressed in SQL as well.

-Matthias
11) Can you be specific about what those "some operations" that forced a
checkpoint are? Maybe like basebackup, createdb or something?
+ The checkpoint is started because some operation forced a checkpoint.

I will take care in the next patch.
I feel mentioning/listing the specific operations makes it difficult to
maintain the document. If we add any new functionality in the future
which needs a forced checkpoint, then there is a high chance that we
will miss updating it here. Hence I modified it to "The checkpoint is
started because some operation (for which the checkpoint is necessary)
forced the checkpoint".
Fixed other comments as per the discussion above.
Please find the v5 patch attached and share your thoughts.
Thanks & Regards,
Nitin Jadhav
On Mon, Mar 7, 2022 at 7:45 PM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
1) Can we convert below into pgstat_progress_update_multi_param, just
to avoid function calls?
pgstat_progress_update_param(PROGRESS_CHECKPOINT_LSN, checkPoint.redo);
pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,

2) Why are we not having a special phase for CheckPointReplicationOrigin
as it does a good bunch of work (writing to disk, XLogFlush,
durable_rename), especially when max_replication_slots is large?

3) I don't think "requested" is necessary here as it doesn't add any
value or it's not a checkpoint kind or such, you can remove it.

4) s:/'recycling old XLOG files'/'recycling old WAL files'
+ WHEN 16 THEN 'recycling old XLOG files'

5) Can we place the CREATE VIEW pg_stat_progress_checkpoint AS definition
next to pg_stat_progress_copy in system_views.sql? It looks like all
the progress reporting views are next to each other.

I will take care in the next patch.
---6) How about shutdown and end-of-recovery checkpoint? Are you planning
to have an ereport_startup_progress mechanism as 0002?I thought of including it earlier then I felt lets first make the
current patch stable. Once all the fields are properly decided and the
patch gets in then we can easily extend the functionality to shutdown
and end-of-recovery cases. I have also observed that the timer
functionality wont work properly in case of shutdown as we are doing
an immediate checkpoint. So this needs a lot of discussion and I would
like to handle this on a separate thread.
---7) I think you don't need to call checkpoint_progress_start and
pgstat_progress_update_param, any other progress reporting function
for shutdown and end-of-recovery checkpoint right?I had included the guards earlier and then removed later based on the
discussion upthread.
---8) Not for all kinds of checkpoints right? pg_stat_progress_checkpoint can't show progress report for shutdown and end-of-recovery checkpoint, I think you need to specify that here in wal.sgml and checkpoint.sgml. + command <command>CHECKPOINT</command>. The checkpointer process running the + checkpoint will report its progress in the + <structname>pg_stat_progress_checkpoint</structname> view. See + <xref linkend="checkpoint-progress-reporting"/> for details.9) Can you add a test case for pg_stat_progress_checkpoint view? I
think it's good to add one. See, below for reference:
-- Add a trigger to catch and print the contents of the catalog view
-- pg_stat_progress_copy during data insertion. This allows to test
-- the validation of some progress reports for COPY FROM where the trigger
-- would fire.
create function notice_after_tab_progress_reporting() returns trigger AS
$$
declare report record;

10) Typo: it's not "is happens"
+        The checkpoint is happens without delays.

11) Can you be specific about what those "some operations" that forced a
checkpoint are? Maybe like basebackup, createdb or something?
+        The checkpoint is started because some operation forced a checkpoint.

12) Can you elaborate a bit here on who waits? Something like: the
backend that requested the checkpoint will wait until its completion ...
+        Wait for completion before returning.

13) "removing unneeded or flushing needed logical rewrite mapping files"
+        The checkpointer process is currently removing/flushing the logical

14) "old WAL files"
+        The checkpointer process is currently recycling old XLOG files.

I will take care of these in the next patch.
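On review comment 1 above, the shape of the suggested batching can be shown with a standalone mock. The slot numbers and the array-backed store below are invented for this sketch; the real pgstat_progress_update_multi_param lives in backend_progress.c and wraps the writes in a changecount critical section:

```c
#include <assert.h>

/* Slot numbers invented for this sketch. */
#define PROGRESS_CHECKPOINT_LSN   2
#define PROGRESS_CHECKPOINT_PHASE 5
#define N_PARAM 12

static long long st_progress_param[N_PARAM];

/* Single-value form: one call (and, in the backend, one critical
 * section) per parameter. */
static void
update_param(int idx, long long val)
{
    st_progress_param[idx] = val;
}

/* Batched form mirroring pgstat_progress_update_multi_param's
 * signature: a single call covers any number of slots. */
static void
update_multi_param(int n, const int *idx, const long long *val)
{
    for (int i = 0; i < n; i++)
        st_progress_param[idx[i]] = val[i];
}
```

The two consecutive update_param calls quoted in comment 1 would collapse into one update_multi_param call with a two-entry index/value pair, exactly as checkpoint_progress_start already does for its five initial fields.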
Thanks & Regards,
Nitin Jadhav

On Wed, Mar 2, 2022 at 11:52 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Wed, Mar 2, 2022 at 4:45 PM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:

Also, how about special phases for SyncPostCheckpoint(),
SyncPreCheckpoint(), InvalidateObsoleteReplicationSlots(),
PreallocXlogFiles() (it currently pre-allocates only 1 WAL file, but
it might increase in the future (?)), TruncateSUBTRANS()?

SyncPreCheckpoint() is just incrementing a counter and
PreallocXlogFiles() currently pre-allocates only 1 WAL file. I feel
there is no need to add any phases for these as of now. We can add them in
the future if necessary. Added phases for SyncPostCheckpoint(),
InvalidateObsoleteReplicationSlots() and TruncateSUBTRANS().

Okay.
10) I'm not sure if it's discussed, how about adding the number of
snapshot/mapping files so far the checkpoint has processed in file
processing while loops of
CheckPointSnapBuild/CheckPointLogicalRewriteHeap? Sometimes, there can
be many logical snapshot or mapping files and users may be interested
in knowing the so-far-processed-file-count.

I had thought about this while sharing the v1 patch and mentioned my
views upthread. I feel it won't give meaningful progress information
(it can be treated as statistics). Hence it is not included. Thoughts?

Okay. If there are any complaints about it we can always add them later.
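Should that counter ever be added, the loop-side bookkeeping would be small. A standalone mock follows; the names are invented here, and the real loops are in CheckPointSnapBuild and CheckPointLogicalRewriteHeap:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for one slot of the backend progress array. */
static long long progress_files_processed;

static void
mock_progress_update(long long val)
{
    progress_files_processed = val;
}

/*
 * Walk nfiles serialized snapshot/mapping files, bumping the
 * so-far-processed count after each one is handled so that a monitoring
 * query sees the counter advance while the loop runs.
 */
static long long
process_files(size_t nfiles)
{
    long long processed = 0;

    for (size_t i = 0; i < nfiles; i++)
    {
        /* ... unlink or fsync the file here ... */
        processed++;
        mock_progress_update(processed);
    }
    return processed;
}
```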
As mentioned upthread, there can be multiple backends that request a
checkpoint, so unless we want to store an array of pids we should store the
number of backends that are waiting for a new checkpoint.

Yeah, you are right. Let's not go down that path of storing an array of
pids. I don't see a strong use-case for the pid of the process
requesting the checkpoint. If required, we can add it later once the
pg_stat_progress_checkpoint view gets in.

I don't think it's really necessary to give the pid list.
If you requested a new checkpoint, it doesn't matter if it's only your backend
that triggered it, another backend or a few dozen others; the result will be the
same and you have the information that the request has been seen. We could
store just a bool for that, but having a number instead also gives a bit more
information and may allow you to detect some broken logic in your client code
if it keeps increasing.

It's a good metric to show in the view but the information is not
readily available. Additional code is required to calculate the number
of requests. Is it worth doing that? I feel this can be added later if
required.

Yes, we can always add it later if required.
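For reference, the counter being discussed could be as simple as the following standalone mock (field and function names are invented; in checkpointer.c the increment would have to happen under ckpt_lck inside RequestCheckpoint):

```c
#include <assert.h>

/* Simplified stand-in for a field in CheckpointerShmemStruct. */
static int num_requesters;

/* Each backend asking for a checkpoint bumps the counter... */
static void
request_checkpoint(void)
{
    num_requesters++;
}

/* ...and the checkpointer resets it once the checkpoint completes,
 * since every waiter is released by the same checkpoint.  The returned
 * value is what the progress view would have displayed. */
static int
checkpoint_done(void)
{
    int served = num_requesters;

    num_requesters = 0;
    return served;
}
```

A steadily increasing value here, as noted above, would hint at client code requesting checkpoints in a loop.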
Please find the v4 patch attached and share your thoughts.
I reviewed the v4 patch; here are my comments:

1) Can we convert the below into pgstat_progress_update_multi_param, just
to avoid function calls?
    pgstat_progress_update_param(PROGRESS_CHECKPOINT_LSN, checkPoint.redo);
    pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,

2) Why are we not having a special phase for CheckPointReplicationOrigin,
as it does a good bunch of work (writing to disk, XLogFlush,
durable_rename), especially when max_replication_slots is large?

3) I don't think "requested" is necessary here, as it doesn't add any
value and it's not a checkpoint kind or such; you can remove it.

4) s/'recycling old XLOG files'/'recycling old WAL files'/
+            WHEN 16 THEN 'recycling old XLOG files'

5) Can we place the CREATE VIEW pg_stat_progress_checkpoint AS definition
next to pg_stat_progress_copy in system_views.sql? It looks like all
the progress reporting views are next to each other.

6) How about shutdown and end-of-recovery checkpoints? Are you planning
to have an ereport_startup_progress mechanism as 0002?

7) I think you don't need to call checkpoint_progress_start,
pgstat_progress_update_param, or any other progress reporting function
for the shutdown and end-of-recovery checkpoints, right?

8) Not for all kinds of checkpoints, right? pg_stat_progress_checkpoint
can't show a progress report for the shutdown and end-of-recovery
checkpoints; I think you need to specify that here in wal.sgml and
checkpoint.sgml.

+   command <command>CHECKPOINT</command>. The checkpointer process running the
+   checkpoint will report its progress in the
+   <structname>pg_stat_progress_checkpoint</structname> view. See
+   <xref linkend="checkpoint-progress-reporting"/> for details.

9) Can you add a test case for the pg_stat_progress_checkpoint view? I
think it's good to add one. See below for reference:
-- Add a trigger to catch and print the contents of the catalog view
-- pg_stat_progress_copy during data insertion. This allows to test
-- the validation of some progress reports for COPY FROM where the trigger
-- would fire.
create function notice_after_tab_progress_reporting() returns trigger AS
$$
declare report record;

10) Typo: it's not "is happens"
+        The checkpoint is happens without delays.

11) Can you be specific about what those "some operations" that forced a
checkpoint are? Maybe like basebackup, createdb or something?
+        The checkpoint is started because some operation forced a checkpoint.

12) Can you elaborate a bit here on who waits? Something like: the
backend that requested the checkpoint will wait until its completion ...
+        Wait for completion before returning.

13) "removing unneeded or flushing needed logical rewrite mapping files"
+        The checkpointer process is currently removing/flushing the logical

14) "old WAL files"
+        The checkpointer process is currently recycling old XLOG files.

Regards,
Bharath Rupireddy.
Attachments:
v5-0001-pg_stat_progress_checkpoint-view.patch (application/octet-stream)
From 8e1ec8bb49dfe7c753e11ac5c97a82751dc1d1b2 Mon Sep 17 00:00:00 2001
From: Nitin Jadhav <nitinjadhav@microsoft.com>
Date: Tue, 8 Mar 2022 14:36:48 +0000
Subject: [PATCH] pg_stat_progress_checkpoint view
---
doc/src/sgml/monitoring.sgml | 404 +++++++++++++++++++++++++-
doc/src/sgml/ref/checkpoint.sgml | 7 +
doc/src/sgml/wal.sgml | 6 +-
src/backend/access/transam/xlog.c | 102 +++++++
src/backend/catalog/system_views.sql | 59 ++++
src/backend/postmaster/checkpointer.c | 4 +
src/backend/storage/buffer/bufmgr.c | 7 +
src/backend/storage/sync/sync.c | 6 +
src/backend/utils/adt/pgstatfuncs.c | 2 +
src/include/commands/progress.h | 38 +++
src/include/utils/backend_progress.h | 3 +-
src/test/regress/expected/rules.out | 106 +++++++
12 files changed, 741 insertions(+), 3 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 9fb62fec8e..18f4cb4221 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -401,6 +401,13 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
See <xref linkend='copy-progress-reporting'/>.
</entry>
</row>
+
+ <row>
+ <entry><structname>pg_stat_progress_checkpoint</structname><indexterm><primary>pg_stat_progress_checkpoint</primary></indexterm></entry>
+ <entry>One row only, showing the progress of the checkpoint.
+ See <xref linkend='checkpoint-progress-reporting'/>.
+ </entry>
+ </row>
</tbody>
</tgroup>
</table>
@@ -5556,7 +5563,7 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
which support progress reporting are <command>ANALYZE</command>,
<command>CLUSTER</command>,
<command>CREATE INDEX</command>, <command>VACUUM</command>,
- <command>COPY</command>,
+ <command>COPY</command>, <command>CHECKPOINT</command>
and <xref linkend="protocol-replication-base-backup"/> (i.e., replication
command that <xref linkend="app-pgbasebackup"/> issues to take
a base backup).
@@ -6844,6 +6851,401 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
</table>
</sect2>
+ <sect2 id="checkpoint-progress-reporting">
+ <title>Checkpoint Progress Reporting</title>
+
+ <indexterm>
+ <primary>pg_stat_progress_checkpoint</primary>
+ </indexterm>
+
+ <para>
+ Whenever the checkpoint operation is running, the
+ <structname>pg_stat_progress_checkpoint</structname> view will contain a
+ single row indicating the progress of the checkpoint. The tables below
+ describe the information that will be reported and provide information about
+ how to interpret it.
+ </para>
+
+ <table id="pg-stat-progress-checkpoint-view" xreflabel="pg_stat_progress_checkpoint">
+ <title><structname>pg_stat_progress_checkpoint</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>pid</structfield> <type>integer</type>
+ </para>
+ <para>
+ Process ID of the checkpointer process.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of the checkpoint. See <xref linkend="checkpoint-types"/>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>flags</structfield> <type>text</type>
+ </para>
+ <para>
+ Flags of the checkpoint. See <xref linkend="checkpoint-flags"/>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>next_flags</structfield> <type>text</type>
+ </para>
+ <para>
+ Flags of the next checkpoint. See <xref linkend="checkpoint-flags"/>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>start_lsn</structfield> <type>text</type>
+ </para>
+ <para>
+ The checkpoint start location.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>start_time</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Start time of the checkpoint.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>phase</structfield> <type>text</type>
+ </para>
+ <para>
+ Current processing phase. See <xref linkend="checkpoint-phases"/>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>buffers_total</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Total number of buffers to be written. This is estimated and reported
+ as of the beginning of buffer write operation.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>buffers_processed</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers processed. This counter increases when the targeted
+ buffer is processed. This number will eventually become equal to
+ <literal>buffers_total</literal> when the checkpoint is
+ complete.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>buffers_written</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers written. This counter only advances when the targeted
+ buffer is written. Note that some of the buffers are processed but may
+ not need to be written. So this count will always be less than or
+ equal to <literal>buffers_total</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>files_total</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Total number of files to be synced. This is estimated and reported as of
+ the beginning of sync operation.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>files_synced</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of files synced. This counter advances when the targeted file is
+ synced. This number will eventually become equal to
+ <literal>files_total</literal> when the checkpoint is complete.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <table id="checkpoint-types">
+ <title>Checkpoint Types</title>
+ <tgroup cols="2">
+ <colspec colname="col1" colwidth="1*"/>
+ <colspec colname="col2" colwidth="2*"/>
+ <thead>
+ <row>
+ <entry>Types</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry><literal>checkpoint</literal></entry>
+ <entry>
+ The current operation is checkpoint.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>restartpoint</literal></entry>
+ <entry>
+ The current operation is restartpoint.
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <table id="checkpoint-flags">
+ <title>Checkpoint Flags</title>
+ <tgroup cols="2">
+ <colspec colname="col1" colwidth="1*"/>
+ <colspec colname="col2" colwidth="2*"/>
+ <thead>
+ <row>
+ <entry>Flags</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry><literal>shutdown</literal></entry>
+ <entry>
+ The checkpoint is for shutdown.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>end-of-recovery</literal></entry>
+ <entry>
+ The checkpoint is for end-of-recovery.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>immediate</literal></entry>
+ <entry>
+ The checkpoint happens without delays.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>force</literal></entry>
+ <entry>
+ The checkpoint is started because some operation (for which the
+ checkpoint is necessary) forced a checkpoint.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>flush all</literal></entry>
+ <entry>
+ The checkpoint flushes all pages, including those belonging to unlogged
+ tables.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>wait</literal></entry>
+ <entry>
+ The operation which requested the checkpoint waits for completion
+ before returning.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>requested</literal></entry>
+ <entry>
+ The checkpoint request has been made.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>wal</literal></entry>
+ <entry>
+ The checkpoint is started because <literal>max_wal_size</literal> is
+ reached.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>time</literal></entry>
+ <entry>
+ The checkpoint is started because <literal>checkpoint_timeout</literal>
+ expired.
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <table id="checkpoint-phases">
+ <title>Checkpoint Phases</title>
+ <tgroup cols="2">
+ <colspec colname="col1" colwidth="1*"/>
+ <colspec colname="col2" colwidth="2*"/>
+ <thead>
+ <row>
+ <entry>Phase</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry><literal>initializing</literal></entry>
+ <entry>
+ The checkpointer process is preparing to begin the checkpoint operation.
+ This phase is expected to be very brief.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>getting virtual transaction IDs</literal></entry>
+ <entry>
+ The checkpointer process is getting the virtual transaction IDs that
+ are delaying the checkpoint.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing replication slots</literal></entry>
+ <entry>
+ The checkpointer process is currently flushing all the replication slots
+ to disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing logical replication snapshot files</literal></entry>
+ <entry>
+ The checkpointer process is currently removing all the serialized
+ snapshot files that are not required anymore.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing logical rewrite mapping files</literal></entry>
+ <entry>
+ The checkpointer process is currently removing unwanted or flushing
+ required logical rewrite mapping files.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing replication origin</literal></entry>
+ <entry>
+ The checkpointer process is currently performing a checkpoint of each
+ replication origin's progress with respect to the replayed remote LSN.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing commit log pages</literal></entry>
+ <entry>
+ The checkpointer process is currently writing commit log pages to disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing commit time stamp pages</literal></entry>
+ <entry>
+ The checkpointer process is currently writing commit time stamp pages to
+ disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing subtransaction pages</literal></entry>
+ <entry>
+ The checkpointer process is currently writing subtransaction pages to disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing multixact pages</literal></entry>
+ <entry>
+ The checkpointer process is currently writing multixact pages to disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing predicate lock pages</literal></entry>
+ <entry>
+ The checkpointer process is currently writing predicate lock pages to disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing buffers</literal></entry>
+ <entry>
+ The checkpointer process is currently writing buffers to disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>processing file sync requests</literal></entry>
+ <entry>
+ The checkpointer process is currently processing file sync requests.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>performing two phase checkpoint</literal></entry>
+ <entry>
+ The checkpointer process is currently performing two phase checkpoint.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>performing post checkpoint cleanup</literal></entry>
+ <entry>
+ The checkpointer process is currently performing post checkpoint cleanup.
+ It removes any lingering files that can be safely removed.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>invalidating replication slots</literal></entry>
+ <entry>
+ The checkpointer process is currently invalidating replication slots.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>recycling old WAL files</literal></entry>
+ <entry>
+ The checkpointer process is currently recycling old WAL files.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>truncating subtransactions</literal></entry>
+ <entry>
+ The checkpointer process is currently removing the subtransaction
+ segments.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>finalizing</literal></entry>
+ <entry>
+ The checkpointer process is finalizing the checkpoint operation.
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ </sect2>
+
</sect1>
<sect1 id="dynamic-trace">
diff --git a/doc/src/sgml/ref/checkpoint.sgml b/doc/src/sgml/ref/checkpoint.sgml
index 1cebc03d15..f33db50cfc 100644
--- a/doc/src/sgml/ref/checkpoint.sgml
+++ b/doc/src/sgml/ref/checkpoint.sgml
@@ -56,6 +56,13 @@ CHECKPOINT
the <link linkend="predefined-roles-table"><literal>pg_checkpointer</literal></link>
role can call <command>CHECKPOINT</command>.
</para>
+
+ <para>
+ The checkpointer process running the checkpoint will report its progress
+ in the <structname>pg_stat_progress_checkpoint</structname> view except for
+ the shutdown and end-of-recovery cases. See
+ <xref linkend="checkpoint-progress-reporting"/> for details.
+ </para>
</refsect1>
<refsect1>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index 2bb27a8468..8520304abb 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -530,7 +530,11 @@
adjust the <xref linkend="guc-archive-timeout"/> parameter rather than the
checkpoint parameters.)
It is also possible to force a checkpoint by using the SQL
- command <command>CHECKPOINT</command>.
+ command <command>CHECKPOINT</command>. The checkpointer process running the
+ checkpoint will report its progress in the
+ <structname>pg_stat_progress_checkpoint</structname> view except for the
+ shutdown and end-of-recovery cases. See
+ <xref linkend="checkpoint-progress-reporting"/> for details.
</para>
<para>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0d2bd7a357..05dabc6a94 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -65,6 +65,7 @@
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
#include "catalog/pg_database.h"
+#include "commands/progress.h"
#include "common/controldata_utils.h"
#include "executor/instrument.h"
#include "miscadmin.h"
@@ -719,6 +720,8 @@ static void WALInsertLockAcquireExclusive(void);
static void WALInsertLockRelease(void);
static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
+static void checkpoint_progress_start(int flags, int type);
+
/*
* Insert an XLOG record represented by an already-constructed chain of data
* chunks. This is a low-level routine; to construct the WAL record header
@@ -6419,6 +6422,9 @@ CreateCheckPoint(int flags)
XLogCtl->RedoRecPtr = checkPoint.redo;
SpinLockRelease(&XLogCtl->info_lck);
+ /* Prepare to report progress of the checkpoint. */
+ checkpoint_progress_start(flags, PROGRESS_CHECKPOINT_TYPE_CHECKPOINT);
+
/*
* If enabled, log checkpoint start. We postpone this until now so as not
* to log anything if we decided to skip the checkpoint.
@@ -6501,6 +6507,8 @@ CreateCheckPoint(int flags)
* and we will correctly flush the update below. So we cannot miss any
* xacts we need to wait for.
*/
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_GET_VIRTUAL_TRANSACTION_IDS);
vxids = GetVirtualXIDsDelayingChkpt(&nvxids);
if (nvxids > 0)
{
@@ -6604,6 +6612,8 @@ CreateCheckPoint(int flags)
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_POST_CHECKPOINT_CLEANUP);
SyncPostCheckpoint();
/*
@@ -6619,6 +6629,9 @@ CreateCheckPoint(int flags)
*/
XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
KeepLogSeg(recptr, &_logSegNo);
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_INVALIDATE_REPLI_SLOTS);
+
if (InvalidateObsoleteReplicationSlots(_logSegNo))
{
/*
@@ -6629,6 +6642,8 @@ CreateCheckPoint(int flags)
KeepLogSeg(recptr, &_logSegNo);
}
_logSegNo--;
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_RECYCLE_OLD_XLOG);
RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr,
checkPoint.ThisTimeLineID);
@@ -6647,11 +6662,21 @@ CreateCheckPoint(int flags)
* StartupSUBTRANS hasn't been called yet.
*/
if (!RecoveryInProgress())
+ {
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_TRUNCATE_SUBTRANS);
TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());
+ }
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_FINALIZE);
/* Real work is done; log and update stats. */
LogCheckpointEnd(false);
+ /* Stop reporting progress of the checkpoint. */
+ pgstat_progress_end_command();
+
/* Reset the process title */
update_checkpoint_display(flags, false, true);
@@ -6808,29 +6833,63 @@ static void
CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
CheckPointRelationMap();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_REPLI_SLOTS);
CheckPointReplicationSlots();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_SNAPSHOTS);
CheckPointSnapBuild();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_LOGICAL_REWRITE_MAPPINGS);
CheckPointLogicalRewriteHeap();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_REPLI_ORIGIN);
CheckPointReplicationOrigin();
/* Write out all dirty data in SLRUs and the main buffer pool */
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_CLOG_PAGES);
CheckPointCLOG();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_COMMITTS_PAGES);
CheckPointCommitTs();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_SUBTRANS_PAGES);
CheckPointSUBTRANS();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_MULTIXACT_PAGES);
CheckPointMultiXact();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_PREDICATE_LOCK_PAGES);
CheckPointPredicate();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_BUFFERS);
CheckPointBuffers(flags);
/* Perform all queued up fsyncs */
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_SYNC_FILES);
ProcessSyncRequests();
CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();
/* We deliberately delay 2PC checkpointing as long as possible */
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_TWO_PHASE);
CheckPointTwoPhase(checkPointRedo);
}
@@ -6977,6 +7036,9 @@ CreateRestartPoint(int flags)
MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
+ /* Prepare to report progress of the restartpoint. */
+ checkpoint_progress_start(flags, PROGRESS_CHECKPOINT_TYPE_RESTARTPOINT);
+
if (log_checkpoints)
LogCheckpointStart(flags, true);
@@ -7051,6 +7113,9 @@ CreateRestartPoint(int flags)
replayPtr = GetXLogReplayRecPtr(&replayTLI);
endptr = (receivePtr < replayPtr) ? replayPtr : receivePtr;
KeepLogSeg(endptr, &_logSegNo);
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_INVALIDATE_REPLI_SLOTS);
+
if (InvalidateObsoleteReplicationSlots(_logSegNo))
{
/*
@@ -7077,6 +7142,8 @@ CreateRestartPoint(int flags)
if (!RecoveryInProgress())
replayTLI = XLogCtl->InsertTimeLineID;
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_RECYCLE_OLD_XLOG);
RemoveOldXlogFiles(_logSegNo, RedoRecPtr, endptr, replayTLI);
/*
@@ -7093,11 +7160,20 @@ CreateRestartPoint(int flags)
* this because StartupSUBTRANS hasn't been called yet.
*/
if (EnableHotStandby)
+ {
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_TRUNCATE_SUBTRANS);
TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());
+ }
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_FINALIZE);
/* Real work is done; log and update stats. */
LogCheckpointEnd(true);
+ /* Stop reporting progress of the restartpoint. */
+ pgstat_progress_end_command();
+
/* Reset the process title */
update_checkpoint_display(flags, true, true);
@@ -9197,3 +9273,29 @@ SetWalWriterSleeping(bool sleeping)
XLogCtl->WalWriterSleeping = sleeping;
SpinLockRelease(&XLogCtl->info_lck);
}
+
+/*
+ * Start reporting progress of the checkpoint.
+ */
+static void
+checkpoint_progress_start(int flags, int type)
+{
+ const int index[] = {
+ PROGRESS_CHECKPOINT_TYPE,
+ PROGRESS_CHECKPOINT_FLAGS,
+ PROGRESS_CHECKPOINT_LSN,
+ PROGRESS_CHECKPOINT_START_TIMESTAMP,
+ PROGRESS_CHECKPOINT_PHASE
+ };
+ int64 val[5];
+
+ pgstat_progress_start_command(PROGRESS_COMMAND_CHECKPOINT, InvalidOid);
+
+ val[0] = type;
+ val[1] = flags;
+ val[2] = RedoRecPtr;
+ val[3] = CheckpointStats.ckpt_start_t;
+ val[4] = PROGRESS_CHECKPOINT_PHASE_INIT;
+
+ pgstat_progress_update_multi_param(5, index, val);
+}
\ No newline at end of file
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 40b7bca5a9..bd8583b038 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1228,6 +1228,65 @@ CREATE VIEW pg_stat_progress_copy AS
FROM pg_stat_get_progress_info('COPY') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
+CREATE VIEW pg_stat_progress_checkpoint AS
+ SELECT
+ S.pid AS pid,
+ CASE S.param1 WHEN 1 THEN 'checkpoint'
+ WHEN 2 THEN 'restartpoint'
+ END AS type,
+ ( CASE WHEN (S.param2 & 1) > 0 THEN 'shutdown ' ELSE '' END ||
+ CASE WHEN (S.param2 & 2) > 0 THEN 'end-of-recovery ' ELSE '' END ||
+ CASE WHEN (S.param2 & 4) > 0 THEN 'immediate ' ELSE '' END ||
+ CASE WHEN (S.param2 & 8) > 0 THEN 'force ' ELSE '' END ||
+ CASE WHEN (S.param2 & 16) > 0 THEN 'flush-all ' ELSE '' END ||
+ CASE WHEN (S.param2 & 32) > 0 THEN 'wait ' ELSE '' END ||
+ CASE WHEN (S.param2 & 128) > 0 THEN 'wal ' ELSE '' END ||
+ CASE WHEN (S.param2 & 256) > 0 THEN 'time ' ELSE '' END
+ ) AS flags,
+ ( CASE WHEN (S.param3 & 1) > 0 THEN 'shutdown ' ELSE '' END ||
+ CASE WHEN (S.param3 & 2) > 0 THEN 'end-of-recovery ' ELSE '' END ||
+ CASE WHEN (S.param3 & 4) > 0 THEN 'immediate ' ELSE '' END ||
+ CASE WHEN (S.param3 & 8) > 0 THEN 'force ' ELSE '' END ||
+ CASE WHEN (S.param3 & 16) > 0 THEN 'flush-all ' ELSE '' END ||
+ CASE WHEN (S.param3 & 32) > 0 THEN 'wait ' ELSE '' END ||
+ CASE WHEN (S.param3 & 128) > 0 THEN 'wal ' ELSE '' END ||
+ CASE WHEN (S.param3 & 256) > 0 THEN 'time ' ELSE '' END
+ ) AS next_flags,
+ ( '0/0'::pg_lsn +
+ ((CASE
+ WHEN S.param4 < 0 THEN pow(2::numeric, 64::numeric)::numeric
+ ELSE 0::numeric
+ END) +
+ S.param4::numeric)
+ ) AS start_lsn,
+ to_timestamp(946684800 + (S.param5::float8 / 1000000)) AS start_time,
+ CASE S.param6 WHEN 1 THEN 'initializing'
+ WHEN 2 THEN 'getting virtual transaction IDs'
+ WHEN 3 THEN 'checkpointing replication slots'
+ WHEN 4 THEN 'checkpointing logical replication snapshot files'
+ WHEN 5 THEN 'checkpointing logical rewrite mapping files'
+ WHEN 6 THEN 'checkpointing replication origin'
+ WHEN 7 THEN 'checkpointing commit log pages'
+ WHEN 8 THEN 'checkpointing commit time stamp pages'
+ WHEN 9 THEN 'checkpointing subtransaction pages'
+ WHEN 10 THEN 'checkpointing multixact pages'
+ WHEN 11 THEN 'checkpointing predicate lock pages'
+ WHEN 12 THEN 'checkpointing buffers'
+ WHEN 13 THEN 'processing file sync requests'
+ WHEN 14 THEN 'performing two phase checkpoint'
+ WHEN 15 THEN 'performing post checkpoint cleanup'
+ WHEN 16 THEN 'invalidating replication slots'
+ WHEN 17 THEN 'recycling old WAL files'
+ WHEN 18 THEN 'truncating subtransactions'
+ WHEN 19 THEN 'finalizing'
+ END AS phase,
+ S.param7 AS buffers_total,
+ S.param8 AS buffers_processed,
+ S.param9 AS buffers_written,
+ S.param10 AS files_total,
+ S.param11 AS files_synced
+ FROM pg_stat_get_progress_info('CHECKPOINT') AS S;
+
CREATE VIEW pg_user_mappings AS
SELECT
U.oid AS umid,
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 4488e3a443..9b7441ed40 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -39,6 +39,7 @@
#include "access/xlog.h"
#include "access/xlog_internal.h"
#include "access/xlogrecovery.h"
+#include "commands/progress.h"
#include "libpq/pqsignal.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -662,6 +663,9 @@ ImmediateCheckpointRequested(void)
{
volatile CheckpointerShmemStruct *cps = CheckpointerShmem;
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_NEXT_CHECKPOINT_FLAGS,
+ cps->ckpt_flags);
+
/*
* We don't need to acquire the ckpt_lck in this case because we're only
* looking at a single flag bit.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f5459c68f8..9663035d7a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -38,6 +38,7 @@
#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/storage.h"
+#include "commands/progress.h"
#include "executor/instrument.h"
#include "lib/binaryheap.h"
#include "miscadmin.h"
@@ -2012,6 +2013,8 @@ BufferSync(int flags)
WritebackContextInit(&wb_context, &checkpoint_flush_after);
TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_scan);
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_BUFFERS_TOTAL,
+ num_to_scan);
/*
* Sort buffers that need to be written to reduce the likelihood of random
@@ -2129,6 +2132,8 @@ BufferSync(int flags)
bufHdr = GetBufferDescriptor(buf_id);
num_processed++;
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_BUFFERS_PROCESSED,
+ num_processed);
/*
* We don't need to acquire the lock here, because we're only looking
@@ -2149,6 +2154,8 @@ BufferSync(int flags)
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
PendingCheckpointerStats.m_buf_written_checkpoints++;
num_written++;
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_BUFFERS_WRITTEN,
+ num_written);
}
}
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index e161d57761..638d3eb781 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -23,6 +23,7 @@
#include "access/multixact.h"
#include "access/xlog.h"
#include "access/xlogutils.h"
+#include "commands/progress.h"
#include "commands/tablespace.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -356,6 +357,9 @@ ProcessSyncRequests(void)
/* Now scan the hashtable for fsync requests to process */
absorb_counter = FSYNCS_PER_ABSORB;
hash_seq_init(&hstat, pendingOps);
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_FILES_TOTAL,
+ hash_get_num_entries(pendingOps));
+
while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
{
int failures;
@@ -419,6 +423,8 @@ ProcessSyncRequests(void)
longest = elapsed;
total_elapsed += elapsed;
processed++;
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_FILES_SYNCED,
+ processed);
if (log_checkpoints)
elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f ms",
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index fd993d0d5f..95df730415 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -494,6 +494,8 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
cmdtype = PROGRESS_COMMAND_BASEBACKUP;
else if (pg_strcasecmp(cmd, "COPY") == 0)
cmdtype = PROGRESS_COMMAND_COPY;
+ else if (pg_strcasecmp(cmd, "CHECKPOINT") == 0)
+ cmdtype = PROGRESS_COMMAND_CHECKPOINT;
else
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index a28938caf4..316886e6f0 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -151,4 +151,42 @@
#define PROGRESS_COPY_TYPE_PIPE 3
#define PROGRESS_COPY_TYPE_CALLBACK 4
+/* Progress parameters for checkpoint */
+#define PROGRESS_CHECKPOINT_TYPE 0
+#define PROGRESS_CHECKPOINT_FLAGS 1
+#define PROGRESS_CHECKPOINT_NEXT_CHECKPOINT_FLAGS 2
+#define PROGRESS_CHECKPOINT_LSN 3
+#define PROGRESS_CHECKPOINT_START_TIMESTAMP 4
+#define PROGRESS_CHECKPOINT_PHASE 5
+#define PROGRESS_CHECKPOINT_BUFFERS_TOTAL 6
+#define PROGRESS_CHECKPOINT_BUFFERS_PROCESSED 7
+#define PROGRESS_CHECKPOINT_BUFFERS_WRITTEN 8
+#define PROGRESS_CHECKPOINT_FILES_TOTAL 9
+#define PROGRESS_CHECKPOINT_FILES_SYNCED 10
+
+/* Types of checkpoint (as advertised via PROGRESS_CHECKPOINT_TYPE) */
+#define PROGRESS_CHECKPOINT_TYPE_CHECKPOINT 1
+#define PROGRESS_CHECKPOINT_TYPE_RESTARTPOINT 2
+
+/* Phases of checkpoint (as advertised via PROGRESS_CHECKPOINT_PHASE) */
+#define PROGRESS_CHECKPOINT_PHASE_INIT 1
+#define PROGRESS_CHECKPOINT_PHASE_GET_VIRTUAL_TRANSACTION_IDS 2
+#define PROGRESS_CHECKPOINT_PHASE_REPLI_SLOTS 3
+#define PROGRESS_CHECKPOINT_PHASE_SNAPSHOTS 4
+#define PROGRESS_CHECKPOINT_PHASE_LOGICAL_REWRITE_MAPPINGS 5
+#define PROGRESS_CHECKPOINT_PHASE_REPLI_ORIGIN 6
+#define PROGRESS_CHECKPOINT_PHASE_CLOG_PAGES 7
+#define PROGRESS_CHECKPOINT_PHASE_COMMITTS_PAGES 8
+#define PROGRESS_CHECKPOINT_PHASE_SUBTRANS_PAGES 9
+#define PROGRESS_CHECKPOINT_PHASE_MULTIXACT_PAGES 10
+#define PROGRESS_CHECKPOINT_PHASE_PREDICATE_LOCK_PAGES 11
+#define PROGRESS_CHECKPOINT_PHASE_BUFFERS 12
+#define PROGRESS_CHECKPOINT_PHASE_SYNC_FILES 13
+#define PROGRESS_CHECKPOINT_PHASE_TWO_PHASE 14
+#define PROGRESS_CHECKPOINT_PHASE_POST_CHECKPOINT_CLEANUP 15
+#define PROGRESS_CHECKPOINT_PHASE_INVALIDATE_REPLI_SLOTS 16
+#define PROGRESS_CHECKPOINT_PHASE_RECYCLE_OLD_XLOG 17
+#define PROGRESS_CHECKPOINT_PHASE_TRUNCATE_SUBTRANS 18
+#define PROGRESS_CHECKPOINT_PHASE_FINALIZE 19
+
#endif
diff --git a/src/include/utils/backend_progress.h b/src/include/utils/backend_progress.h
index 47bf8029b0..02d51fb948 100644
--- a/src/include/utils/backend_progress.h
+++ b/src/include/utils/backend_progress.h
@@ -27,7 +27,8 @@ typedef enum ProgressCommandType
PROGRESS_COMMAND_CLUSTER,
PROGRESS_COMMAND_CREATE_INDEX,
PROGRESS_COMMAND_BASEBACKUP,
- PROGRESS_COMMAND_COPY
+ PROGRESS_COMMAND_COPY,
+ PROGRESS_COMMAND_CHECKPOINT
} ProgressCommandType;
#define PGSTAT_NUM_PROGRESS_PARAM 20
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index ac468568a1..b53a29e372 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1897,6 +1897,112 @@ pg_stat_progress_basebackup| SELECT s.pid,
s.param4 AS tablespaces_total,
s.param5 AS tablespaces_streamed
FROM pg_stat_get_progress_info('BASEBACKUP'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20);
+pg_stat_progress_checkpoint| SELECT s.pid,
+ CASE s.param1
+ WHEN 1 THEN 'checkpoint'::text
+ WHEN 2 THEN 'restartpoint'::text
+ ELSE NULL::text
+ END AS type,
+ (((((((
+ CASE
+ WHEN ((s.param2 & (1)::bigint) > 0) THEN 'shutdown '::text
+ ELSE ''::text
+ END ||
+ CASE
+ WHEN ((s.param2 & (2)::bigint) > 0) THEN 'end-of-recovery '::text
+ ELSE ''::text
+ END) ||
+ CASE
+ WHEN ((s.param2 & (4)::bigint) > 0) THEN 'immediate '::text
+ ELSE ''::text
+ END) ||
+ CASE
+ WHEN ((s.param2 & (8)::bigint) > 0) THEN 'force '::text
+ ELSE ''::text
+ END) ||
+ CASE
+ WHEN ((s.param2 & (16)::bigint) > 0) THEN 'flush-all '::text
+ ELSE ''::text
+ END) ||
+ CASE
+ WHEN ((s.param2 & (32)::bigint) > 0) THEN 'wait '::text
+ ELSE ''::text
+ END) ||
+ CASE
+ WHEN ((s.param2 & (128)::bigint) > 0) THEN 'wal '::text
+ ELSE ''::text
+ END) ||
+ CASE
+ WHEN ((s.param2 & (256)::bigint) > 0) THEN 'time '::text
+ ELSE ''::text
+ END) AS flags,
+ (((((((
+ CASE
+ WHEN ((s.param3 & (1)::bigint) > 0) THEN 'shutdown '::text
+ ELSE ''::text
+ END ||
+ CASE
+ WHEN ((s.param3 & (2)::bigint) > 0) THEN 'end-of-recovery '::text
+ ELSE ''::text
+ END) ||
+ CASE
+ WHEN ((s.param3 & (4)::bigint) > 0) THEN 'immediate '::text
+ ELSE ''::text
+ END) ||
+ CASE
+ WHEN ((s.param3 & (8)::bigint) > 0) THEN 'force '::text
+ ELSE ''::text
+ END) ||
+ CASE
+ WHEN ((s.param3 & (16)::bigint) > 0) THEN 'flush-all '::text
+ ELSE ''::text
+ END) ||
+ CASE
+ WHEN ((s.param3 & (32)::bigint) > 0) THEN 'wait '::text
+ ELSE ''::text
+ END) ||
+ CASE
+ WHEN ((s.param3 & (128)::bigint) > 0) THEN 'wal '::text
+ ELSE ''::text
+ END) ||
+ CASE
+ WHEN ((s.param3 & (256)::bigint) > 0) THEN 'time '::text
+ ELSE ''::text
+ END) AS next_flags,
+ ('0/0'::pg_lsn + (
+ CASE
+ WHEN (s.param4 < 0) THEN pow((2)::numeric, (64)::numeric)
+ ELSE (0)::numeric
+ END + (s.param4)::numeric)) AS start_lsn,
+ to_timestamp(((946684800)::double precision + ((s.param5)::double precision / (1000000)::double precision))) AS start_time,
+ CASE s.param6
+ WHEN 1 THEN 'initializing'::text
+ WHEN 2 THEN 'getting virtual transaction IDs'::text
+ WHEN 3 THEN 'checkpointing replication slots'::text
+ WHEN 4 THEN 'checkpointing logical replication snapshot files'::text
+ WHEN 5 THEN 'checkpointing logical rewrite mapping files'::text
+ WHEN 6 THEN 'checkpointing replication origin'::text
+ WHEN 7 THEN 'checkpointing commit log pages'::text
+ WHEN 8 THEN 'checkpointing commit time stamp pages'::text
+ WHEN 9 THEN 'checkpointing subtransaction pages'::text
+ WHEN 10 THEN 'checkpointing multixact pages'::text
+ WHEN 11 THEN 'checkpointing predicate lock pages'::text
+ WHEN 12 THEN 'checkpointing buffers'::text
+ WHEN 13 THEN 'processing file sync requests'::text
+ WHEN 14 THEN 'performing two phase checkpoint'::text
+ WHEN 15 THEN 'performing post checkpoint cleanup'::text
+ WHEN 16 THEN 'invalidating replication slots'::text
+ WHEN 17 THEN 'recycling old WAL files'::text
+ WHEN 18 THEN 'truncating subtransactions'::text
+ WHEN 19 THEN 'finalizing'::text
+ ELSE NULL::text
+ END AS phase,
+ s.param7 AS buffers_total,
+ s.param8 AS buffers_processed,
+ s.param9 AS buffers_written,
+ s.param10 AS files_total,
+ s.param11 AS files_synced
+ FROM pg_stat_get_progress_info('CHECKPOINT'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20);
pg_stat_progress_cluster| SELECT s.pid,
s.datid,
d.datname,
--
2.25.1
>> [local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
>> -[ RECORD 1 ]-----+-------------------------------------
>> pid               | 22043
>> type              | checkpoint
>> kind              | immediate force wait requested time
>>
>> I think the output in the kind column can be displayed as {immediate,
>> force, wait, requested, time}. By the way these are all checkpoint
>> flags so it is better to display it as checkpoint flags instead of
>> checkpoint kind as mentioned in one of my previous comments.
>
> I will update in the next patch.
The current format matches with the server log message for the
checkpoint start in LogCheckpointStart(). Just to be consistent, I
have not changed the code.
I have taken care of the rest of the comments in v5 patch for which
there was clarity.
Thanks & Regards,
Nitin Jadhav
On Mon, Mar 7, 2022 at 8:15 PM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
> + <row>
> +  <entry role="catalog_table_entry"><para role="column_definition">
> +   <structfield>type</structfield> <type>text</type>
> +  </para>
> +  <para>
> +   Type of checkpoint. See <xref linkend="checkpoint-types"/>.
> +  </para></entry>
> + </row>
> +
> + <row>
> +  <entry role="catalog_table_entry"><para role="column_definition">
> +   <structfield>kind</structfield> <type>text</type>
> +  </para>
> +  <para>
> +   Kind of checkpoint. See <xref linkend="checkpoint-kinds"/>.
> +  </para></entry>
> + </row>
>
> This looks a bit confusing. Two columns, one with the name "checkpoint
> types" and another "checkpoint kinds". You can probably rename
> checkpoint-kinds to checkpoint-flags and let the checkpoint-types be
> as-it-is.

Makes sense. I will change in the next patch.

---

> + <entry><structname>pg_stat_progress_checkpoint</structname><indexterm><primary>pg_stat_progress_checkpoint</primary></indexterm></entry>
> + <entry>One row only, showing the progress of the checkpoint.
>
> Let's make this message consistent with the already existing message
> for pg_stat_wal_receiver. See description for pg_stat_wal_receiver
> view in "Dynamic Statistics Views" table.

You want me to change "One row only" to "Only one row"? If that is
the case then for other views in the "Collected Statistics Views"
table, it is referred to as "One row only". Let me know if you are
pointing out something else.
---

> [local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
> -[ RECORD 1 ]-----+-------------------------------------
> pid               | 22043
> type              | checkpoint
> kind              | immediate force wait requested time
> start_lsn         | 0/14C60F8
> start_time        | 2022-03-03 18:59:56.018662+05:30
> phase             | performing two phase checkpoint
>
> This is the output I see when the checkpointer process has come out of
> the two phase checkpoint and is currently writing checkpoint xlog
> records and doing other stuff like updating control files etc. Is this
> okay?

The idea behind choosing the phases was to cover the operations that
take longer to execute. Since the steps between the two phase
checkpoint and the post checkpoint cleanup do not take much time, I
have not added an additional phase there. But I also agree that this
gives wrong information to the user. How about reporting the phase at
the end of each operation, like "initializing", "initialization
done", ..., "two phase checkpoint done", "post checkpoint cleanup
done", ..., "finalizing"? Except for the first phase ("initializing")
and the last phase ("finalizing"), all the other phases would describe
the end of a certain operation. I feel this gives correct information
even though the phase name/description does not represent the entire
code block between two phases. For example, if the current phase is
"two phase checkpoint done", the user can infer that the checkpointer
has finished the two phase checkpoint and is now doing the work that
follows it. Thoughts?
---

> The output of log_checkpoint shows the number of buffers written is 3
> whereas the output of pg_stat_progress_checkpoint shows it as 0. See
> below:
>
> 2022-03-03 20:04:45.643 IST [22043] LOG: checkpoint complete: wrote 3
> buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled;
> write=24.652 s, sync=104.256 s, total=3889.625 s; sync files=2,
> longest=0.011 s, average=0.008 s; distance=0 kB, estimate=0 kB
>
> [local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
> -[ RECORD 1 ]-----+-------------------------------------
> pid               | 22043
> type              | checkpoint
> kind              | immediate force wait requested time
> start_lsn         | 0/14C60F8
> start_time        | 2022-03-03 18:59:56.018662+05:30
> phase             | finalizing
> buffers_total     | 0
> buffers_processed | 0
> buffers_written   | 0
>
> Any idea why this mismatch?

Good catch. In BufferSync() we have 'num_to_scan' (buffers_total),
which indicates the total number of buffers to be processed; the
'buffers_processed' and 'buffers_written' counters are incremented
based on it, so those values can reach at most 'buffers_total'. That is
what the current pg_stat_progress_checkpoint view reports. However,
'ckpt_bufs_written' is also incremented in another place,
SlruInternalWritePage(). Those writes are over and above
'buffers_total'; they are included in the server log message
(checkpoint end) but not in the view. I am a bit confused here: if we
include this increment in the view then we cannot calculate the exact
'buffers_total' beforehand. Can we increment 'buffers_total' as well
when 'ckpt_bufs_written' gets incremented, so that the view matches the
checkpoint end message? Please share your thoughts.
---

> I think we can add a couple of more information to this view -
> start_time for buffer write operation and start_time for buffer sync
> operation. These are two very time consuming tasks in a checkpoint and
> people would find it useful to know how much time is being taken by
> the checkpoint in I/O operation phase. Thoughts?

Detailed progress is already shown for these two phases of the
checkpoint via 'buffers_processed', 'buffers_written' and
'files_synced'. Hence I did not think about adding start times; if
they are really required, I can add them.

Thanks & Regards,
Nitin Jadhav

On Thu, Mar 3, 2022 at 8:30 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Here are some of my review comments on the latest patch:
>
> [...]
>
> --
> With Regards,
> Ashutosh Sharma.

On Wed, Mar 2, 2022 at 4:45 PM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:

Thanks for reviewing.

>>> I suggested upthread to store the starting timeline instead. This way you can
>>> deduce whether it's a restartpoint or a checkpoint, but you can also deduce
>>> other information, like what was the starting WAL.
>>
>> I don't understand why we need the timeline here to just determine
>> whether it's a restartpoint or checkpoint.
>
> I'm not saying it's necessary, I'm saying that for the same space usage we can
> store something a bit more useful. If no one cares about having the starting
> timeline available for no extra cost then sure, let's just store the kind
> directly.

Fixed.

---

> 2) Can't we just have these checks inside CASE-WHEN-THEN-ELSE blocks directly
> instead of new function pg_stat_get_progress_checkpoint_kind?
>
> +        snprintf(ckpt_kind, MAXPGPATH, "%s%s%s%s%s%s%s%s%s",
> +                 (flags == 0) ? "unknown" : "",
> +                 (flags & CHECKPOINT_IS_SHUTDOWN) ? "shutdown " : "",
> +                 (flags & CHECKPOINT_END_OF_RECOVERY) ? "end-of-recovery " : "",
> +                 (flags & CHECKPOINT_IMMEDIATE) ? "immediate " : "",
> +                 (flags & CHECKPOINT_FORCE) ? "force " : "",
> +                 (flags & CHECKPOINT_WAIT) ? "wait " : "",
> +                 (flags & CHECKPOINT_CAUSE_XLOG) ? "wal " : "",
> +                 (flags & CHECKPOINT_CAUSE_TIME) ? "time " : "",
> +                 (flags & CHECKPOINT_FLUSH_ALL) ? "flush-all" : "");

Fixed.
---

> 5) Do we need a special phase for this checkpoint operation? I'm not
> sure in which cases it will take a long time, but it looks like
> there's a wait loop here.
>
>     vxids = GetVirtualXIDsDelayingChkpt(&nvxids);
>     if (nvxids > 0)
>     {
>         do
>         {
>             pg_usleep(10000L);    /* wait for 10 msec */
>         } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids));
>     }

Yes. It is better to add a separate phase here.

---

> Also, how about special phases for SyncPostCheckpoint(),
> SyncPreCheckpoint(), InvalidateObsoleteReplicationSlots(),
> PreallocXlogFiles() (it currently pre-allocates only 1 WAL file, but
> it might be increased in future (?)), TruncateSUBTRANS()?

SyncPreCheckpoint() is just incrementing a counter and
PreallocXlogFiles() currently pre-allocates only 1 WAL file. I feel
there is no need to add any phases for these as of now. We can add
them in the future if necessary. Added phases for SyncPostCheckpoint(),
InvalidateObsoleteReplicationSlots() and TruncateSUBTRANS().

---

> 6) SLRU (Simple LRU) isn't a phase here, you can just say
> PROGRESS_CHECKPOINT_PHASE_PREDICATE_LOCK_PAGES.
>
> +    pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
> +                                 PROGRESS_CHECKPOINT_PHASE_SLRU_PAGES);
>      CheckPointPredicate();
>
> And :s/checkpointing SLRU pages/checkpointing predicate lock pages
>
> + WHEN 9 THEN 'checkpointing SLRU pages'

Fixed.

---

> 7) :s/PROGRESS_CHECKPOINT_PHASE_FILE_SYNC/PROGRESS_CHECKPOINT_PHASE_PROCESS_FILE_SYNC_REQUESTS

I feel PROGRESS_CHECKPOINT_PHASE_FILE_SYNC is a better option here as
it describes the purpose in fewer words.

> And :s/WHEN 11 THEN 'performing sync requests'/WHEN 11 THEN
> 'processing file sync requests'

Fixed.

---

> 8) :s/Finalizing/finalizing
>
> + WHEN 14 THEN 'Finalizing'

Fixed.

---

> 9) :s/checkpointing snapshots/checkpointing logical replication snapshot files
>
> + WHEN 3 THEN 'checkpointing snapshots'
>
> :s/checkpointing logical rewrite mappings/checkpointing logical replication rewrite mapping files
>
> + WHEN 4 THEN 'checkpointing logical rewrite mappings'

Fixed.
---

> 10) I'm not sure if it's discussed, how about adding the number of
> snapshot/mapping files so far the checkpoint has processed in file
> processing while loops of
> CheckPointSnapBuild/CheckPointLogicalRewriteHeap? Sometimes, there can
> be many logical snapshot or mapping files and users may be interested
> in knowing the so-far-processed-file-count.

I had thought about this while sharing the v1 patch and mentioned my
views upthread. I feel it won't give meaningful progress information
(it can be treated as statistics). Hence not included. Thoughts?

---

>>> As mentioned upthread, there can be multiple backends that request a
>>> checkpoint, so unless we want to store an array of pid we should store a number
>>> of backend that are waiting for a new checkpoint.
>>
>> Yeah, you are right. Let's not go that path and store an array of
>> pids. I don't see a strong use-case with the pid of the process
>> requesting checkpoint. If required, we can add it later once the
>> pg_stat_progress_checkpoint view gets in.
>
> I don't think that's really necessary to give the pid list.
> If you requested a new checkpoint, it doesn't matter if it's only your backend
> that triggered it, another backend or a few other dozen, the result will be the
> same and you have the information that the request has been seen. We could
> store just a bool for that but having a number instead also gives a bit more
> information and may allow you to detect some broken logic on your client code
> if it keeps increasing.

It's a good metric to show in the view but the information is not
readily available. Additional code is required to calculate the number
of requests. Is it worth doing that? I feel this can be added later if
required.

Please find the v4 patch attached and share your thoughts.

Thanks & Regards,
Nitin Jadhav

On Tue, Mar 1, 2022 at 2:27 PM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:

>> 3) Why do we need this extra calculation for start_lsn? Do you ever see a
>> negative LSN or something here?
>>
>> +        ('0/0'::pg_lsn + (
>> +            CASE
>> +                WHEN (s.param3 < 0) THEN pow((2)::numeric, (64)::numeric)
>> +                ELSE (0)::numeric
>> +            END + (s.param3)::numeric)) AS start_lsn,
>
> Yes: LSN can take up all of an uint64; whereas the pgstat column is a
> bigint type; thus the signed int64. This cast is OK as it wraps
> around, but that means we have to take care to correctly display the
> LSN when it is > 0x7FFF_FFFF_FFFF_FFFF; which is what we do here using
> the special-casing for negative values.

Yes. The extra calculation is required here as we are storing a uint64
value in a variable of type int64. When we convert uint64 to int64,
the bit pattern is preserved (so no data is lost), but the high-order
bit becomes the sign bit, and if it is set, both the sign and
magnitude of the value change. To safely get back the uint64 value
that was originally assigned, we need the above calculation.

>> 4) Can't you use timestamptz_in(to_char(s.param4)) instead of
>> pg_stat_get_progress_checkpoint_start_time? I don't quite understand
>> the reasoning for having this function and it's named as *checkpoint*
>> when it doesn't do anything specific to the checkpoint at all?
>
> I hadn't thought of using the types' inout functions, but it looks
> like timestamp IO functions use a formatted timestring, which won't
> work with the epoch-based timestamp stored in the view.

There is a variation of to_timestamp() which takes a UNIX epoch
(float8) as an argument and converts it to timestamptz, but we cannot
directly call this function with S.param4:

    TimestampTz
    GetCurrentTimestamp(void)
    {
        TimestampTz result;
        struct timeval tp;

        gettimeofday(&tp, NULL);

        result = (TimestampTz) tp.tv_sec -
            ((POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) * SECS_PER_DAY);
        result = (result * USECS_PER_SEC) + tp.tv_usec;

        return result;
    }

S.param4 contains the output of the above function
(GetCurrentTimestamp()), which returns a Postgres epoch, but
to_timestamp() expects a UNIX epoch as input. So some calculation is
required here. I feel the SQL 'to_timestamp(946684800 +
(S.param4::float / 1000000)) AS start_time' works fine. The value
'946684800' is equal to ((POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) *
SECS_PER_DAY). I am not sure whether it is good practice to use this
way. Kindly share your thoughts.

Thanks & Regards,
Nitin Jadhav

On Mon, Feb 28, 2022 at 6:40 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

On Sun, 27 Feb 2022 at 16:14, Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

> 3) Why do we need this extra calculation for start_lsn? Do you ever see a
> negative LSN or something here?
> [...]

Yes: LSN can take up all of an uint64; whereas the pgstat column is a
bigint type; thus the signed int64. This cast is OK as it wraps
around, but that means we have to take care to correctly display the
LSN when it is > 0x7FFF_FFFF_FFFF_FFFF; which is what we do here using
the special-casing for negative values.

As to whether it is reasonable: Generating 16GB of wal every second
(2^34 bytes/sec) is probably not impossible (cpu <> memory bandwidth
has been > 20GB/sec for a while); and that leaves you 2^29 seconds of
database runtime; or about 17 years. Seeing that a cluster can be
`pg_upgrade`d (which doesn't reset cluster LSN) since PG 9.0 from at
least version PG 8.4.0 (2009) (and through pg_migrator, from 8.3.0),
we can assume that clusters hitting LSN=2^63 will be a reasonable
possibility within the next few years. As the lifespan of a PG release
is about 5 years, it doesn't seem impossible that there will be actual
clusters that are going to hit this naturally in the lifespan of PG15.

It is also possible that someone fat-fingers pg_resetwal and creates
a cluster with LSN >= 2^63, resulting in negative values in the
s.param3 field. Not likely, but we can force such situations, and as
such we should handle that gracefully.

> 4) Can't you use timestamptz_in(to_char(s.param4)) instead of
> pg_stat_get_progress_checkpoint_start_time? I don't quite understand
> the reasoning for having this function and it's named as *checkpoint*
> when it doesn't do anything specific to the checkpoint at all?

I hadn't thought of using the types' inout functions, but it looks
like timestamp IO functions use a formatted timestring, which won't
work with the epoch-based timestamp stored in the view.

> Having 3 unnecessary functions that aren't useful to the users at all
> in proc.dat will simply eat up the function oids IMO. Hence, I suggest
> let's try to do without extra functions.

I agree that (1) could be simplified, or at least fully expressed in
SQL without exposing too many internals. If we're fine with exposing
internals like flags and type layouts, then (2), and arguably (4), can
be expressed in SQL as well.

-Matthias
>>> As mentioned upthread, there can be multiple backends that request a
>>> checkpoint, so unless we want to store an array of pid we should store a number
>>> of backend that are waiting for a new checkpoint.
>>
>> It's a good metric to show in the view but the information is not
>> readily available. Additional code is required to calculate the number
>> of requests. Is it worth doing that? I feel this can be added later if
>> required.
>
> Is it that hard or costly to do? Just sending a message to increment
> the stat counter in RequestCheckpoint() would be enough.
>
> Also, unless I'm missing something it's still only showing the initial
> checkpoint flags, so it's *not* showing what the checkpoint is really
> doing, only what the checkpoint may be doing if nothing else happens.
> It just feels wrong. You could even use that ckpt_flags info to know
> that at least one backend has requested a new checkpoint, if you don't
> want to have a number of backends.
I just wanted to avoid extra calculations just to show the progress in
the view. Since it's a good metric, I have added an additional field
named 'next_flags' to the view which holds all possible flag values of
the next checkpoint. This gives more information than just saying
whether the new checkpoint is requested or not with the same memory. I
am updating the progress of 'next_flags' in
ImmediateCheckpointRequested() which gets called during buffer write
phase. I thought about updating the progress in other places as well,
but I feel updating it in ImmediateCheckpointRequested() is enough, as
the current checkpoint's behaviour is affected only by the
CHECKPOINT_IMMEDIATE flag, and all the other checkpoint requests made
in createdb(), dropdb(), etc. are issued with the
CHECKPOINT_IMMEDIATE flag. I have updated this in the v5 patch. Please
share your thoughts.
Thanks & Regards,
Nitin Jadhav
On Thu, Mar 3, 2022 at 11:58 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> On Wed, Mar 2, 2022 at 7:15 PM Nitin Jadhav
> <nitinjadhavpostgres@gmail.com> wrote:
>
> [...]
On Tue, Mar 8, 2022 at 8:31 PM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
[local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
-[ RECORD 1 ]-----+-------------------------------------
pid | 22043
type | checkpoint
kind              | immediate force wait requested time

I think the output in the kind column can be displayed as {immediate,
force, wait, requested, time}. By the way, these are all checkpoint
flags, so it is better to display them as checkpoint flags instead of
checkpoint kind, as mentioned in one of my previous comments.

I will update in the next patch.
The current format matches with the server log message for the
checkpoint start in LogCheckpointStart(). Just to be consistent, I
have not changed the code.
See below how flags are shown in other SQL functions:
ashu@postgres=# select * from heap_tuple_infomask_flags(2304, 1);
raw_flags | combined_flags
-----------------------------------------+----------------
{HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
(1 row)
This looks more readable and is easier to understand for the
end-users. Further, comparing the way log messages are displayed with
the way SQL functions display their output doesn't look like the right
comparison to me. Obviously both should show matching data, but the way
it is shown doesn't need to be the same. In fact it is not in most
cases.
I have taken care of the rest of the comments in v5 patch for which
there was clarity.
Thank you very much. Will take a look at it later.
--
With Regards,
Ashutosh Sharma.
On Tue, Mar 08, 2022 at 08:57:23PM +0530, Nitin Jadhav wrote:
I just wanted to avoid extra calculations just to show the progress in
the view. Since it's a good metric, I have added an additional field
named 'next_flags' to the view which holds all possible flag values of
the next checkpoint.
I still don't think that's ok. IIUC the only way to know if the current
checkpoint is throttled or not is to be aware that the "next_flags" can apply
to the current checkpoint too, look for it and see if that changes the
semantics of what the view says the current checkpoint is. Most users will get
it wrong.
This gives more information than just saying
whether the new checkpoint is requested or not with the same memory.
So that next_flags will be empty most of the time? It seems confusing.
Again I would just display a bool flag saying whether a new checkpoint has been
explicitly requested or not, it seems enough.
If you're interested in that next checkpoint, you probably want a quick
completion of the current checkpoint first (and thus need to know if it's
throttled or not). And then you will have to keep monitoring that view for the
next checkpoint anyway, and at that point the view will show the relevant
information.
The current format matches with the server log message for the
checkpoint start in LogCheckpointStart(). Just to be consistent, I
have not changed the code.

See below how flags are shown in other SQL functions:
ashu@postgres=# select * from heap_tuple_infomask_flags(2304, 1);
raw_flags | combined_flags
-----------------------------------------+----------------
{HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
(1 row)

This looks more readable and is easier to understand for the
end-users. Further, comparing the way log messages are displayed with
the way SQL functions display their output doesn't look like the right
comparison to me. Obviously both should show matching data, but the way
it is shown doesn't need to be the same. In fact it is not in most
cases.
Ok. I will take care of this in the next patch. I would like to handle it at
the SQL level in system_views.sql. The following can be used to
display the flags in the format described above:
( '{' ||
  CASE WHEN (S.param2 & 4) > 0 THEN 'immediate' ELSE '' END ||
  CASE WHEN (S.param2 & 4) > 0 AND (S.param2 & -8) > 0 THEN ', ' ELSE '' END ||
  CASE WHEN (S.param2 & 8) > 0 THEN 'force' ELSE '' END ||
  CASE WHEN (S.param2 & 8) > 0 AND (S.param2 & -16) > 0 THEN ', ' ELSE '' END ||
  CASE WHEN (S.param2 & 16) > 0 THEN 'flush-all' ELSE '' END ||
  CASE WHEN (S.param2 & 16) > 0 AND (S.param2 & -32) > 0 THEN ', ' ELSE '' END ||
  CASE WHEN (S.param2 & 32) > 0 THEN 'wait' ELSE '' END ||
  CASE WHEN (S.param2 & 32) > 0 AND (S.param2 & -128) > 0 THEN ', ' ELSE '' END ||
  CASE WHEN (S.param2 & 128) > 0 THEN 'wal' ELSE '' END ||
  CASE WHEN (S.param2 & 128) > 0 AND (S.param2 & -256) > 0 THEN ', ' ELSE '' END ||
  CASE WHEN (S.param2 & 256) > 0 THEN 'time' ELSE '' END ||
  '}' )
Basically, a separate CASE expression decides whether a comma has to be
printed: it checks whether the previous flag bit is set (so that flag's
name has been displayed) and whether any higher bits are set (so more
flag names are still to be displayed). Kindly let me know if you know of
a better approach.
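For comparison, the per-flag comma logic can be avoided entirely by collecting only the names of the set bits and joining them with a separator (in SQL, concat_ws() or array_to_string() over a text array has the same NULL-skipping effect). A sketch of that approach in Python, using the bit values from the snippet above:

```python
# Decode checkpoint flag bits (values taken from the SQL snippet above)
# into the '{a, b, c}' display format.  Joining only the matched names
# makes the separator handling automatic: no per-flag comma logic.
FLAG_NAMES = [
    (4, 'immediate'),
    (8, 'force'),
    (16, 'flush-all'),
    (32, 'wait'),
    (128, 'wal'),
    (256, 'time'),
]

def describe_flags(flags: int) -> str:
    """Return the set flag names in '{name, name, ...}' form."""
    names = [name for bit, name in FLAG_NAMES if flags & bit]
    return '{' + ', '.join(names) + '}'
```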
Thanks & Regards,
Nitin Jadhav
I just wanted to avoid extra calculations just to show the progress in
the view. Since it's a good metric, I have added an additional field
named 'next_flags' to the view which holds all possible flag values of
the next checkpoint.

I still don't think that's ok. IIUC the only way to know if the current
checkpoint is throttled or not is to be aware that the "next_flags" can apply
to the current checkpoint too, look for it, and see if that changes the
semantics of what the view says the current checkpoint is. Most users will get
it wrong.

Again I would just display a bool flag saying whether a new checkpoint has been
explicitly requested or not, it seems enough.
Ok. I agree that it is difficult to interpret correctly. Even if we
say that a new checkpoint has been explicitly requested, the user may
not understand that it affects the current checkpoint's behaviour unless
the user knows the internals of the checkpoint. How about naming the
field 'throttled' (Yes/No), since our objective is to show whether the
current checkpoint is throttled or not?
Thanks & Regards,
Nitin Jadhav
On Fri, Mar 11, 2022 at 02:41:23PM +0530, Nitin Jadhav wrote:
Ok. I agree that it is difficult to interpret it correctly. So even if
say that a new checkpoint has been explicitly requested, the user may
not understand that it affects current checkpoint behaviour unless the
user knows the internals of the checkpoint. How about naming the field
to 'throttled' (Yes/ No) since our objective is to show that the
current checkpoint is throttled or not.
-1
That "throttled" flag should be the same as having or not a "force" in the
flags. We should be consistent and report information the same way, so either
a lot of flags (is_throttled, is_force...) or as now a single field containing
the set flags, so the current approach seems better. Also, it wouldn't be much
better to show the checkpoint as not having the force flags and still not being
throttled.
Why not just reporting (ckpt_flags & (CHECKPOINT_REQUESTED |
CHECKPOINT_IMMEDIATE)) in the path(s) that can update the new flags for the
view?
CHECKPOINT_REQUESTED will always be set by RequestCheckpoint(), and can be used
to detect that someone wants a new checkpoint afterwards, whatever it is and
whether or not the current checkpoint should finish quickly. For this flag I
think it's better to not report it in the view flags but with a new field, as
discussed before, as it's really what it means.
CHECKPOINT_IMMEDIATE is the only new flag that can be used in an already in
progress checkpoint, so it can be simply added to the view flags.
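A rough sketch (in Python, with hypothetical names; only the flag values match those used elsewhere in this thread) of what reporting just those two bits could look like:

```python
# Sketch (not PostgreSQL source) of the suggestion above: of a pending
# request's flags, only two bits matter for an in-progress checkpoint.
# CHECKPOINT_REQUESTED feeds a separate "new checkpoint requested" field,
# while CHECKPOINT_IMMEDIATE is folded into the view's flags, since it is
# the only new flag that changes the current checkpoint's behaviour.
CHECKPOINT_IMMEDIATE = 0x0004  # illustrative values matching the thread
CHECKPOINT_REQUESTED = 0x0040

def view_update_for_new_request(pending_flags: int) -> dict:
    """Reduce a pending request's flags to what the progress view reports."""
    relevant = pending_flags & (CHECKPOINT_REQUESTED | CHECKPOINT_IMMEDIATE)
    return {
        'new_checkpoint_requested': bool(relevant & CHECKPOINT_REQUESTED),
        'add_to_view_flags': relevant & CHECKPOINT_IMMEDIATE,
    }
```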
Ok. I agree that it is difficult to interpret it correctly. So even if
say that a new checkpoint has been explicitly requested, the user may
not understand that it affects current checkpoint behaviour unless the
user knows the internals of the checkpoint. How about naming the field
to 'throttled' (Yes/ No) since our objective is to show that the
current checkpoint is throttled or not.

-1
That "throttled" flag should be the same as having or not a "force" in the
flags. We should be consistent and report information the same way, so either
a lot of flags (is_throttled, is_force...) or as now a single field containing
the set flags, so the current approach seems better. Also, it wouldn't be much
better to show the checkpoint as not having the force flags and still not being
throttled.
I think your understanding is wrong here. The flag which affects
throttling behaviour is CHECKPOINT_IMMEDIATE. I am not suggesting
removing the existing 'flags' field of pg_stat_progress_checkpoint
view and adding a new field 'throttled'. The content of the 'flags'
field remains the same. I was suggesting replacing the 'next_flags'
field with 'throttled' field since the new request with
CHECKPOINT_IMMEDIATE flag enabled will affect the current checkpoint.
CHECKPOINT_REQUESTED will always be set by RequestCheckpoint(), and can be used
to detect that someone wants a new checkpoint afterwards, whatever it is and
whether or not the current checkpoint should finish quickly. For this flag I
think it's better to not report it in the view flags but with a new field, as
discussed before, as it's really what it means.
I understand your suggestion of adding a new field to indicate whether
any new requests have been made. Do you want this field to represent
only a new request, or also that the current checkpoint should finish
quickly?
CHECKPOINT_IMMEDIATE is the only new flag that can be used in an already in
progress checkpoint, so it can be simply added to the view flags.
As discussed upthread, it is not advisable to do so. The content of
'flags' remains the same throughout the checkpoint. We cannot add a new
checkpoint's flag (CHECKPOINT_IMMEDIATE) to the current one even
though it affects the current checkpoint's behaviour. The only thing we
can do is add a new field to show that the current checkpoint is
affected by new requests.
Why not just reporting (ckpt_flags & (CHECKPOINT_REQUESTED |
CHECKPOINT_IMMEDIATE)) in the path(s) that can update the new flags for the
view?
Where do you want to add this in the path?
I feel the new field name is confusing here.

'next_flags' - It shows all the flag values of the next checkpoint.
Based on this, the user can tell that a new request has been made,
and if CHECKPOINT_IMMEDIATE is set here, it also indicates
that the current checkpoint is affected. You are not ok with this
name as it confuses the user.

'throttled' - The value will be set to Yes/No based on the
CHECKPOINT_IMMEDIATE bit in the new checkpoint request's flags.
This says that the current checkpoint is affected, and I also thought
it would indicate that new requests have been made. But there is
confusion here too: if the current checkpoint starts with
CHECKPOINT_IMMEDIATE (which is described by the 'flags' field) and there
is no new request, then the value of this field is 'Yes' (not
throttling), which again confuses the user.

'new request' - The value will be set to Yes/No based on whether any
new checkpoint request has been made. This just indicates whether new
requests have been made or not; it cannot be used to infer other
information.

Thoughts?
Thanks & Regards,
Nitin Jadhav
On Fri, Mar 11, 2022 at 04:59:11PM +0530, Nitin Jadhav wrote:
That "throttled" flag should be the same as having or not a "force" in the
flags. We should be consistent and report information the same way, so either
a lot of flags (is_throttled, is_force...) or as now a single field containing
the set flags, so the current approach seems better. Also, it wouldn't be much
better to show the checkpoint as not having the force flags and still not being
throttled.

I think your understanding is wrong here. The flag which affects
throttling behaviour is CHECKPOINT_IMMEDIATE.
Yes sorry, that's what I meant and later used in the flags.
I am not suggesting
removing the existing 'flags' field of pg_stat_progress_checkpoint
view and adding a new field 'throttled'. The content of the 'flags'
field remains the same. I was suggesting replacing the 'next_flags'
field with 'throttled' field since the new request with
CHECKPOINT_IMMEDIATE flag enabled will affect the current checkpoint.
Are you saying that this new throttled flag will only be set by the overloaded
flags in ckpt_flags? So you can have a checkpoint with a CHECKPOINT_IMMEDIATE
flags that's throttled, and a checkpoint without the CHECKPOINT_IMMEDIATE flag
that's not throttled?
CHECKPOINT_REQUESTED will always be set by RequestCheckpoint(), and can be used
to detect that someone wants a new checkpoint afterwards, whatever it is and
whether or not the current checkpoint should finish quickly. For this flag I
think it's better to not report it in the view flags but with a new field, as
discussed before, as it's really what it means.

I understand your suggestion of adding a new field to indicate whether
any of the new requests have been made or not. You just want this
field to represent only a new request or does it also represent the
current checkpoint to finish quickly.
Only represent what it means: a new checkpoint is requested. An additional
CHECKPOINT_IMMEDIATE flag is orthogonal to this flag and this information.
CHECKPOINT_IMMEDIATE is the only new flag that can be used in an already in
progress checkpoint, so it can be simply added to the view flags.

As discussed upthread, it is not advisable to do so. The content of
'flags' remains the same through the checkpoint. We cannot add a new
checkpoint's flag (CHECKPOINT_IMMEDIATE ) to the current one even
though it affects current checkpoint behaviour. Only thing we can do
is to add a new field to show that the current checkpoint is affected
with new requests.
I don't get it. The checkpoint flags and the view flags (set by
pgstat_progress_update*) are different, so why can't we add this flag to the
view flags? The fact that checkpointer.c doesn't update the passed flags and
instead looks in shmem to see if CHECKPOINT_IMMEDIATE has been set since is
an implementation detail, and the view shouldn't focus on which flags were
initially passed to the checkpointer but instead which flags the checkpointer
is actually enforcing, as that's what the user should be interested in. If you
want to store it in another field internally but display it in the view with
the rest of the flags, I'm fine with it.
Why not just reporting (ckpt_flags & (CHECKPOINT_REQUESTED |
CHECKPOINT_IMMEDIATE)) in the path(s) that can update the new flags for the
view?

Where do you want to add this in the path?
Same as in your current patch I guess.
I am not suggesting
removing the existing 'flags' field of pg_stat_progress_checkpoint
view and adding a new field 'throttled'. The content of the 'flags'
field remains the same. I was suggesting replacing the 'next_flags'
field with 'throttled' field since the new request with
CHECKPOINT_IMMEDIATE flag enabled will affect the current checkpoint.

Are you saying that this new throttled flag will only be set by the overloaded
flags in ckpt_flags?
Yes, you are right.
So you can have a checkpoint with a CHECKPOINT_IMMEDIATE
flags that's throttled, and a checkpoint without the CHECKPOINT_IMMEDIATE flag
that's not throttled?
I think it's the reverse: a checkpoint with the CHECKPOINT_IMMEDIATE
flag is not throttled (it disables the delays between writes) and a
checkpoint without the CHECKPOINT_IMMEDIATE flag is throttled
(it enables the delays between writes).
CHECKPOINT_REQUESTED will always be set by RequestCheckpoint(), and can be used
to detect that someone wants a new checkpoint afterwards, whatever it is and
whether or not the current checkpoint should finish quickly. For this flag I
think it's better to not report it in the view flags but with a new field, as
discussed before, as it's really what it means.

I understand your suggestion of adding a new field to indicate whether
any of the new requests have been made or not. You just want this
field to represent only a new request or does it also represent the
current checkpoint to finish quickly.

Only represent what it means: a new checkpoint is requested. An additional
CHECKPOINT_IMMEDIATE flag is orthogonal to this flag and this information.
Thanks for the confirmation.
CHECKPOINT_IMMEDIATE is the only new flag that can be used in an already in
progress checkpoint, so it can be simply added to the view flags.

As discussed upthread, it is not advisable to do so. The content of
'flags' remains the same through the checkpoint. We cannot add a new
checkpoint's flag (CHECKPOINT_IMMEDIATE ) to the current one even
though it affects current checkpoint behaviour. Only thing we can do
is to add a new field to show that the current checkpoint is affected
with new requests.

I don't get it. The checkpoint flags and the view flags (set by
pgstat_progress_update*) are different, so why can't we add this flag to the
view flags? The fact that checkpointer.c doesn't update the passed flags and
instead looks in shmem to see if CHECKPOINT_IMMEDIATE has been set since is
an implementation detail, and the view shouldn't focus on which flags were
initially passed to the checkpointer but instead which flags the checkpointer
is actually enforcing, as that's what the user should be interested in. If you
want to store it in another field internally but display it in the view with
the rest of the flags, I'm fine with it.
Just to be in sync with the way the code behaves, it is better not to
fold the next checkpoint request's CHECKPOINT_IMMEDIATE into the
current checkpoint's 'flags' field. The current checkpoint
starts with a different set of flags, and when there is a new request
(with CHECKPOINT_IMMEDIATE), it just processes the pending operations
quickly to take up the next request. If we put this information in the
'flags' field of the view, it says that the current checkpoint was
started with CHECKPOINT_IMMEDIATE, which is not true. Hence I had
thought of adding a new field ('next flags' or 'upcoming flags') which
contains all the flag values of new checkpoint requests. This field
indicates whether the current checkpoint is throttled or not, and also
that there are new requests. Please share your thoughts.
Thanks & Regards,
Nitin Jadhav
On Mon, Mar 14, 2022 at 03:16:50PM +0530, Nitin Jadhav wrote:
I am not suggesting
removing the existing 'flags' field of pg_stat_progress_checkpoint
view and adding a new field 'throttled'. The content of the 'flags'
field remains the same. I was suggesting replacing the 'next_flags'
field with 'throttled' field since the new request with
CHECKPOINT_IMMEDIATE flag enabled will affect the current checkpoint.

Are you saying that this new throttled flag will only be set by the overloaded
flags in ckpt_flags?

Yes, you are right.
So you can have a checkpoint with a CHECKPOINT_IMMEDIATE
flags that's throttled, and a checkpoint without the CHECKPOINT_IMMEDIATE flag
that's not throttled?

I think it's the reverse. A checkpoint with a CHECKPOINT_IMMEDIATE
flags that's not throttled (disables delays between writes) and a
checkpoint without the CHECKPOINT_IMMEDIATE flag that's throttled
(enables delays between writes)
Yes that's how it's supposed to work, but my point was that your suggested
'throttled' flag could say the opposite, which is bad.
I don't get it. The checkpoint flags and the view flags (set by
pgstat_progress_update*) are different, so why can't we add this flag to the
view flags? The fact that checkpointer.c doesn't update the passed flags and
instead looks in shmem to see if CHECKPOINT_IMMEDIATE has been set since is
an implementation detail, and the view shouldn't focus on which flags were
initially passed to the checkpointer but instead which flags the checkpointer
is actually enforcing, as that's what the user should be interested in. If you
want to store it in another field internally but display it in the view with
the rest of the flags, I'm fine with it.

Just to be in sync with the way code behaves, it is better not to
update the next checkpoint request's CHECKPOINT_IMMEDIATE with the
current checkpoint 'flags' field. Because the current checkpoint
starts with a different set of flags and when there is a new request
(with CHECKPOINT_IMMEDIATE), it just processes the pending operations
quickly to take up next requests. If we update this information in the
'flags' field of the view, it says that the current checkpoint is
started with CHECKPOINT_IMMEDIATE which is not true.
Which is why I suggested to only take into account CHECKPOINT_REQUESTED (to
be able to display that a new checkpoint was requested) and
CHECKPOINT_IMMEDIATE, to be able to display that the current checkpoint isn't
throttled anymore if it were.
I still don't understand why you want so much to display "how the checkpoint
was initially started" rather than "how the checkpoint is really behaving right
now". The whole point of having a progress view is to have something dynamic
that reflects the current activity.
Hence I had
thought of adding a new field ('next flags' or 'upcoming flags') which
contain all the flag values of new checkpoint requests. This field
indicates whether the current checkpoint is throttled or not and also
it indicates there are new requests.
I'm not opposed to having such a field, I'm opposed to having a view with "the
current checkpoint is throttled but if there are some flags in the next
checkpoint flags and those flags contain checkpoint immediate then the current
checkpoint isn't actually throttled anymore" behavior.
I don't get it. The checkpoint flags and the view flags (set by
pgstat_progress_update*) are different, so why can't we add this flag to the
view flags? The fact that checkpointer.c doesn't update the passed flags and
instead looks in shmem to see if CHECKPOINT_IMMEDIATE has been set since is
an implementation detail, and the view shouldn't focus on which flags were
initially passed to the checkpointer but instead which flags the checkpointer
is actually enforcing, as that's what the user should be interested in. If you
want to store it in another field internally but display it in the view with
the rest of the flags, I'm fine with it.

Just to be in sync with the way code behaves, it is better not to
update the next checkpoint request's CHECKPOINT_IMMEDIATE with the
current checkpoint 'flags' field. Because the current checkpoint
starts with a different set of flags and when there is a new request
(with CHECKPOINT_IMMEDIATE), it just processes the pending operations
quickly to take up next requests. If we update this information in the
'flags' field of the view, it says that the current checkpoint is
started with CHECKPOINT_IMMEDIATE, which is not true.

Which is why I suggested to only take into account CHECKPOINT_REQUESTED (to
be able to display that a new checkpoint was requested)
I will take care in the next patch.
Hence I had
thought of adding a new field ('next flags' or 'upcoming flags') which
contain all the flag values of new checkpoint requests. This field
indicates whether the current checkpoint is throttled or not and also
it indicates there are new requests.

I'm not opposed to having such a field, I'm opposed to having a view with "the
current checkpoint is throttled but if there are some flags in the next
checkpoint flags and those flags contain checkpoint immediate then the current
checkpoint isn't actually throttled anymore" behavior.
I understand your point and I also agree that it becomes difficult for
the user to understand the context.
and CHECKPOINT_IMMEDIATE, to be able to display that the current checkpoint
isn't throttled anymore if it were.

I still don't understand why you want so much to display "how the checkpoint
was initially started" rather than "how the checkpoint is really behaving right
now". The whole point of having a progress view is to have something dynamic
that reflects the current activity.
For now I will not add this information to the view. If it is
required and nobody opposes including it in the 'flags' field
of the view, then I will consider adding it.
Thanks & Regards,
Nitin Jadhav
Show quoted text
On Mon, Mar 14, 2022 at 5:16 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
> On Mon, Mar 14, 2022 at 03:16:50PM +0530, Nitin Jadhav wrote:
>> I am not suggesting
>> removing the existing 'flags' field of pg_stat_progress_checkpoint
>> view and adding a new field 'throttled'. The content of the 'flags'
>> field remains the same. I was suggesting replacing the 'next_flags'
>> field with a 'throttled' field since a new request with the
>> CHECKPOINT_IMMEDIATE flag enabled will affect the current checkpoint.
>
> Are you saying that this new throttled flag will only be set by the overloaded
> flags in ckpt_flags?
>
>> Yes. You are right.
>
> So you can have a checkpoint with a CHECKPOINT_IMMEDIATE
> flag that's throttled, and a checkpoint without the CHECKPOINT_IMMEDIATE flag
> that's not throttled?
>
>> I think it's the reverse. A checkpoint with a CHECKPOINT_IMMEDIATE
>> flag is not throttled (it disables delays between writes) and a
>> checkpoint without the CHECKPOINT_IMMEDIATE flag is throttled
>> (it enables delays between writes).
>
> Yes, that's how it's supposed to work, but my point was that your suggested
> 'throttled' flag could say the opposite, which is bad.
>
> I don't get it. The checkpoint flags and the view flags (set by
> pgstat_progress_update*) are different, so why can't we add this flag to the
> view flags? The fact that checkpointer.c doesn't update the passed flag and
> instead looks in the shmem to see if CHECKPOINT_IMMEDIATE has been set since is
> an implementation detail, and the view shouldn't focus on which flags were
> initially passed to the checkpointer but instead which flags the checkpointer
> is actually enforcing, as that's what the user should be interested in. If you
> want to store it in another field internally but display it in the view with
> the rest of the flags, I'm fine with it.
>
>> Just to be in sync with the way the code behaves, it is better not to
>> update the next checkpoint request's CHECKPOINT_IMMEDIATE into the
>> current checkpoint's 'flags' field. Because the current checkpoint
>> starts with a different set of flags and when there is a new request
>> (with CHECKPOINT_IMMEDIATE), it just processes the pending operations
>> quickly to take up the next requests. If we update this information in the
>> 'flags' field of the view, it says that the current checkpoint was
>> started with CHECKPOINT_IMMEDIATE, which is not true.
>
> Which is why I suggested to only take into account CHECKPOINT_REQUESTED (to
> be able to display that a new checkpoint was requested) and
> CHECKPOINT_IMMEDIATE, to be able to display that the current checkpoint isn't
> throttled anymore if it were.
>
> I still don't understand why you want so much to display "how the checkpoint
> was initially started" rather than "how the checkpoint is really behaving right
> now". The whole point of having a progress view is to have something dynamic
> that reflects the current activity.
>
>> Hence I had
>> thought of adding a new field ('next flags' or 'upcoming flags') which
>> contains all the flag values of new checkpoint requests. This field
>> indicates whether the current checkpoint is throttled or not and also
>> it indicates there are new requests.
>
> I'm not opposed to having such a field, I'm opposed to having a view with "the
> current checkpoint is throttled but if there are some flags in the next
> checkpoint flags and those flags contain checkpoint immediate then the current
> checkpoint isn't actually throttled anymore" behavior.
Hi,
This is a long thread, sorry for asking if this has been asked before.
On 2022-03-08 20:25:28 +0530, Nitin Jadhav wrote:
* Sort buffers that need to be written to reduce the likelihood of random
@@ -2129,6 +2132,8 @@ BufferSync(int flags)
 		bufHdr = GetBufferDescriptor(buf_id);

 		num_processed++;
+		pgstat_progress_update_param(PROGRESS_CHECKPOINT_BUFFERS_PROCESSED,
+									 num_processed);

 		/*
 		 * We don't need to acquire the lock here, because we're only looking
@@ -2149,6 +2154,8 @@ BufferSync(int flags)
 			TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
 			PendingCheckpointerStats.m_buf_written_checkpoints++;
 			num_written++;
+			pgstat_progress_update_param(PROGRESS_CHECKPOINT_BUFFERS_WRITTEN,
+										 num_written);
 		}
 	}
Have you measured the performance effects of this? On fast storage with large
shared_buffers I've seen these loops in profiles. It's probably fine, but it'd
be good to verify that.
@@ -1897,6 +1897,112 @@ pg_stat_progress_basebackup| SELECT s.pid,
     s.param4 AS tablespaces_total,
     s.param5 AS tablespaces_streamed
    FROM pg_stat_get_progress_info('BASEBACKUP'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20);
+pg_stat_progress_checkpoint| SELECT s.pid,
+    CASE s.param1
+        WHEN 1 THEN 'checkpoint'::text
+        WHEN 2 THEN 'restartpoint'::text
+        ELSE NULL::text
+    END AS type,
+    (((((((CASE WHEN ((s.param2 & (1)::bigint) > 0) THEN 'shutdown '::text ELSE ''::text END ||
+      CASE WHEN ((s.param2 & (2)::bigint) > 0) THEN 'end-of-recovery '::text ELSE ''::text END) ||
+      CASE WHEN ((s.param2 & (4)::bigint) > 0) THEN 'immediate '::text ELSE ''::text END) ||
+      CASE WHEN ((s.param2 & (8)::bigint) > 0) THEN 'force '::text ELSE ''::text END) ||
+      CASE WHEN ((s.param2 & (16)::bigint) > 0) THEN 'flush-all '::text ELSE ''::text END) ||
+      CASE WHEN ((s.param2 & (32)::bigint) > 0) THEN 'wait '::text ELSE ''::text END) ||
+      CASE WHEN ((s.param2 & (128)::bigint) > 0) THEN 'wal '::text ELSE ''::text END) ||
+      CASE WHEN ((s.param2 & (256)::bigint) > 0) THEN 'time '::text ELSE ''::text END) AS flags,
+    (((((((CASE WHEN ((s.param3 & (1)::bigint) > 0) THEN 'shutdown '::text ELSE ''::text END ||
+      CASE WHEN ((s.param3 & (2)::bigint) > 0) THEN 'end-of-recovery '::text ELSE ''::text END) ||
+      CASE WHEN ((s.param3 & (4)::bigint) > 0) THEN 'immediate '::text ELSE ''::text END) ||
+      CASE WHEN ((s.param3 & (8)::bigint) > 0) THEN 'force '::text ELSE ''::text END) ||
+      CASE WHEN ((s.param3 & (16)::bigint) > 0) THEN 'flush-all '::text ELSE ''::text END) ||
+      CASE WHEN ((s.param3 & (32)::bigint) > 0) THEN 'wait '::text ELSE ''::text END) ||
+      CASE WHEN ((s.param3 & (128)::bigint) > 0) THEN 'wal '::text ELSE ''::text END) ||
+      CASE WHEN ((s.param3 & (256)::bigint) > 0) THEN 'time '::text ELSE ''::text END) AS next_flags,
+    ('0/0'::pg_lsn + (CASE WHEN (s.param4 < 0) THEN pow((2)::numeric, (64)::numeric) ELSE (0)::numeric END + (s.param4)::numeric)) AS start_lsn,
+    to_timestamp(((946684800)::double precision + ((s.param5)::double precision / (1000000)::double precision))) AS start_time,
+    CASE s.param6
+        WHEN 1 THEN 'initializing'::text
+        WHEN 2 THEN 'getting virtual transaction IDs'::text
+        WHEN 3 THEN 'checkpointing replication slots'::text
+        WHEN 4 THEN 'checkpointing logical replication snapshot files'::text
+        WHEN 5 THEN 'checkpointing logical rewrite mapping files'::text
+        WHEN 6 THEN 'checkpointing replication origin'::text
+        WHEN 7 THEN 'checkpointing commit log pages'::text
+        WHEN 8 THEN 'checkpointing commit time stamp pages'::text
+        WHEN 9 THEN 'checkpointing subtransaction pages'::text
+        WHEN 10 THEN 'checkpointing multixact pages'::text
+        WHEN 11 THEN 'checkpointing predicate lock pages'::text
+        WHEN 12 THEN 'checkpointing buffers'::text
+        WHEN 13 THEN 'processing file sync requests'::text
+        WHEN 14 THEN 'performing two phase checkpoint'::text
+        WHEN 15 THEN 'performing post checkpoint cleanup'::text
+        WHEN 16 THEN 'invalidating replication slots'::text
+        WHEN 17 THEN 'recycling old WAL files'::text
+        WHEN 18 THEN 'truncating subtransactions'::text
+        WHEN 19 THEN 'finalizing'::text
+        ELSE NULL::text
+    END AS phase,
+    s.param7 AS buffers_total,
+    s.param8 AS buffers_processed,
+    s.param9 AS buffers_written,
+    s.param10 AS files_total,
+    s.param11 AS files_synced
+   FROM pg_stat_get_progress_info('CHECKPOINT'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20);
 pg_stat_progress_cluster| SELECT s.pid,
     s.datid,
     d.datname,
This view is depressingly complicated. Added up, the view definitions for
the already existing pg_stat_progress* views account for a measurable part of
the size of an empty database:
postgres[1160866][1]=# SELECT sum(octet_length(ev_action)), SUM(pg_column_size(ev_action)) FROM pg_rewrite WHERE ev_class::regclass::text LIKE '%progress%';
┌───────┬───────┐
│ sum │ sum │
├───────┼───────┤
│ 97410 │ 19786 │
└───────┴───────┘
(1 row)
and this view looks to be a good bit more complicated than the existing
pg_stat_progress* views.
Indeed:
template1[1165473][1]=# SELECT ev_class::regclass, length(ev_action), pg_column_size(ev_action) FROM pg_rewrite WHERE ev_class::regclass::text LIKE '%progress%' ORDER BY length(ev_action) DESC;
┌───────────────────────────────┬────────┬────────────────┐
│ ev_class │ length │ pg_column_size │
├───────────────────────────────┼────────┼────────────────┤
│ pg_stat_progress_checkpoint │ 43290 │ 5409 │
│ pg_stat_progress_create_index │ 23293 │ 4177 │
│ pg_stat_progress_cluster │ 18390 │ 3704 │
│ pg_stat_progress_analyze │ 16121 │ 3339 │
│ pg_stat_progress_vacuum │ 16076 │ 3392 │
│ pg_stat_progress_copy │ 15124 │ 3080 │
│ pg_stat_progress_basebackup │ 8406 │ 2094 │
└───────────────────────────────┴────────┴────────────────┘
(7 rows)
pg_rewrite without pg_stat_progress_checkpoint: 745472, with: 753664
pg_rewrite is the second biggest relation in an empty database already...
template1[1164827][1]=# SELECT relname, pg_total_relation_size(oid) FROM pg_class WHERE relkind = 'r' ORDER BY 2 DESC LIMIT 5;
┌────────────────┬────────────────────────┐
│ relname │ pg_total_relation_size │
├────────────────┼────────────────────────┤
│ pg_proc │ 1212416 │
│ pg_rewrite │ 745472 │
│ pg_attribute │ 704512 │
│ pg_description │ 630784 │
│ pg_collation │ 409600 │
└────────────────┴────────────────────────┘
(5 rows)
Greetings,
Andres Freund
On Fri, Mar 18, 2022 at 05:15:56PM -0700, Andres Freund wrote:
> Have you measured the performance effects of this? On fast storage with large
> shared_buffers I've seen these loops in profiles. It's probably fine, but it'd
> be good to verify that.
I am wondering if we could make the function inlined at some point.
We could also play it safe and only update the counters every N loops
instead.
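The "update every N loops" idea can be sketched in plain C. This is only an illustration of the batching pattern, not actual backend code: progress_update() is a hypothetical stand-in for pgstat_progress_update_param(), and BATCH_SIZE is an arbitrary value:

```c
#include <assert.h>

/* Hypothetical stand-in for pgstat_progress_update_param(): records the
 * last value reported and counts how often reporting happens, so the
 * batching effect is observable. */
static long reported_value = -1;
static int  report_calls = 0;

static void
progress_update(long val)
{
    reported_value = val;
    report_calls++;
}

/* Report progress only every BATCH_SIZE buffers, plus once at the end,
 * so the per-buffer overhead in a BufferSync()-style loop stays
 * negligible while the view is at most BATCH_SIZE buffers stale. */
#define BATCH_SIZE 1024

static void
sync_buffers(long num_buffers)
{
    long num_processed = 0;

    for (long i = 0; i < num_buffers; i++)
    {
        num_processed++;
        if (num_processed % BATCH_SIZE == 0)
            progress_update(num_processed);
    }
    progress_update(num_processed); /* final, exact value */
}
```

With 2500 buffers this reports at 1024, 2048 and then the exact final count, i.e. three shared-memory writes instead of 2500.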
> This view is depressingly complicated. Added up the view definitions for
> the already existing pg_stat_progress* views add up to a measurable part of
> the size of an empty database:
Yeah. I think that what's proposed could be simplified, and we had
better remove the fields that are not that useful. First, do we have
any need for next_flags? Second, is the start LSN really necessary
for monitoring purposes? Not all the information in the first
parameter is useful either. For example, "shutdown" will never be
seen as it is not possible to use a session at this stage, no? There
is also no gain in having "immediate", "flush-all", "force" and "wait"
(for this last one, if the checkpoint is requested the session doing the
work knows this information already).

A last thing is that we may gain in visibility by having more
attributes as an effect of splitting param2. One thing that would make
sense is to track separately the reason why the checkpoint was
triggered (aka wal and time). Should we use a text[] to list all the
flags instead? A space-separated list of items is not intuitive IMO,
and callers of this routine will likely have to parse it.
Shouldn't we also track the number of files flushed in each sub-step?
In some deployments you could have a large number of 2PC files and
such. We may want more information on such matters.
+ WHEN 3 THEN 'checkpointing replication slots'
+ WHEN 4 THEN 'checkpointing logical replication snapshot files'
+ WHEN 5 THEN 'checkpointing logical rewrite mapping files'
+ WHEN 6 THEN 'checkpointing replication origin'
+ WHEN 7 THEN 'checkpointing commit log pages'
+ WHEN 8 THEN 'checkpointing commit time stamp pages'
There is a lot of "checkpointing" here. All those terms could be
shorter without losing their meaning.
This patch still needs some work, so I am marking it as RwF for now.
--
Michael
On Sat, 19 Mar 2022 at 01:15, Andres Freund <andres@anarazel.de> wrote:
> pg_rewrite without pg_stat_progress_checkpoint: 745472, with: 753664
> pg_rewrite is the second biggest relation in an empty database already...
Yeah, that's not great. Thanks for nerd-sniping me into looking into
how views and pg_rewrite rules work, that was very interesting and I
learned quite a lot.
# Immediate potential, limited to progress views
I noticed that the CASE-WHEN (used in translating a progress stage index
to a stage name) in those progress reporting views can be more
efficiently described (although with slightly worse behaviour around
undefined values) using text array lookups (as attached). That
resulted in somewhat smaller rewrite entries for the progress views
(toast compression was good old pglz):
template1=# SELECT sum(octet_length(ev_action)),
SUM(pg_column_size(ev_action)) FROM pg_rewrite WHERE
ev_class::regclass::text LIKE '%progress%';
master:
sum | sum
-------+-------
97277 | 19956
patched:
sum | sum
-------+-------
77069 | 18417
So this seems like a nice improvement of 20% uncompressed / 7% compressed.
I tested various cases of phase number to text translations: `CASE ..
WHEN`; `(ARRAY[]::text[])[index]` and `('{}'::text[])[index]`. See
results below:
postgres=# create or replace view arrayliteral_view as select
(ARRAY['a','b','c','d','e','f']::text[])[index] as name from tst
s(index);
CREATE VIEW
postgres=# create or replace view stringcast_view as select
('{a,b,c,d,e,f}'::text[])[index] as name from tst s(index);
CREATE VIEW
postgres=# create or replace view split_stringcast_view as select
(('{a,b,' || 'c,d,e,f}')::text[])[index] as name from tst s(index);
CREATE VIEW
postgres=# create or replace view case_view as select case index when
0 then 'a' when 1 then 'b' when 2 then 'c' when 3 then 'd' when 4 then
'e' when 5 then 'f' end as name from tst s(index);
CREATE VIEW
postgres=# select ev_class::regclass::text,
octet_length(ev_action), pg_column_size(ev_action) from pg_rewrite
where ev_class in ('arrayliteral_view'::regclass::oid,
'case_view'::regclass::oid, 'split_stringcast_view'::regclass::oid,
'stringcast_view'::regclass::oid);
ev_class | octet_length | pg_column_size
-----------------------+--------------+----------------
arrayliteral_view | 3311 | 1322
stringcast_view | 2610 | 1257
case_view | 5170 | 1412
split_stringcast_view | 2847 | 1350
It seems to me that we could consider replacing the CASE statements
with array literals and lookups if we really value our template
database size. But, as text literal concatenations don't seem to get
constant folded before storing them in the rules table, this rewrite
of the views would result in long lines in the system_views.sql file,
or we'd have to deal with the additional overhead of the append
operator and cast nodes.
# Future work; nodeToString / readNode, all rewrite rules
Additionally, we might want to consider other changes like default (or
empty value) elision in nodeToString, if that is considered a
reasonable option and if we really want to reduce the size of the
pg_rewrite table.
I think a lot of space can be recovered from that: A manual removal of
what seemed to be fields with default values (and the removal of all
query location related fields) in the current definition of
pg_stat_progress_create_index reduces its uncompressed size from
23226B raw and 4204B compressed to 13821B raw and 2784B compressed,
for an on-disk space saving of 33% for this view's ev_action.
Do note, however, that that would add significant branching in the
nodeToString and readNode code, which might slow down that code
significantly. I'm not planning on working on that; but in my opinion
that is a viable path to reducing the size of new database catalogs.
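The elision idea can be sketched as follows; the WRITE_INT_FIELD_NONDEFAULT macro and the flat string buffer are hypothetical simplifications of the real outfuncs.c machinery, and readNode would correspondingly have to treat a missing token as "use the default":

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Sketch of default-value elision for nodeToString(): emit a ":name value"
 * token only when the field differs from its default, so common cases
 * (e.g. the -1 "unknown" parse locations) take no space in pg_rewrite. */
#define WRITE_INT_FIELD_NONDEFAULT(buf, name, val, def) \
    do { \
        if ((val) != (def)) \
            sprintf((buf) + strlen(buf), " :%s %d", (name), (val)); \
    } while (0)

/* Serialize two illustrative fields whose default is -1: nothing is
 * written for a field still at its default. */
static void
serialize_example(char *buf, int location, int typmod)
{
    buf[0] = '\0';
    WRITE_INT_FIELD_NONDEFAULT(buf, "location", location, -1);
    WRITE_INT_FIELD_NONDEFAULT(buf, "typmod", typmod, -1);
}
```

A node consisting only of default-valued fields then serializes to an empty string, which is where the bulk of the 33% saving measured above would come from.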
-Matthias
PS. attached patch is not to be considered complete - it is a minimal
example of the array literal form. It fails regression tests because I
didn't bother updating or including the regression tests on system
views.
Attachments:
v0-Replace-system-view-CASE-WHEN-with-array-lookups.patch.txt (text/plain)
Index: src/backend/catalog/system_views.sql
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
--- a/src/backend/catalog/system_views.sql (revision 32723e5fabcc7db1bf4e897baaf0d251b500c1dc)
+++ b/src/backend/catalog/system_views.sql (date 1649160138886)
@@ -1120,13 +1120,7 @@
SELECT
S.pid AS pid, S.datid AS datid, D.datname AS datname,
CAST(S.relid AS oid) AS relid,
- CASE S.param1 WHEN 0 THEN 'initializing'
- WHEN 1 THEN 'acquiring sample rows'
- WHEN 2 THEN 'acquiring inherited sample rows'
- WHEN 3 THEN 'computing statistics'
- WHEN 4 THEN 'computing extended statistics'
- WHEN 5 THEN 'finalizing analyze'
- END AS phase,
+ (('{initializing,acquiring sample rows,acquiring inherited sample rows,computing statistics,computing extended statistics,finalizing analyze}')::text[])[S.param1 + 1] AS phase,
S.param2 AS sample_blks_total,
S.param3 AS sample_blks_scanned,
S.param4 AS ext_stats_total,
@@ -1141,14 +1135,8 @@
SELECT
S.pid AS pid, S.datid AS datid, D.datname AS datname,
S.relid AS relid,
- CASE S.param1 WHEN 0 THEN 'initializing'
- WHEN 1 THEN 'scanning heap'
- WHEN 2 THEN 'vacuuming indexes'
- WHEN 3 THEN 'vacuuming heap'
- WHEN 4 THEN 'cleaning up indexes'
- WHEN 5 THEN 'truncating heap'
- WHEN 6 THEN 'performing final cleanup'
- END AS phase,
+ (('{initializing,scanning heap,vacuuming indexes,vacuuming heap,cleaning up indexes,truncating heap,performing final cleanup}')::text[]
+ )[S.param1 + 1] AS phase,
S.param2 AS heap_blks_total, S.param3 AS heap_blks_scanned,
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
S.param6 AS max_dead_tuples, S.param7 AS num_dead_tuples
@@ -1161,18 +1149,8 @@
S.datid AS datid,
D.datname AS datname,
S.relid AS relid,
- CASE S.param1 WHEN 1 THEN 'CLUSTER'
- WHEN 2 THEN 'VACUUM FULL'
- END AS command,
- CASE S.param2 WHEN 0 THEN 'initializing'
- WHEN 1 THEN 'seq scanning heap'
- WHEN 2 THEN 'index scanning heap'
- WHEN 3 THEN 'sorting tuples'
- WHEN 4 THEN 'writing new heap'
- WHEN 5 THEN 'swapping relation files'
- WHEN 6 THEN 'rebuilding index'
- WHEN 7 THEN 'performing final cleanup'
- END AS phase,
+ ('{CLUSTER,VACUUM FULL}'::text[])[S.param1] AS command,
+ (('{initializing,seq scanning heap,index scanning heap,sorting tuples,writing new heap,swapping relation files,rebuilding index,performing final cleanup}')::text[])[S.param2 + 1] AS phase,
CAST(S.param3 AS oid) AS cluster_index_relid,
S.param4 AS heap_tuples_scanned,
S.param5 AS heap_tuples_written,
@@ -1187,24 +1165,12 @@
S.pid AS pid, S.datid AS datid, D.datname AS datname,
S.relid AS relid,
CAST(S.param7 AS oid) AS index_relid,
- CASE S.param1 WHEN 1 THEN 'CREATE INDEX'
- WHEN 2 THEN 'CREATE INDEX CONCURRENTLY'
- WHEN 3 THEN 'REINDEX'
- WHEN 4 THEN 'REINDEX CONCURRENTLY'
- END AS command,
- CASE S.param10 WHEN 0 THEN 'initializing'
- WHEN 1 THEN 'waiting for writers before build'
- WHEN 2 THEN 'building index' ||
+ ('{CREATE INDEX,CREATE INDEX CONCURRENTLY,REINDEX,REINDEX CONCURRENTLY}'::text[])[S.param1] AS command,
+ CASE S.param10 WHEN 2 THEN 'building index' ||
COALESCE((': ' || pg_indexam_progress_phasename(S.param9::oid, S.param11)),
'')
- WHEN 3 THEN 'waiting for writers before validation'
- WHEN 4 THEN 'index validation: scanning index'
- WHEN 5 THEN 'index validation: sorting tuples'
- WHEN 6 THEN 'index validation: scanning table'
- WHEN 7 THEN 'waiting for old snapshots'
- WHEN 8 THEN 'waiting for readers before marking dead'
- WHEN 9 THEN 'waiting for readers before dropping'
- END as phase,
+ ELSE ('{initializing,waiting for writers before build,NULL,waiting for writers before validation,index validation: scanning index,index validation: sorting tuples,index validation: scanning table,waiting for old snapshots,waiting for readers before marking dead,waiting for readers before dropping}'::text[])[S.param10 + 1]
+ END AS phase,
S.param4 AS lockers_total,
S.param5 AS lockers_done,
S.param6 AS current_locker_pid,
@@ -1220,13 +1186,8 @@
CREATE VIEW pg_stat_progress_basebackup AS
SELECT
S.pid AS pid,
- CASE S.param1 WHEN 0 THEN 'initializing'
- WHEN 1 THEN 'waiting for checkpoint to finish'
- WHEN 2 THEN 'estimating backup size'
- WHEN 3 THEN 'streaming database files'
- WHEN 4 THEN 'waiting for wal archiving to finish'
- WHEN 5 THEN 'transferring wal files'
- END AS phase,
+ ('{initializing,waiting for checkpoint to finish,estimating backup size,streaming database files,waiting for wal archiving to finish,transferring wal files}'::text[]
+ )[S.param1 + 1] AS phase,
CASE S.param2 WHEN -1 THEN NULL ELSE S.param2 END AS backup_total,
S.param3 AS backup_streamed,
S.param4 AS tablespaces_total,
@@ -1238,14 +1199,8 @@
SELECT
S.pid AS pid, S.datid AS datid, D.datname AS datname,
S.relid AS relid,
- CASE S.param5 WHEN 1 THEN 'COPY FROM'
- WHEN 2 THEN 'COPY TO'
- END AS command,
- CASE S.param6 WHEN 1 THEN 'FILE'
- WHEN 2 THEN 'PROGRAM'
- WHEN 3 THEN 'PIPE'
- WHEN 4 THEN 'CALLBACK'
- END AS "type",
+ ('{COPY FROM,COPY TO}'::text[])[S.param5] AS command,
+ ('{FILE,PROGRAM,PIPE,CALLBACK}'::text[])[S.param6] AS "type",
S.param1 AS bytes_processed,
S.param2 AS bytes_total,
S.param3 AS tuples_processed,
Matthias van de Meent <boekewurm+postgres@gmail.com> writes:
> But, as text literal concatenations don't seem to get constant folded
> before storing them in the rules table, this rewrite of the views
> would result in long lines in the system_views.sql file, or we'd have
> to deal with the additional overhead of the append operator and cast
> nodes.
There is no need to use the concatenation operator to split array
constants across multiple lines. Newlines are fine either inside the
string (between array elements), or between two string literals
(which become one string constant at parse time).
https://www.postgresql.org/docs/current/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS
ilmari@[local]:5432 ~=# select
ilmari@[local]:5432 ~-# '{foo,
ilmari@[local]:5432 ~'# bar}'::text[],
ilmari@[local]:5432 ~-# '{bar,'
ilmari@[local]:5432 ~-# 'baz}'::text[];
┌───────────┬───────────┐
│ text │ text │
├───────────┼───────────┤
│ {foo,bar} │ {bar,baz} │
└───────────┴───────────┘
(1 row)
- ilmari
On Fri, 8 Apr 2022 at 17:20, Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> wrote:
> Matthias van de Meent <boekewurm+postgres@gmail.com> writes:
>> But, as text literal concatenations don't seem to get constant folded
>> before storing them in the rules table, this rewrite of the views
>> would result in long lines in the system_views.sql file, or we'd have
>> to deal with the additional overhead of the append operator and cast
>> nodes.
>
> There is no need to use the concatenation operator to split array
> constants across multiple lines. Newlines are fine either inside the
> string (between array elements), or between two string literals
> (which become one string constant at parse time).
Ah, neat, that saves some long lines in the system_views file. I had
already tried auto-concatenating two consecutive string literals, but
that attempt failed in initdb, so I'm not sure what happened there.
Thanks!
-Matthias
Hi,
On April 8, 2022 7:52:07 AM PDT, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
> On Sat, 19 Mar 2022 at 01:15, Andres Freund <andres@anarazel.de> wrote:
>> pg_rewrite without pg_stat_progress_checkpoint: 745472, with: 753664
>> pg_rewrite is the second biggest relation in an empty database already...
>
> Yeah, that's not great. Thanks for nerd-sniping me into looking into
> how views and pg_rewrite rules work, that was very interesting and I
> learned quite a lot.
Thanks for looking!
> [...]
>
> It seems to me that we could consider replacing the CASE statements
> with array literals and lookups if we really value our template
> database size. But, as text literal concatenations don't seem to get
> constant folded before storing them in the rules table, this rewrite
> of the views would result in long lines in the system_views.sql file,
> or we'd have to deal with the additional overhead of the append
> operator and cast nodes.

My inclination is that the mapping functions should be C functions. There's really no point in doing it in SQL and it comes at a noticeable price. And, if done in C, we can fix mistakes in minor releases, which we can't in SQL.
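As a rough illustration of that C-function approach (the function and table names below are hypothetical, and the SQL-callable wrapper that a real patch would need is omitted; the phase strings are the ones from the pg_stat_progress_basebackup view quoted earlier), the lookup itself reduces to a bounds-checked array index:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Phase names for pg_stat_progress_basebackup, as listed in the view
 * definition quoted earlier in the thread. */
static const char *const basebackup_phase_names[] = {
    "initializing",
    "waiting for checkpoint to finish",
    "estimating backup size",
    "streaming database files",
    "waiting for wal archiving to finish",
    "transferring wal files",
};

/* Return the phase name, or NULL for an unknown phase number, matching
 * the behaviour of the CASE expression the view uses today. */
static const char *
basebackup_phase_name(long phase)
{
    long nphases = (long) (sizeof(basebackup_phase_names) /
                           sizeof(basebackup_phase_names[0]));

    if (phase < 0 || phase >= nphases)
        return NULL;
    return basebackup_phase_names[phase];
}
```

Since the table lives in the backend binary rather than in a stored rule tree, a wrong or missing phase name could be corrected in a minor release without a catalog change.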
> # Future work; nodeToString / readNode, all rewrite rules
>
> [...]
>
> Do note, however, that that would add significant branching in the
> nodeToString and readNode code, which might slow down that code
> significantly. I'm not planning on working on that; but in my opinion
> that is a viable path to reducing the size of new database catalogs.

We should definitely be careful about that. I do agree that there's a lot of efficiency to be gained in the serialization format. Once we have the automatic node func generation in place, we could have one representation for human consumption, and one for density...
Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Hi,

Here is the updated patch, which addresses the comments discussed
earlier in this thread. I am sorry for the long gap in the discussion.
Kindly let me know if I have missed any comments or if anything new
has come up.
Thanks & Regards,
Nitin Jadhav
On Fri, Mar 18, 2022 at 4:52 PM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
Show quoted text
I don't get it. The checkpoint flags and the view flags (set by
pgstat_progrss_update*) are different, so why can't we add this flag to the
view flags? The fact that checkpointer.c doesn't update the passed flag and
instead look in the shmem to see if CHECKPOINT_IMMEDIATE has been set since is
an implementation detail, and the view shouldn't focus on which flags were
initially passed to the checkpointer but instead which flags the checkpointer
is actually enforcing, as that's what the user should be interested in. If you
want to store it in another field internally but display it in the view with
the rest of the flags, I'm fine with it.Just to be in sync with the way code behaves, it is better not to
update the next checkpoint request's CHECKPOINT_IMMEDIATE with the
current checkpoint 'flags' field. Because the current checkpoint
starts with a different set of flags and when there is a new request
(with CHECKPOINT_IMMEDIATE), it just processes the pending operations
quickly to take up next requests. If we update this information in the
'flags' field of the view, it says that the current checkpoint is
started with CHECKPOINT_IMMEDIATE which is not true.Which is why I suggested to only take into account CHECKPOINT_REQUESTED (to
be able to display that a new checkpoint was requested)I will take care in the next patch.
Hence I had
thought of adding a new field ('next flags' or 'upcoming flags') which
contain all the flag values of new checkpoint requests. This field
indicates whether the current checkpoint is throttled or not and also
it indicates there are new requests.I'm not opposed to having such a field, I'm opposed to having a view with "the
current checkpoint is throttled but if there are some flags in the next
checkpoint flags and those flags contain checkpoint immediate then the current
checkpoint isn't actually throttled anymore" behavior.I understand your point and I also agree that it becomes difficult for
the user to understand the context.and
CHECKPOINT_IMMEDIATE, to be able to display that the current checkpoint isn't
throttled anymore if it were.I still don't understand why you want so much to display "how the checkpoint
was initially started" rather than "how the checkpoint is really behaving right
now". The whole point of having a progress view is to have something dynamic
that reflects the current activity.As of now I will not consider adding this information to the view. If
required and nobody opposes having this included in the 'flags' field
of the view, then I will consider adding.Thanks & Regards,
Nitin JadhavOn Mon, Mar 14, 2022 at 5:16 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
On Mon, Mar 14, 2022 at 03:16:50PM +0530, Nitin Jadhav wrote:
> I am not suggesting
> removing the existing 'flags' field of pg_stat_progress_checkpoint
> view and adding a new field 'throttled'. The content of the 'flags'
> field remains the same. I was suggesting replacing the 'next_flags'
> field with 'throttled' field since the new request with
> CHECKPOINT_IMMEDIATE flag enabled will affect the current checkpoint.
>
> > Are you saying that this new throttled flag will only be set by the
> > overloaded flags in ckpt_flags?
>
> Yes, you are right.
>
> > So you can have a checkpoint with a CHECKPOINT_IMMEDIATE flag that's
> > throttled, and a checkpoint without the CHECKPOINT_IMMEDIATE flag
> > that's not throttled?
>
> I think it's the reverse. A checkpoint with a CHECKPOINT_IMMEDIATE
> flag that's not throttled (disables delays between writes) and a
> checkpoint without the CHECKPOINT_IMMEDIATE flag that's throttled
> (enables delays between writes).

Yes that's how it's supposed to work, but my point was that your suggested
'throttled' flag could say the opposite, which is bad.

> > I don't get it. The checkpoint flags and the view flags (set by
> > pgstat_progress_update*) are different, so why can't we add this flag
> > to the view flags? The fact that checkpointer.c doesn't update the
> > passed flag and instead looks in the shmem to see if
> > CHECKPOINT_IMMEDIATE has been set since is an implementation detail,
> > and the view shouldn't focus on which flags were initially passed to
> > the checkpointer but instead which flags the checkpointer is actually
> > enforcing, as that's what the user should be interested in. If you
> > want to store it in another field internally but display it in the
> > view with the rest of the flags, I'm fine with it.
>
> Just to be in sync with the way the code behaves, it is better not to
> update the next checkpoint request's CHECKPOINT_IMMEDIATE with the
> current checkpoint 'flags' field. Because the current checkpoint
> starts with a different set of flags and when there is a new request
> (with CHECKPOINT_IMMEDIATE), it just processes the pending operations
> quickly to take up the next requests. If we update this information in
> the 'flags' field of the view, it says that the current checkpoint was
> started with CHECKPOINT_IMMEDIATE, which is not true.

Which is why I suggested to only take into account CHECKPOINT_REQUESTED (to
be able to display that a new checkpoint was requested) and
CHECKPOINT_IMMEDIATE, to be able to display that the current checkpoint isn't
throttled anymore if it were.

I still don't understand why you want so much to display "how the checkpoint
was initially started" rather than "how the checkpoint is really behaving right
now". The whole point of having a progress view is to have something dynamic
that reflects the current activity.

> Hence I had
> thought of adding a new field ('next flags' or 'upcoming flags') which
> contains all the flag values of new checkpoint requests. This field
> indicates whether the current checkpoint is throttled or not and also
> indicates that there are new requests.

I'm not opposed to having such a field, I'm opposed to having a view with "the
current checkpoint is throttled but if there are some flags in the next
checkpoint flags and those flags contain checkpoint immediate then the current
checkpoint isn't actually throttled anymore" behavior.
Attachments:
v6-0001-pg_stat_progress_checkpoint-view.patch
From 104184f6cba7dfe33eb710de8158b21e6937bf51 Mon Sep 17 00:00:00 2001
From: Nitin Jadhav <nitinjadhav@microsoft.com>
Date: Mon, 6 Jun 2022 05:55:06 +0000
Subject: [PATCH] pg_stat_progress_checkpoint-view
---
doc/src/sgml/monitoring.sgml | 405 +++++++++++++++++++++++++-
doc/src/sgml/ref/checkpoint.sgml | 7 +
doc/src/sgml/wal.sgml | 6 +-
src/backend/access/transam/xlog.c | 102 +++++++
src/backend/catalog/system_views.sql | 51 ++++
src/backend/postmaster/checkpointer.c | 15 +-
src/backend/storage/buffer/bufmgr.c | 7 +
src/backend/storage/sync/sync.c | 6 +
src/backend/utils/adt/pgstatfuncs.c | 2 +
src/include/commands/progress.h | 38 +++
src/include/utils/backend_progress.h | 3 +-
src/test/regress/expected/rules.out | 70 +++++
12 files changed, 706 insertions(+), 6 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 4549c2560e..5fe0ba4492 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -414,6 +414,13 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
See <xref linkend='copy-progress-reporting'/>.
</entry>
</row>
+
+ <row>
+ <entry><structname>pg_stat_progress_checkpoint</structname><indexterm><primary>pg_stat_progress_checkpoint</primary></indexterm></entry>
+ <entry>One row only, showing the progress of the checkpoint.
+ See <xref linkend='checkpoint-progress-reporting'/>.
+ </entry>
+ </row>
</tbody>
</tgroup>
</table>
@@ -5736,7 +5743,7 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
which support progress reporting are <command>ANALYZE</command>,
<command>CLUSTER</command>,
<command>CREATE INDEX</command>, <command>VACUUM</command>,
- <command>COPY</command>,
+ <command>COPY</command>, <command>CHECKPOINT</command>,
and <xref linkend="protocol-replication-base-backup"/> (i.e., replication
command that <xref linkend="app-pgbasebackup"/> issues to take
a base backup).
@@ -7024,6 +7031,402 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
</table>
</sect2>
+ <sect2 id="checkpoint-progress-reporting">
+ <title>Checkpoint Progress Reporting</title>
+
+ <indexterm>
+ <primary>pg_stat_progress_checkpoint</primary>
+ </indexterm>
+
+ <para>
+ Whenever the checkpoint operation is running, the
+ <structname>pg_stat_progress_checkpoint</structname> view will contain a
+ single row indicating the progress of the checkpoint. The tables below
+ describe the information that will be reported and provide information about
+ how to interpret it.
+ </para>
+
+ <table id="pg-stat-progress-checkpoint-view" xreflabel="pg_stat_progress_checkpoint">
+ <title><structname>pg_stat_progress_checkpoint</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>pid</structfield> <type>integer</type>
+ </para>
+ <para>
+ Process ID of the checkpointer process.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of the checkpoint. See <xref linkend="checkpoint-types"/>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>flags</structfield> <type>text</type>
+ </para>
+ <para>
+ Flags of the checkpoint. See <xref linkend="checkpoint-flags"/>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>start_lsn</structfield> <type>text</type>
+ </para>
+ <para>
+ The checkpoint start location.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>start_time</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Start time of the checkpoint.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>phase</structfield> <type>text</type>
+ </para>
+ <para>
+ Current processing phase. See <xref linkend="checkpoint-phases"/>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>buffers_total</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Total number of buffers to be written. This is estimated and reported
+ as of the beginning of the buffer write operation.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>buffers_processed</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers processed. This counter increases when the targeted
+ buffer is processed. This number will eventually become equal to
+ <literal>buffers_total</literal> when the checkpoint is
+ complete.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>buffers_written</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers written. This counter advances only when the targeted
+ buffer is written. Note that some buffers are processed but may not
+ need to be written, so this count will always be less than or equal
+ to <literal>buffers_total</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>files_total</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Total number of files to be synced. This is estimated and reported as of
+ the beginning of the sync operation.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>files_synced</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of files synced. This counter advances when the targeted file is
+ synced. This number will eventually become equal to
+ <literal>files_total</literal> when the checkpoint is complete.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>new_requests</structfield> <type>text</type>
+ </para>
+ <para>
+ True if any backend has requested a new checkpoint while the current
+ checkpoint is in progress, false otherwise.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <table id="checkpoint-types">
+ <title>Checkpoint Types</title>
+ <tgroup cols="2">
+ <colspec colname="col1" colwidth="1*"/>
+ <colspec colname="col2" colwidth="2*"/>
+ <thead>
+ <row>
+ <entry>Types</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry><literal>checkpoint</literal></entry>
+ <entry>
+ The current operation is a checkpoint.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>restartpoint</literal></entry>
+ <entry>
+ The current operation is a restartpoint.
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <table id="checkpoint-flags">
+ <title>Checkpoint Flags</title>
+ <tgroup cols="2">
+ <colspec colname="col1" colwidth="1*"/>
+ <colspec colname="col2" colwidth="2*"/>
+ <thead>
+ <row>
+ <entry>Flags</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry><literal>shutdown</literal></entry>
+ <entry>
+ The checkpoint is for shutdown.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>end-of-recovery</literal></entry>
+ <entry>
+ The checkpoint is for end-of-recovery.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>immediate</literal></entry>
+ <entry>
+ The checkpoint happens without delays.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>force</literal></entry>
+ <entry>
+ The checkpoint is started because some operation (for which the
+ checkpoint is necessary) forced a checkpoint.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>flush all</literal></entry>
+ <entry>
+ The checkpoint flushes all pages, including those belonging to unlogged
+ tables.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>wait</literal></entry>
+ <entry>
+ The operation which requested the checkpoint waits for its completion
+ before returning.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>requested</literal></entry>
+ <entry>
+ The checkpoint request has been made.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>wal</literal></entry>
+ <entry>
+ The checkpoint is started because <literal>max_wal_size</literal> is
+ reached.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>time</literal></entry>
+ <entry>
+ The checkpoint is started because <literal>checkpoint_timeout</literal>
+ expired.
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <table id="checkpoint-phases">
+ <title>Checkpoint Phases</title>
+ <tgroup cols="2">
+ <colspec colname="col1" colwidth="1*"/>
+ <colspec colname="col2" colwidth="2*"/>
+ <thead>
+ <row>
+ <entry>Phase</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry><literal>initializing</literal></entry>
+ <entry>
+ The checkpointer process is preparing to begin the checkpoint operation.
+ This phase is expected to be very brief.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>getting virtual transaction IDs</literal></entry>
+ <entry>
+ The checkpointer process is getting the virtual transaction IDs that
+ are delaying the checkpoint.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing replication slots</literal></entry>
+ <entry>
+ The checkpointer process is currently flushing all the replication slots
+ to disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing logical replication snapshot files</literal></entry>
+ <entry>
+ The checkpointer process is currently removing all the serialized
+ snapshot files that are not required anymore.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing logical rewrite mapping files</literal></entry>
+ <entry>
+ The checkpointer process is currently flushing the required logical
+ rewrite mapping files and removing those that are no longer needed.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing replication origin</literal></entry>
+ <entry>
+ The checkpointer process is currently performing a checkpoint of each
+ replication origin's progress with respect to the replayed remote LSN.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing commit log pages</literal></entry>
+ <entry>
+ The checkpointer process is currently writing commit log pages to disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing commit time stamp pages</literal></entry>
+ <entry>
+ The checkpointer process is currently writing commit time stamp pages to
+ disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing subtransaction pages</literal></entry>
+ <entry>
+ The checkpointer process is currently writing subtransaction pages to disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing multixact pages</literal></entry>
+ <entry>
+ The checkpointer process is currently writing multixact pages to disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing predicate lock pages</literal></entry>
+ <entry>
+ The checkpointer process is currently writing predicate lock pages to disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing buffers</literal></entry>
+ <entry>
+ The checkpointer process is currently writing buffers to disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>processing file sync requests</literal></entry>
+ <entry>
+ The checkpointer process is currently processing file sync requests.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>performing two phase checkpoint</literal></entry>
+ <entry>
+ The checkpointer process is currently performing a two-phase checkpoint.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>performing post checkpoint cleanup</literal></entry>
+ <entry>
+ The checkpointer process is currently performing post checkpoint cleanup.
+ It removes any lingering files that can be safely removed.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>invalidating replication slots</literal></entry>
+ <entry>
+ The checkpointer process is currently invalidating replication slots.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>recycling old WAL files</literal></entry>
+ <entry>
+ The checkpointer process is currently recycling old WAL files.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>truncating subtransactions</literal></entry>
+ <entry>
+ The checkpointer process is currently removing the subtransaction
+ segments.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>finalizing</literal></entry>
+ <entry>
+ The checkpointer process is finalizing the checkpoint operation.
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ </sect2>
+
</sect1>
<sect1 id="dynamic-trace">
diff --git a/doc/src/sgml/ref/checkpoint.sgml b/doc/src/sgml/ref/checkpoint.sgml
index 1cebc03d15..f33db50cfc 100644
--- a/doc/src/sgml/ref/checkpoint.sgml
+++ b/doc/src/sgml/ref/checkpoint.sgml
@@ -56,6 +56,13 @@ CHECKPOINT
the <link linkend="predefined-roles-table"><literal>pg_checkpointer</literal></link>
role can call <command>CHECKPOINT</command>.
</para>
+
+ <para>
+ The checkpointer process running the checkpoint will report its progress
+ in the <structname>pg_stat_progress_checkpoint</structname> view except for
+ the shutdown and end-of-recovery cases. See
+ <xref linkend="checkpoint-progress-reporting"/> for details.
+ </para>
</refsect1>
<refsect1>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index 4b6ef283c1..607f21dfd4 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -530,7 +530,11 @@
adjust the <xref linkend="guc-archive-timeout"/> parameter rather than the
checkpoint parameters.)
It is also possible to force a checkpoint by using the SQL
- command <command>CHECKPOINT</command>.
+ command <command>CHECKPOINT</command>. The checkpointer process running the
+ checkpoint will report its progress in the
+ <structname>pg_stat_progress_checkpoint</structname> view except for the
+ shutdown and end-of-recovery cases. See
+ <xref linkend="checkpoint-progress-reporting"/> for details.
</para>
<para>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 71136b11a2..8272d02b1e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -66,6 +66,7 @@
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
#include "catalog/pg_database.h"
+#include "commands/progress.h"
#include "common/controldata_utils.h"
#include "common/file_utils.h"
#include "executor/instrument.h"
@@ -695,6 +696,8 @@ static void WALInsertLockAcquireExclusive(void);
static void WALInsertLockRelease(void);
static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
+static void checkpoint_progress_start(int flags, int type);
+
/*
* Insert an XLOG record represented by an already-constructed chain of data
* chunks. This is a low-level routine; to construct the WAL record header
@@ -6429,6 +6432,9 @@ CreateCheckPoint(int flags)
XLogCtl->RedoRecPtr = checkPoint.redo;
SpinLockRelease(&XLogCtl->info_lck);
+ /* Prepare to report progress of the checkpoint. */
+ checkpoint_progress_start(flags, PROGRESS_CHECKPOINT_TYPE_CHECKPOINT);
+
/*
* If enabled, log checkpoint start. We postpone this until now so as not
* to log anything if we decided to skip the checkpoint.
@@ -6511,6 +6517,8 @@ CreateCheckPoint(int flags)
* clog and we will correctly flush the update below. So we cannot miss
* any xacts we need to wait for.
*/
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_GET_VIRTUAL_TRANSACTION_IDS);
vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_START);
if (nvxids > 0)
{
@@ -6626,6 +6634,8 @@ CreateCheckPoint(int flags)
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_POST_CHECKPOINT_CLEANUP);
SyncPostCheckpoint();
/*
@@ -6641,6 +6651,9 @@ CreateCheckPoint(int flags)
*/
XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
KeepLogSeg(recptr, &_logSegNo);
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_INVALIDATE_REPLI_SLOTS);
+
if (InvalidateObsoleteReplicationSlots(_logSegNo))
{
/*
@@ -6651,6 +6664,8 @@ CreateCheckPoint(int flags)
KeepLogSeg(recptr, &_logSegNo);
}
_logSegNo--;
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_RECYCLE_OLD_XLOG);
RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr,
checkPoint.ThisTimeLineID);
@@ -6669,11 +6684,21 @@ CreateCheckPoint(int flags)
* StartupSUBTRANS hasn't been called yet.
*/
if (!RecoveryInProgress())
+ {
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_TRUNCATE_SUBTRANS);
TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());
+ }
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_FINALIZE);
/* Real work is done; log and update stats. */
LogCheckpointEnd(false);
+ /* Stop reporting progress of the checkpoint. */
+ pgstat_progress_end_command();
+
/* Reset the process title */
update_checkpoint_display(flags, false, true);
@@ -6830,29 +6855,63 @@ static void
CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
CheckPointRelationMap();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_REPLI_SLOTS);
CheckPointReplicationSlots();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_SNAPSHOTS);
CheckPointSnapBuild();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_LOGICAL_REWRITE_MAPPINGS);
CheckPointLogicalRewriteHeap();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_REPLI_ORIGIN);
CheckPointReplicationOrigin();
/* Write out all dirty data in SLRUs and the main buffer pool */
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_CLOG_PAGES);
CheckPointCLOG();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_COMMITTS_PAGES);
CheckPointCommitTs();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_SUBTRANS_PAGES);
CheckPointSUBTRANS();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_MULTIXACT_PAGES);
CheckPointMultiXact();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_PREDICATE_LOCK_PAGES);
CheckPointPredicate();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_BUFFERS);
CheckPointBuffers(flags);
/* Perform all queued up fsyncs */
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_SYNC_FILES);
ProcessSyncRequests();
CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();
/* We deliberately delay 2PC checkpointing as long as possible */
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_TWO_PHASE);
CheckPointTwoPhase(checkPointRedo);
}
@@ -7002,6 +7061,9 @@ CreateRestartPoint(int flags)
MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
+ /* Prepare to report progress of the restartpoint. */
+ checkpoint_progress_start(flags, PROGRESS_CHECKPOINT_TYPE_RESTARTPOINT);
+
if (log_checkpoints)
LogCheckpointStart(flags, true);
@@ -7085,6 +7147,9 @@ CreateRestartPoint(int flags)
replayPtr = GetXLogReplayRecPtr(&replayTLI);
endptr = (receivePtr < replayPtr) ? replayPtr : receivePtr;
KeepLogSeg(endptr, &_logSegNo);
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_INVALIDATE_REPLI_SLOTS);
+
if (InvalidateObsoleteReplicationSlots(_logSegNo))
{
/*
@@ -7111,6 +7176,8 @@ CreateRestartPoint(int flags)
if (!RecoveryInProgress())
replayTLI = XLogCtl->InsertTimeLineID;
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_RECYCLE_OLD_XLOG);
RemoveOldXlogFiles(_logSegNo, RedoRecPtr, endptr, replayTLI);
/*
@@ -7127,11 +7194,20 @@ CreateRestartPoint(int flags)
* this because StartupSUBTRANS hasn't been called yet.
*/
if (EnableHotStandby)
+ {
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_TRUNCATE_SUBTRANS);
TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());
+ }
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_FINALIZE);
/* Real work is done; log and update stats. */
LogCheckpointEnd(true);
+ /* Stop reporting progress of the restartpoint. */
+ pgstat_progress_end_command();
+
/* Reset the process title */
update_checkpoint_display(flags, true, true);
@@ -8880,3 +8956,29 @@ SetWalWriterSleeping(bool sleeping)
XLogCtl->WalWriterSleeping = sleeping;
SpinLockRelease(&XLogCtl->info_lck);
}
+
+/*
+ * Start reporting progress of the checkpoint.
+ */
+static void
+checkpoint_progress_start(int flags, int type)
+{
+ const int index[] = {
+ PROGRESS_CHECKPOINT_TYPE,
+ PROGRESS_CHECKPOINT_FLAGS,
+ PROGRESS_CHECKPOINT_LSN,
+ PROGRESS_CHECKPOINT_START_TIMESTAMP,
+ PROGRESS_CHECKPOINT_PHASE
+ };
+ int64 val[5];
+
+ pgstat_progress_start_command(PROGRESS_COMMAND_CHECKPOINT, InvalidOid);
+
+ val[0] = type;
+ val[1] = flags;
+ val[2] = RedoRecPtr;
+ val[3] = CheckpointStats.ckpt_start_t;
+ val[4] = PROGRESS_CHECKPOINT_PHASE_INIT;
+
+ pgstat_progress_update_multi_param(5, index, val);
+}
\ No newline at end of file
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fedaed533b..20d029a547 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1265,6 +1265,57 @@ CREATE VIEW pg_stat_progress_copy AS
FROM pg_stat_get_progress_info('COPY') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
+CREATE VIEW pg_stat_progress_checkpoint AS
+ SELECT
+ S.pid AS pid,
+ CASE S.param1 WHEN 1 THEN 'checkpoint'
+ WHEN 2 THEN 'restartpoint'
+ END AS type,
+ ( CASE WHEN (S.param2 & 4) > 0 THEN 'immediate ' ELSE '' END ||
+ CASE WHEN (S.param2 & 8) > 0 THEN 'force ' ELSE '' END ||
+ CASE WHEN (S.param2 & 16) > 0 THEN 'flush-all ' ELSE '' END ||
+ CASE WHEN (S.param2 & 32) > 0 THEN 'wait ' ELSE '' END ||
+ CASE WHEN (S.param2 & 128) > 0 THEN 'wal ' ELSE '' END ||
+ CASE WHEN (S.param2 & 256) > 0 THEN 'time ' ELSE '' END
+ ) AS flags,
+ ( '0/0'::pg_lsn +
+ ((CASE
+ WHEN S.param3 < 0 THEN pow(2::numeric, 64::numeric)::numeric
+ ELSE 0::numeric
+ END) +
+ S.param3::numeric)
+ ) AS start_lsn,
+ to_timestamp(946684800 + (S.param4::float8 / 1000000)) AS start_time,
+ CASE S.param5 WHEN 1 THEN 'initializing'
+ WHEN 2 THEN 'getting virtual transaction IDs'
+ WHEN 3 THEN 'checkpointing replication slots'
+ WHEN 4 THEN 'checkpointing logical replication snapshot files'
+ WHEN 5 THEN 'checkpointing logical rewrite mapping files'
+ WHEN 6 THEN 'checkpointing replication origin'
+ WHEN 7 THEN 'checkpointing commit log pages'
+ WHEN 8 THEN 'checkpointing commit time stamp pages'
+ WHEN 9 THEN 'checkpointing subtransaction pages'
+ WHEN 10 THEN 'checkpointing multixact pages'
+ WHEN 11 THEN 'checkpointing predicate lock pages'
+ WHEN 12 THEN 'checkpointing buffers'
+ WHEN 13 THEN 'processing file sync requests'
+ WHEN 14 THEN 'performing two phase checkpoint'
+ WHEN 15 THEN 'performing post checkpoint cleanup'
+ WHEN 16 THEN 'invalidating replication slots'
+ WHEN 17 THEN 'recycling old WAL files'
+ WHEN 18 THEN 'truncating subtransactions'
+ WHEN 19 THEN 'finalizing'
+ END AS phase,
+ S.param6 AS buffers_total,
+ S.param7 AS buffers_processed,
+ S.param8 AS buffers_written,
+ S.param9 AS files_total,
+ S.param10 AS files_synced,
+ CASE S.param11 WHEN 0 THEN 'false'
+ WHEN 1 THEN 'true'
+ END AS new_requests
+ FROM pg_stat_get_progress_info('CHECKPOINT') AS S;
+
CREATE VIEW pg_user_mappings AS
SELECT
U.oid AS umid,
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index c937c39f50..79cae7e9ca 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -39,6 +39,7 @@
#include "access/xlog.h"
#include "access/xlog_internal.h"
#include "access/xlogrecovery.h"
+#include "commands/progress.h"
#include "libpq/pqsignal.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -163,7 +164,7 @@ static pg_time_t last_xlog_switch_time;
static void HandleCheckpointerInterrupts(void);
static void CheckArchiveTimeout(void);
static bool IsCheckpointOnSchedule(double progress);
-static bool ImmediateCheckpointRequested(void);
+static bool ImmediateCheckpointRequested(int flags);
static bool CompactCheckpointerRequestQueue(void);
static void UpdateSharedMemoryConfig(void);
@@ -667,16 +668,24 @@ CheckArchiveTimeout(void)
* there is one pending behind it.)
*/
static bool
-ImmediateCheckpointRequested(void)
+ImmediateCheckpointRequested(int flags)
{
volatile CheckpointerShmemStruct *cps = CheckpointerShmem;
+ if (cps->ckpt_flags & CHECKPOINT_REQUESTED)
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_NEW_REQUESTS, true);
+
/*
* We don't need to acquire the ckpt_lck in this case because we're only
* looking at a single flag bit.
*/
if (cps->ckpt_flags & CHECKPOINT_IMMEDIATE)
+ {
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_FLAGS,
+ (flags | CHECKPOINT_IMMEDIATE));
return true;
+ }
+
return false;
}
@@ -708,7 +717,7 @@ CheckpointWriteDelay(int flags, double progress)
*/
if (!(flags & CHECKPOINT_IMMEDIATE) &&
!ShutdownRequestPending &&
- !ImmediateCheckpointRequested() &&
+ !ImmediateCheckpointRequested(flags) &&
IsCheckpointOnSchedule(progress))
{
if (ConfigReloadPending)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index ae13011d27..55f03c1301 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -39,6 +39,7 @@
#include "catalog/catalog.h"
#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
+#include "commands/progress.h"
#include "executor/instrument.h"
#include "lib/binaryheap.h"
#include "miscadmin.h"
@@ -2019,6 +2020,8 @@ BufferSync(int flags)
WritebackContextInit(&wb_context, &checkpoint_flush_after);
TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_scan);
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_BUFFERS_TOTAL,
+ num_to_scan);
/*
* Sort buffers that need to be written to reduce the likelihood of random
@@ -2136,6 +2139,8 @@ BufferSync(int flags)
bufHdr = GetBufferDescriptor(buf_id);
num_processed++;
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_BUFFERS_PROCESSED,
+ num_processed);
/*
* We don't need to acquire the lock here, because we're only looking
@@ -2156,6 +2161,8 @@ BufferSync(int flags)
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
PendingCheckpointerStats.buf_written_checkpoints++;
num_written++;
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_BUFFERS_WRITTEN,
+ num_written);
}
}
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index e1fb631003..3acbf94c5e 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -23,6 +23,7 @@
#include "access/multixact.h"
#include "access/xlog.h"
#include "access/xlogutils.h"
+#include "commands/progress.h"
#include "commands/tablespace.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -368,6 +369,9 @@ ProcessSyncRequests(void)
/* Now scan the hashtable for fsync requests to process */
absorb_counter = FSYNCS_PER_ABSORB;
hash_seq_init(&hstat, pendingOps);
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_FILES_TOTAL,
+ hash_get_num_entries(pendingOps));
+
while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
{
int failures;
@@ -431,6 +435,8 @@ ProcessSyncRequests(void)
longest = elapsed;
total_elapsed += elapsed;
processed++;
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_FILES_SYNCED,
+ processed);
if (log_checkpoints)
elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f ms",
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 893690dad5..c0de766fde 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -476,6 +476,8 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
cmdtype = PROGRESS_COMMAND_BASEBACKUP;
else if (pg_strcasecmp(cmd, "COPY") == 0)
cmdtype = PROGRESS_COMMAND_COPY;
+ else if (pg_strcasecmp(cmd, "CHECKPOINT") == 0)
+ cmdtype = PROGRESS_COMMAND_CHECKPOINT;
else
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index a28938caf4..33a64d2f0b 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -151,4 +151,42 @@
#define PROGRESS_COPY_TYPE_PIPE 3
#define PROGRESS_COPY_TYPE_CALLBACK 4
+/* Progress parameters for checkpoint */
+#define PROGRESS_CHECKPOINT_TYPE 0
+#define PROGRESS_CHECKPOINT_FLAGS 1
+#define PROGRESS_CHECKPOINT_LSN 2
+#define PROGRESS_CHECKPOINT_START_TIMESTAMP 3
+#define PROGRESS_CHECKPOINT_PHASE 4
+#define PROGRESS_CHECKPOINT_BUFFERS_TOTAL 5
+#define PROGRESS_CHECKPOINT_BUFFERS_PROCESSED 6
+#define PROGRESS_CHECKPOINT_BUFFERS_WRITTEN 7
+#define PROGRESS_CHECKPOINT_FILES_TOTAL 8
+#define PROGRESS_CHECKPOINT_FILES_SYNCED 9
+#define PROGRESS_CHECKPOINT_NEW_REQUESTS 10
+
+/* Types of checkpoint (as advertised via PROGRESS_CHECKPOINT_TYPE) */
+#define PROGRESS_CHECKPOINT_TYPE_CHECKPOINT 1
+#define PROGRESS_CHECKPOINT_TYPE_RESTARTPOINT 2
+
+/* Phases of checkpoint (as advertised via PROGRESS_CHECKPOINT_PHASE) */
+#define PROGRESS_CHECKPOINT_PHASE_INIT 1
+#define PROGRESS_CHECKPOINT_PHASE_GET_VIRTUAL_TRANSACTION_IDS 2
+#define PROGRESS_CHECKPOINT_PHASE_REPLI_SLOTS 3
+#define PROGRESS_CHECKPOINT_PHASE_SNAPSHOTS 4
+#define PROGRESS_CHECKPOINT_PHASE_LOGICAL_REWRITE_MAPPINGS 5
+#define PROGRESS_CHECKPOINT_PHASE_REPLI_ORIGIN 6
+#define PROGRESS_CHECKPOINT_PHASE_CLOG_PAGES 7
+#define PROGRESS_CHECKPOINT_PHASE_COMMITTS_PAGES 8
+#define PROGRESS_CHECKPOINT_PHASE_SUBTRANS_PAGES 9
+#define PROGRESS_CHECKPOINT_PHASE_MULTIXACT_PAGES 10
+#define PROGRESS_CHECKPOINT_PHASE_PREDICATE_LOCK_PAGES 11
+#define PROGRESS_CHECKPOINT_PHASE_BUFFERS 12
+#define PROGRESS_CHECKPOINT_PHASE_SYNC_FILES 13
+#define PROGRESS_CHECKPOINT_PHASE_TWO_PHASE 14
+#define PROGRESS_CHECKPOINT_PHASE_POST_CHECKPOINT_CLEANUP 15
+#define PROGRESS_CHECKPOINT_PHASE_INVALIDATE_REPLI_SLOTS 16
+#define PROGRESS_CHECKPOINT_PHASE_RECYCLE_OLD_XLOG 17
+#define PROGRESS_CHECKPOINT_PHASE_TRUNCATE_SUBTRANS 18
+#define PROGRESS_CHECKPOINT_PHASE_FINALIZE 19
+
#endif
diff --git a/src/include/utils/backend_progress.h b/src/include/utils/backend_progress.h
index 47bf8029b0..02d51fb948 100644
--- a/src/include/utils/backend_progress.h
+++ b/src/include/utils/backend_progress.h
@@ -27,7 +27,8 @@ typedef enum ProgressCommandType
PROGRESS_COMMAND_CLUSTER,
PROGRESS_COMMAND_CREATE_INDEX,
PROGRESS_COMMAND_BASEBACKUP,
- PROGRESS_COMMAND_COPY
+ PROGRESS_COMMAND_COPY,
+ PROGRESS_COMMAND_CHECKPOINT
} ProgressCommandType;
#define PGSTAT_NUM_PROGRESS_PARAM 20
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index fc3cde3226..ec130dad2a 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1912,6 +1912,76 @@ pg_stat_progress_basebackup| SELECT s.pid,
s.param4 AS tablespaces_total,
s.param5 AS tablespaces_streamed
FROM pg_stat_get_progress_info('BASEBACKUP'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20);
+pg_stat_progress_checkpoint| SELECT s.pid,
+ CASE s.param1
+ WHEN 1 THEN 'checkpoint'::text
+ WHEN 2 THEN 'restartpoint'::text
+ ELSE NULL::text
+ END AS type,
+ (((((
+ CASE
+ WHEN ((s.param2 & (4)::bigint) > 0) THEN 'immediate '::text
+ ELSE ''::text
+ END ||
+ CASE
+ WHEN ((s.param2 & (8)::bigint) > 0) THEN 'force '::text
+ ELSE ''::text
+ END) ||
+ CASE
+ WHEN ((s.param2 & (16)::bigint) > 0) THEN 'flush-all '::text
+ ELSE ''::text
+ END) ||
+ CASE
+ WHEN ((s.param2 & (32)::bigint) > 0) THEN 'wait '::text
+ ELSE ''::text
+ END) ||
+ CASE
+ WHEN ((s.param2 & (128)::bigint) > 0) THEN 'wal '::text
+ ELSE ''::text
+ END) ||
+ CASE
+ WHEN ((s.param2 & (256)::bigint) > 0) THEN 'time '::text
+ ELSE ''::text
+ END) AS flags,
+ ('0/0'::pg_lsn + (
+ CASE
+ WHEN (s.param3 < 0) THEN pow((2)::numeric, (64)::numeric)
+ ELSE (0)::numeric
+ END + (s.param3)::numeric)) AS start_lsn,
+ to_timestamp(((946684800)::double precision + ((s.param4)::double precision / (1000000)::double precision))) AS start_time,
+ CASE s.param5
+ WHEN 1 THEN 'initializing'::text
+ WHEN 2 THEN 'getting virtual transaction IDs'::text
+ WHEN 3 THEN 'checkpointing replication slots'::text
+ WHEN 4 THEN 'checkpointing logical replication snapshot files'::text
+ WHEN 5 THEN 'checkpointing logical rewrite mapping files'::text
+ WHEN 6 THEN 'checkpointing replication origin'::text
+ WHEN 7 THEN 'checkpointing commit log pages'::text
+ WHEN 8 THEN 'checkpointing commit time stamp pages'::text
+ WHEN 9 THEN 'checkpointing subtransaction pages'::text
+ WHEN 10 THEN 'checkpointing multixact pages'::text
+ WHEN 11 THEN 'checkpointing predicate lock pages'::text
+ WHEN 12 THEN 'checkpointing buffers'::text
+ WHEN 13 THEN 'processing file sync requests'::text
+ WHEN 14 THEN 'performing two phase checkpoint'::text
+ WHEN 15 THEN 'performing post checkpoint cleanup'::text
+ WHEN 16 THEN 'invalidating replication slots'::text
+ WHEN 17 THEN 'recycling old WAL files'::text
+ WHEN 18 THEN 'truncating subtransactions'::text
+ WHEN 19 THEN 'finalizing'::text
+ ELSE NULL::text
+ END AS phase,
+ s.param6 AS buffers_total,
+ s.param7 AS buffers_processed,
+ s.param8 AS buffers_written,
+ s.param9 AS files_total,
+ s.param10 AS files_synced,
+ CASE s.param11
+ WHEN 0 THEN 'false'::text
+ WHEN 1 THEN 'true'::text
+ ELSE NULL::text
+ END AS new_requests
+ FROM pg_stat_get_progress_info('CHECKPOINT'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20);
pg_stat_progress_cluster| SELECT s.pid,
s.datid,
d.datname,
--
2.25.1
Have you measured the performance effects of this? On fast storage with large
shared_buffers I've seen these loops in profiles. It's probably fine, but it'd
be good to verify that.
To understand the performance effects of the above, I have taken the
average of five checkpoints with the patch and without the patch in my
environment. Here are the results.
With patch: 269.65 s
Without patch: 269.60 s
It looks fine. Please share your views.
This view is depressingly complicated. Added up the view definitions for
the already existing pg_stat_progress* views add up to a measurable part of
the size of an empty database:
Thank you so much for sharing the detailed analysis. We can remove a
few fields which are not so important to make it simple.
Thanks & Regards,
Nitin Jadhav
On Sat, Mar 19, 2022 at 5:45 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
This is a long thread, sorry for asking if this has been asked before.
On 2022-03-08 20:25:28 +0530, Nitin Jadhav wrote:
* Sort buffers that need to be written to reduce the likelihood of random
@@ -2129,6 +2132,8 @@ BufferSync(int flags)
bufHdr = GetBufferDescriptor(buf_id);

    num_processed++;
+   pgstat_progress_update_param(PROGRESS_CHECKPOINT_BUFFERS_PROCESSED,
+                                num_processed);

    /*
     * We don't need to acquire the lock here, because we're only looking
@@ -2149,6 +2154,8 @@ BufferSync(int flags)
    TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
    PendingCheckpointerStats.m_buf_written_checkpoints++;
    num_written++;
+   pgstat_progress_update_param(PROGRESS_CHECKPOINT_BUFFERS_WRITTEN,
+                                num_written);
    }
}

Have you measured the performance effects of this? On fast storage with large
shared_buffers I've seen these loops in profiles. It's probably fine, but it'd
be good to verify that.

@@ -1897,6 +1897,112 @@ pg_stat_progress_basebackup| SELECT s.pid,
[quoted rules.out hunk from an earlier patch version: the
pg_stat_progress_checkpoint view definition, which additionally decoded the
"shutdown" and "end-of-recovery" bits of s.param2 and exposed a "next_flags"
column built from s.param3 with an identical CASE chain, shifting the
remaining params by one]

This view is depressingly complicated. Added up the view definitions for
the already existing pg_stat_progress* views add up to a measurable part of
the size of an empty database:

postgres[1160866][1]=# SELECT sum(octet_length(ev_action)), SUM(pg_column_size(ev_action)) FROM pg_rewrite WHERE ev_class::regclass::text LIKE '%progress%';
┌───────┬───────┐
│ sum │ sum │
├───────┼───────┤
│ 97410 │ 19786 │
└───────┴───────┘
(1 row)

and this view looks to be a good bit more complicated than the existing
pg_stat_progress* views.

Indeed:
template1[1165473][1]=# SELECT ev_class::regclass, length(ev_action), pg_column_size(ev_action) FROM pg_rewrite WHERE ev_class::regclass::text LIKE '%progress%' ORDER BY length(ev_action) DESC;
┌───────────────────────────────┬────────┬────────────────┐
│ ev_class │ length │ pg_column_size │
├───────────────────────────────┼────────┼────────────────┤
│ pg_stat_progress_checkpoint │ 43290 │ 5409 │
│ pg_stat_progress_create_index │ 23293 │ 4177 │
│ pg_stat_progress_cluster │ 18390 │ 3704 │
│ pg_stat_progress_analyze │ 16121 │ 3339 │
│ pg_stat_progress_vacuum │ 16076 │ 3392 │
│ pg_stat_progress_copy │ 15124 │ 3080 │
│ pg_stat_progress_basebackup │ 8406 │ 2094 │
└───────────────────────────────┴────────┴────────────────┘
(7 rows)

pg_rewrite without pg_stat_progress_checkpoint: 745472, with: 753664
pg_rewrite is the second biggest relation in an empty database already...
template1[1164827][1]=# SELECT relname, pg_total_relation_size(oid) FROM pg_class WHERE relkind = 'r' ORDER BY 2 DESC LIMIT 5;
┌────────────────┬────────────────────────┐
│ relname │ pg_total_relation_size │
├────────────────┼────────────────────────┤
│ pg_proc │ 1212416 │
│ pg_rewrite │ 745472 │
│ pg_attribute │ 704512 │
│ pg_description │ 630784 │
│ pg_collation │ 409600 │
└────────────────┴────────────────────────┘
(5 rows)

Greetings,
Andres Freund
Have you measured the performance effects of this? On fast storage with large
shared_buffers I've seen these loops in profiles. It's probably fine, but it'd
be good to verify that.

I am wondering if we could make the function inlined at some point.
We could also play it safe and only update the counters every N loops
instead.
The idea looks good, but the performance numbers shared above show no
measurable impact, so we can keep the current approach as it gives more
accurate progress.
---
This view is depressingly complicated. Added up the view definitions for
the already existing pg_stat_progress* views add up to a measurable part of
the size of an empty database:

Yeah. I think that what's proposed could be simplified, and we had
better remove the fields that are not that useful. First, do we have
any need for next_flags?
"next_flags" is removed in the v6 patch. Added a "new_requests" field
to get to know whether the current checkpoint is being throttled or
not. Please share your views on this.
---
Second, is the start LSN really necessary
for monitoring purposes?
IMO, start LSN is necessary to debug if the checkpoint is taking longer.
---
Not all the information in the first
parameter is useful, as well. For example "shutdown" will never be
seen as it is not possible to use a session at this stage, no?
I understand that "shutdown" and "end-of-recovery" will never be seen
and I have removed it in the v6 patch.
---
There
is also no gain in having "immediate", "flush-all", "force" and "wait"
(for this one if the checkpoint is requested the session doing the
work knows this information already).
"immediate" is required to understand whether the current checkpoint
is throttled or not. I am not sure about other flags "flush-all",
"force" and "wait". I have just supported all the flags to match the
'checkpoint start' log message. Please share your views. If they are not
really required, I will remove them in the next patch.
---
A last thing is that we may gain in visibility by having more
attributes as an effect of splitting param2. On thing that would make
sense is to track the reason why the checkpoint was triggered
separately (aka wal and time). Should we use a text[] instead to list
all the parameters instead? Using a space-separated list of items is
not intuitive IMO, and callers of this routine will likely parse
that.
If I understand the above comment correctly, you are saying to
introduce a new field, say "reason" (possible values are either wal
or time), and the "flags" field will continue to represent the other
flags like "immediate", etc. The idea looks good here. We can
introduce new field "reason" and "flags" field can be renamed to
"throttled" (true/false) if we decide to not support other flags
"flush-all", "force" and "wait".
---
+ WHEN 3 THEN 'checkpointing replication slots'
+ WHEN 4 THEN 'checkpointing logical replication snapshot files'
+ WHEN 5 THEN 'checkpointing logical rewrite mapping files'
+ WHEN 6 THEN 'checkpointing replication origin'
+ WHEN 7 THEN 'checkpointing commit log pages'
+ WHEN 8 THEN 'checkpointing commit time stamp pages'

There is a lot of "checkpointing" here. All those terms could be shorter
without losing their meaning.
I will try to make them shorter in the next patch.
---
Please share your thoughts.
Thanks & Regards,
Nitin Jadhav
On Tue, Apr 5, 2022 at 3:15 PM Michael Paquier <michael@paquier.xyz> wrote:
On Fri, Mar 18, 2022 at 05:15:56PM -0700, Andres Freund wrote:
Have you measured the performance effects of this? On fast storage with large
shared_buffers I've seen these loops in profiles. It's probably fine, but it'd
be good to verify that.

I am wondering if we could make the function inlined at some point.
We could also play it safe and only update the counters every N loops
instead.

This view is depressingly complicated. Added up the view definitions for
the already existing pg_stat_progress* views add up to a measurable part of
the size of an empty database:

Yeah. I think that what's proposed could be simplified, and we had
better remove the fields that are not that useful. First, do we have
any need for next_flags? Second, is the start LSN really necessary
for monitoring purposes? Not all the information in the first
parameter is useful, as well. For example "shutdown" will never be
seen as it is not possible to use a session at this stage, no? There
is also no gain in having "immediate", "flush-all", "force" and "wait"
(for this one if the checkpoint is requested the session doing the
work knows this information already).

A last thing is that we may gain in visibility by having more
attributes as an effect of splitting param2. On thing that would make
sense is to track the reason why the checkpoint was triggered
separately (aka wal and time). Should we use a text[] instead to list
all the parameters instead? Using a space-separated list of items is
not intuitive IMO, and callers of this routine will likely parse
that.

Shouldn't we also track the number of files flushed in each sub-step?
In some deployments you could have a large number of 2PC files and
such. We may want more information on such matters.

+ WHEN 3 THEN 'checkpointing replication slots'
+ WHEN 4 THEN 'checkpointing logical replication snapshot files'
+ WHEN 5 THEN 'checkpointing logical rewrite mapping files'
+ WHEN 6 THEN 'checkpointing replication origin'
+ WHEN 7 THEN 'checkpointing commit log pages'
+ WHEN 8 THEN 'checkpointing commit time stamp pages'

There is a lot of "checkpointing" here. All those terms could be shorter
without losing their meaning.

This patch still needs some work, so I am marking it as RwF for now.
--
Michael
Hi,
On 2022-06-13 19:08:35 +0530, Nitin Jadhav wrote:
Have you measured the performance effects of this? On fast storage with large
shared_buffers I've seen these loops in profiles. It's probably fine, but it'd
be good to verify that.

To understand the performance effects of the above, I have taken the
average of five checkpoints with the patch and without the patch in my
environment. Here are the results.
With patch: 269.65 s
Without patch: 269.60 s
Those look like timed checkpoints - if the checkpoints are sleeping a
part of the time, you're not going to see any potential overhead.
To see whether this has an effect you'd have to make sure there's a
certain number of dirty buffers (e.g. by doing CREATE TABLE AS
some_query) and then do a manual checkpoint and time how long that
times.
Greetings,
Andres Freund
To understand the performance effects of the above, I have taken the
average of five checkpoints with the patch and without the patch in my
environment. Here are the results.
With patch: 269.65 s
Without patch: 269.60 s

Those look like timed checkpoints - if the checkpoints are sleeping a
part of the time, you're not going to see any potential overhead.
Yes. The above data is collected from timed checkpoints.
create table t1(a int);
insert into t1 select * from generate_series(1,10000000);
I generated a lot of data by using the above queries which would in
turn trigger the checkpoint (wal).
---
To see whether this has an effect you'd have to make sure there's a
certain number of dirty buffers (e.g. by doing CREATE TABLE AS
some_query) and then do a manual checkpoint and time how long that
times.
For this case I have generated data by using below queries.
create table t1(a int);
insert into t1 select * from generate_series(1,8000000);
This does not trigger the checkpoint automatically. I have issued the
CHECKPOINT manually and measured the performance by considering an
average of 5 checkpoints. Here are the details.
With patch: 2.457 s
Without patch: 2.334 s
Please share your thoughts.
Thanks & Regards,
Nitin Jadhav
On Thu, Jul 7, 2022 at 5:34 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-06-13 19:08:35 +0530, Nitin Jadhav wrote:
Have you measured the performance effects of this? On fast storage with large
shared_buffers I've seen these loops in profiles. It's probably fine, but it'd
be good to verify that.

To understand the performance effects of the above, I have taken the
average of five checkpoints with the patch and without the patch in my
environment. Here are the results.
With patch: 269.65 s
Without patch: 269.60 s

Those look like timed checkpoints - if the checkpoints are sleeping a
part of the time, you're not going to see any potential overhead.

To see whether this has an effect you'd have to make sure there's a
certain number of dirty buffers (e.g. by doing CREATE TABLE AS
some_query) and then do a manual checkpoint and time how long that
times.

Greetings,
Andres Freund
Hi,
On 7/28/22 11:38 AM, Nitin Jadhav wrote:
To understand the performance effects of the above, I have taken the
average of five checkpoints with the patch and without the patch in my
environment. Here are the results.
With patch: 269.65 s
Without patch: 269.60 sThose look like timed checkpoints - if the checkpoints are sleeping a
part of the time, you're not going to see any potential overhead.

Yes. The above data is collected from timed checkpoints.
create table t1(a int);
insert into t1 select * from generate_series(1,10000000);

I generated a lot of data by using the above queries which would in
turn trigger the checkpoint (wal).
---To see whether this has an effect you'd have to make sure there's a
certain number of dirty buffers (e.g. by doing CREATE TABLE AS
some_query) and then do a manual checkpoint and time how long that
times.

For this case I have generated data by using below queries.
create table t1(a int);
insert into t1 select * from generate_series(1,8000000);

This does not trigger the checkpoint automatically. I have issued the
CHECKPOINT manually and measured the performance by considering an
average of 5 checkpoints. Here are the details.

With patch: 2.457 s
Without patch: 2.334 s

Please share your thoughts.
v6 was not applying anymore, due to a change in
doc/src/sgml/ref/checkpoint.sgml done by b9eb0ff09e (Rename
pg_checkpointer predefined role to pg_checkpoint).
Please find attached a rebase in v7.
While working on this rebase, I also noticed that "pg_checkpointer" is
still mentioned in some translation files:
"
$ git grep pg_checkpointer
src/backend/po/de.po:msgid "must be superuser or have privileges of
pg_checkpointer to do CHECKPOINT"
src/backend/po/ja.po:msgid "must be superuser or have privileges of
pg_checkpointer to do CHECKPOINT"
src/backend/po/ja.po:msgstr
"CHECKPOINTを実行するにはスーパーユーザーであるか、またはpg_checkpointerの権限を持つ必要があります"
src/backend/po/sv.po:msgid "must be superuser or have privileges of
pg_checkpointer to do CHECKPOINT"
"
I'm not familiar with how the translation files are handled (looks like
they have their own set of commits, see 3c0bcdbc66 for example) but
wanted to mention that "pg_checkpointer" is still mentioned (even if
that may be expected as the last commit related to translation files
(aka 3c0bcdbc66) is older than the one that renamed pg_checkpointer to
pg_checkpoint (aka b9eb0ff09e)).
That said, back to this patch: I did not look closely but noticed that
the buffers_total reported by pg_stat_progress_checkpoint:
postgres=# select type,flags,start_lsn,phase,buffers_total,new_requests
from pg_stat_progress_checkpoint;
type | flags | start_lsn | phase
| buffers_total | new_requests
------------+-----------------------+------------+-----------------------+---------------+--------------
checkpoint | immediate force wait | 1/E6C523A8 | checkpointing
buffers | 1024275 | false
(1 row)
is a little bit different from what is logged once completed:
2022-11-04 08:18:50.806 UTC [3488442] LOG: checkpoint complete: wrote
1024278 buffers (97.7%);
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v7-0001-pg_stat_progress_checkpoint-view.patch (text/plain; charset=UTF-8)
commit 5198f5010be0febd019b1888817e2d78b8e42b21
Author: bdrouvot <bdrouvot@gmail.com>
Date: Thu Nov 3 12:59:10 2022 +0000
v7-0001-pg_stat_progress_checkpoint-view.patch
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e5d622d514..ceb7d60ffa 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -414,6 +414,13 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
See <xref linkend='copy-progress-reporting'/>.
</entry>
</row>
+
+ <row>
+ <entry><structname>pg_stat_progress_checkpoint</structname><indexterm><primary>pg_stat_progress_checkpoint</primary></indexterm></entry>
+ <entry>One row only, showing the progress of the checkpoint.
+ See <xref linkend='checkpoint-progress-reporting'/>.
+ </entry>
+ </row>
</tbody>
</tgroup>
</table>
@@ -5784,7 +5791,7 @@ FROM pg_stat_get_backend_idset() AS backendid;
which support progress reporting are <command>ANALYZE</command>,
<command>CLUSTER</command>,
<command>CREATE INDEX</command>, <command>VACUUM</command>,
- <command>COPY</command>,
+ <command>COPY</command>, <command>CHECKPOINT</command>,
and <xref linkend="protocol-replication-base-backup"/> (i.e., replication
command that <xref linkend="app-pgbasebackup"/> issues to take
a base backup).
@@ -7072,6 +7079,402 @@ FROM pg_stat_get_backend_idset() AS backendid;
</table>
</sect2>
+ <sect2 id="checkpoint-progress-reporting">
+ <title>Checkpoint Progress Reporting</title>
+
+ <indexterm>
+ <primary>pg_stat_progress_checkpoint</primary>
+ </indexterm>
+
+ <para>
+ Whenever the checkpoint operation is running, the
+ <structname>pg_stat_progress_checkpoint</structname> view will contain a
+ single row indicating the progress of the checkpoint. The tables below
+ describe the information that will be reported and provide information about
+ how to interpret it.
+ </para>
+
+ <table id="pg-stat-progress-checkpoint-view" xreflabel="pg_stat_progress_checkpoint">
+ <title><structname>pg_stat_progress_checkpoint</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>pid</structfield> <type>integer</type>
+ </para>
+ <para>
+ Process ID of the checkpointer process.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of the checkpoint. See <xref linkend="checkpoint-types"/>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>flags</structfield> <type>text</type>
+ </para>
+ <para>
+ Flags of the checkpoint. See <xref linkend="checkpoint-flags"/>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>start_lsn</structfield> <type>text</type>
+ </para>
+ <para>
+ The checkpoint start location.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>start_time</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Start time of the checkpoint.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>phase</structfield> <type>text</type>
+ </para>
+ <para>
+ Current processing phase. See <xref linkend="checkpoint-phases"/>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>buffers_total</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Total number of buffers to be written. This is estimated and reported
+ as of the beginning of the buffer write operation.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>buffers_processed</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers processed. This counter increases when the targeted
+ buffer is processed. This number will eventually become equal to
+ <literal>buffers_total</literal> when the checkpoint is
+ complete.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>buffers_written</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers written. This counter only advances when the targeted
+ buffer is written. Note that some buffers are processed but do not
+ need to be written, so this count will always be less than or
+ equal to <literal>buffers_total</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>files_total</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Total number of files to be synced. This is estimated and reported as of
+ the beginning of the sync operation.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>files_synced</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of files synced. This counter advances when the targeted file is
+ synced. This number will eventually become equal to
+ <literal>files_total</literal> when the checkpoint is complete.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>new_requests</structfield> <type>text</type>
+ </para>
+ <para>
+ True if any backend has requested a new checkpoint while the
+ current checkpoint is in progress, false otherwise.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <table id="checkpoint-types">
+ <title>Checkpoint Types</title>
+ <tgroup cols="2">
+ <colspec colname="col1" colwidth="1*"/>
+ <colspec colname="col2" colwidth="2*"/>
+ <thead>
+ <row>
+ <entry>Types</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry><literal>checkpoint</literal></entry>
+ <entry>
+ The current operation is a checkpoint.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>restartpoint</literal></entry>
+ <entry>
+ The current operation is a restartpoint.
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <table id="checkpoint-flags">
+ <title>Checkpoint Flags</title>
+ <tgroup cols="2">
+ <colspec colname="col1" colwidth="1*"/>
+ <colspec colname="col2" colwidth="2*"/>
+ <thead>
+ <row>
+ <entry>Flags</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry><literal>shutdown</literal></entry>
+ <entry>
+ The checkpoint is for shutdown.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>end-of-recovery</literal></entry>
+ <entry>
+ The checkpoint is for end-of-recovery.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>immediate</literal></entry>
+ <entry>
+ The checkpoint happens without delays.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>force</literal></entry>
+ <entry>
+ The checkpoint is started because some operation (for which the
+ checkpoint is necessary) forced a checkpoint.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>flush-all</literal></entry>
+ <entry>
+ The checkpoint flushes all pages, including those belonging to unlogged
+ tables.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>wait</literal></entry>
+ <entry>
+ The operation that requested the checkpoint waits for it to complete
+ before returning.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>requested</literal></entry>
+ <entry>
+ The checkpoint request has been made.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>wal</literal></entry>
+ <entry>
+ The checkpoint is started because <literal>max_wal_size</literal> is
+ reached.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>time</literal></entry>
+ <entry>
+ The checkpoint is started because <literal>checkpoint_timeout</literal>
+ expired.
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <table id="checkpoint-phases">
+ <title>Checkpoint Phases</title>
+ <tgroup cols="2">
+ <colspec colname="col1" colwidth="1*"/>
+ <colspec colname="col2" colwidth="2*"/>
+ <thead>
+ <row>
+ <entry>Phase</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry><literal>initializing</literal></entry>
+ <entry>
+ The checkpointer process is preparing to begin the checkpoint operation.
+ This phase is expected to be very brief.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>getting virtual transaction IDs</literal></entry>
+ <entry>
+ The checkpointer process is getting the virtual transaction IDs that
+ are delaying the checkpoint.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing replication slots</literal></entry>
+ <entry>
+ The checkpointer process is currently flushing all the replication slots
+ to disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing logical replication snapshot files</literal></entry>
+ <entry>
+ The checkpointer process is currently removing all the serialized
+ snapshot files that are not required anymore.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing logical rewrite mapping files</literal></entry>
+ <entry>
+ The checkpointer process is currently removing no-longer-needed logical
+ rewrite mapping files and flushing the required ones to disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing replication origin</literal></entry>
+ <entry>
+ The checkpointer process is currently performing a checkpoint of each
+ replication origin's progress with respect to the replayed remote LSN.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing commit log pages</literal></entry>
+ <entry>
+ The checkpointer process is currently writing commit log pages to disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing commit time stamp pages</literal></entry>
+ <entry>
+ The checkpointer process is currently writing commit time stamp pages to
+ disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing subtransaction pages</literal></entry>
+ <entry>
+ The checkpointer process is currently writing subtransaction pages to disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing multixact pages</literal></entry>
+ <entry>
+ The checkpointer process is currently writing multixact pages to disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing predicate lock pages</literal></entry>
+ <entry>
+ The checkpointer process is currently writing predicate lock pages to disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>checkpointing buffers</literal></entry>
+ <entry>
+ The checkpointer process is currently writing buffers to disk.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>processing file sync requests</literal></entry>
+ <entry>
+ The checkpointer process is currently processing file sync requests.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>performing two phase checkpoint</literal></entry>
+ <entry>
+ The checkpointer process is currently checkpointing two-phase
+ (prepared transaction) state data.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>performing post checkpoint cleanup</literal></entry>
+ <entry>
+ The checkpointer process is currently performing post checkpoint cleanup.
+ It removes any lingering files that can be safely removed.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>invalidating replication slots</literal></entry>
+ <entry>
+ The checkpointer process is currently invalidating replication slots.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>recycling old WAL files</literal></entry>
+ <entry>
+ The checkpointer process is currently recycling old WAL files.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>truncating subtransactions</literal></entry>
+ <entry>
+ The checkpointer process is currently removing the subtransaction
+ segments.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>finalizing</literal></entry>
+ <entry>
+ The checkpointer process is finalizing the checkpoint operation.
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ </sect2>
+
</sect1>
<sect1 id="dynamic-trace">
diff --git a/doc/src/sgml/ref/checkpoint.sgml b/doc/src/sgml/ref/checkpoint.sgml
index 28a1d717b8..906883336d 100644
--- a/doc/src/sgml/ref/checkpoint.sgml
+++ b/doc/src/sgml/ref/checkpoint.sgml
@@ -55,6 +55,14 @@ CHECKPOINT
Only superusers or users with the privileges of
the <link linkend="predefined-roles-table"><literal>pg_checkpoint</literal></link>
role can call <command>CHECKPOINT</command>.
+
+ <para>
+ The checkpointer process running the checkpoint will report its progress
+ in the <structname>pg_stat_progress_checkpoint</structname> view except for
+ the shutdown and end-of-recovery cases. See
+ <xref linkend="checkpoint-progress-reporting"/> for details.
+ </para>
+
</para>
</refsect1>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index 6a38b53744..733a8d2837 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -531,7 +531,11 @@
adjust the <xref linkend="guc-archive-timeout"/> parameter rather than the
checkpoint parameters.)
It is also possible to force a checkpoint by using the SQL
- command <command>CHECKPOINT</command>.
+ command <command>CHECKPOINT</command>. The checkpointer process running the
+ checkpoint will report its progress in the
+ <structname>pg_stat_progress_checkpoint</structname> view except for the
+ shutdown and end-of-recovery cases. See
+ <xref linkend="checkpoint-progress-reporting"/> for details.
</para>
<para>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index be54c23187..f2cebe0973 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -67,6 +67,7 @@
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
#include "catalog/pg_database.h"
+#include "commands/progress.h"
#include "common/controldata_utils.h"
#include "common/file_utils.h"
#include "executor/instrument.h"
@@ -698,6 +699,8 @@ static void WALInsertLockAcquireExclusive(void);
static void WALInsertLockRelease(void);
static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
+static void checkpoint_progress_start(int flags, int type);
+
/*
* Insert an XLOG record represented by an already-constructed chain of data
* chunks. This is a low-level routine; to construct the WAL record header
@@ -6622,6 +6625,9 @@ CreateCheckPoint(int flags)
XLogCtl->RedoRecPtr = checkPoint.redo;
SpinLockRelease(&XLogCtl->info_lck);
+ /* Prepare to report progress of the checkpoint. */
+ checkpoint_progress_start(flags, PROGRESS_CHECKPOINT_TYPE_CHECKPOINT);
+
/*
* If enabled, log checkpoint start. We postpone this until now so as not
* to log anything if we decided to skip the checkpoint.
@@ -6704,6 +6710,8 @@ CreateCheckPoint(int flags)
* clog and we will correctly flush the update below. So we cannot miss
* any xacts we need to wait for.
*/
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_GET_VIRTUAL_TRANSACTION_IDS);
vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_START);
if (nvxids > 0)
{
@@ -6819,6 +6827,8 @@ CreateCheckPoint(int flags)
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_POST_CHECKPOINT_CLEANUP);
SyncPostCheckpoint();
/*
@@ -6834,6 +6844,9 @@ CreateCheckPoint(int flags)
*/
XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
KeepLogSeg(recptr, &_logSegNo);
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_INVALIDATE_REPLI_SLOTS);
+
if (InvalidateObsoleteReplicationSlots(_logSegNo))
{
/*
@@ -6844,6 +6857,8 @@ CreateCheckPoint(int flags)
KeepLogSeg(recptr, &_logSegNo);
}
_logSegNo--;
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_RECYCLE_OLD_XLOG);
RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr,
checkPoint.ThisTimeLineID);
@@ -6862,11 +6877,21 @@ CreateCheckPoint(int flags)
* StartupSUBTRANS hasn't been called yet.
*/
if (!RecoveryInProgress())
+ {
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_TRUNCATE_SUBTRANS);
TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());
+ }
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_FINALIZE);
/* Real work is done; log and update stats. */
LogCheckpointEnd(false);
+ /* Stop reporting progress of the checkpoint. */
+ pgstat_progress_end_command();
+
/* Reset the process title */
update_checkpoint_display(flags, false, true);
@@ -7023,29 +7048,63 @@ static void
CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
CheckPointRelationMap();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_REPLI_SLOTS);
CheckPointReplicationSlots();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_SNAPSHOTS);
CheckPointSnapBuild();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_LOGICAL_REWRITE_MAPPINGS);
CheckPointLogicalRewriteHeap();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_REPLI_ORIGIN);
CheckPointReplicationOrigin();
/* Write out all dirty data in SLRUs and the main buffer pool */
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_CLOG_PAGES);
CheckPointCLOG();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_COMMITTS_PAGES);
CheckPointCommitTs();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_SUBTRANS_PAGES);
CheckPointSUBTRANS();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_MULTIXACT_PAGES);
CheckPointMultiXact();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_PREDICATE_LOCK_PAGES);
CheckPointPredicate();
+
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_BUFFERS);
CheckPointBuffers(flags);
/* Perform all queued up fsyncs */
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_SYNC_FILES);
ProcessSyncRequests();
CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();
/* We deliberately delay 2PC checkpointing as long as possible */
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_TWO_PHASE);
CheckPointTwoPhase(checkPointRedo);
}
@@ -7195,6 +7254,9 @@ CreateRestartPoint(int flags)
MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
+ /* Prepare to report progress of the restartpoint. */
+ checkpoint_progress_start(flags, PROGRESS_CHECKPOINT_TYPE_RESTARTPOINT);
+
if (log_checkpoints)
LogCheckpointStart(flags, true);
@@ -7278,6 +7340,9 @@ CreateRestartPoint(int flags)
replayPtr = GetXLogReplayRecPtr(&replayTLI);
endptr = (receivePtr < replayPtr) ? replayPtr : receivePtr;
KeepLogSeg(endptr, &_logSegNo);
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_INVALIDATE_REPLI_SLOTS);
+
if (InvalidateObsoleteReplicationSlots(_logSegNo))
{
/*
@@ -7304,6 +7369,8 @@ CreateRestartPoint(int flags)
if (!RecoveryInProgress())
replayTLI = XLogCtl->InsertTimeLineID;
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_RECYCLE_OLD_XLOG);
RemoveOldXlogFiles(_logSegNo, RedoRecPtr, endptr, replayTLI);
/*
@@ -7320,11 +7387,20 @@ CreateRestartPoint(int flags)
* this because StartupSUBTRANS hasn't been called yet.
*/
if (EnableHotStandby)
+ {
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_TRUNCATE_SUBTRANS);
TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());
+ }
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+ PROGRESS_CHECKPOINT_PHASE_FINALIZE);
/* Real work is done; log and update stats. */
LogCheckpointEnd(true);
+ /* Stop reporting progress of the restartpoint. */
+ pgstat_progress_end_command();
+
/* Reset the process title */
update_checkpoint_display(flags, true, true);
@@ -8958,3 +9034,29 @@ SetWalWriterSleeping(bool sleeping)
XLogCtl->WalWriterSleeping = sleeping;
SpinLockRelease(&XLogCtl->info_lck);
}
+
+/*
+ * Start reporting progress of the checkpoint.
+ */
+static void
+checkpoint_progress_start(int flags, int type)
+{
+ const int index[] = {
+ PROGRESS_CHECKPOINT_TYPE,
+ PROGRESS_CHECKPOINT_FLAGS,
+ PROGRESS_CHECKPOINT_LSN,
+ PROGRESS_CHECKPOINT_START_TIMESTAMP,
+ PROGRESS_CHECKPOINT_PHASE
+ };
+ int64 val[5];
+
+ pgstat_progress_start_command(PROGRESS_COMMAND_CHECKPOINT, InvalidOid);
+
+ val[0] = type;
+ val[1] = flags;
+ val[2] = RedoRecPtr;
+ val[3] = CheckpointStats.ckpt_start_t;
+ val[4] = PROGRESS_CHECKPOINT_PHASE_INIT;
+
+ pgstat_progress_update_multi_param(5, index, val);
+}
\ No newline at end of file
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 2d8104b090..384ca35833 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1267,6 +1267,57 @@ CREATE VIEW pg_stat_progress_copy AS
FROM pg_stat_get_progress_info('COPY') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
+CREATE VIEW pg_stat_progress_checkpoint AS
+ SELECT
+ S.pid AS pid,
+ CASE S.param1 WHEN 1 THEN 'checkpoint'
+ WHEN 2 THEN 'restartpoint'
+ END AS type,
+ ( CASE WHEN (S.param2 & 4) > 0 THEN 'immediate ' ELSE '' END ||
+ CASE WHEN (S.param2 & 8) > 0 THEN 'force ' ELSE '' END ||
+ CASE WHEN (S.param2 & 16) > 0 THEN 'flush-all ' ELSE '' END ||
+ CASE WHEN (S.param2 & 32) > 0 THEN 'wait ' ELSE '' END ||
+ CASE WHEN (S.param2 & 128) > 0 THEN 'wal ' ELSE '' END ||
+ CASE WHEN (S.param2 & 256) > 0 THEN 'time ' ELSE '' END
+ ) AS flags,
+ ( '0/0'::pg_lsn +
+ ((CASE
+ WHEN S.param3 < 0 THEN pow(2::numeric, 64::numeric)::numeric
+ ELSE 0::numeric
+ END) +
+ S.param3::numeric)
+ ) AS start_lsn,
+ to_timestamp(946684800 + (S.param4::float8 / 1000000)) AS start_time,
+ CASE S.param5 WHEN 1 THEN 'initializing'
+ WHEN 2 THEN 'getting virtual transaction IDs'
+ WHEN 3 THEN 'checkpointing replication slots'
+ WHEN 4 THEN 'checkpointing logical replication snapshot files'
+ WHEN 5 THEN 'checkpointing logical rewrite mapping files'
+ WHEN 6 THEN 'checkpointing replication origin'
+ WHEN 7 THEN 'checkpointing commit log pages'
+ WHEN 8 THEN 'checkpointing commit time stamp pages'
+ WHEN 9 THEN 'checkpointing subtransaction pages'
+ WHEN 10 THEN 'checkpointing multixact pages'
+ WHEN 11 THEN 'checkpointing predicate lock pages'
+ WHEN 12 THEN 'checkpointing buffers'
+ WHEN 13 THEN 'processing file sync requests'
+ WHEN 14 THEN 'performing two phase checkpoint'
+ WHEN 15 THEN 'performing post checkpoint cleanup'
+ WHEN 16 THEN 'invalidating replication slots'
+ WHEN 17 THEN 'recycling old WAL files'
+ WHEN 18 THEN 'truncating subtransactions'
+ WHEN 19 THEN 'finalizing'
+ END AS phase,
+ S.param6 AS buffers_total,
+ S.param7 AS buffers_processed,
+ S.param8 AS buffers_written,
+ S.param9 AS files_total,
+ S.param10 AS files_synced,
+ CASE S.param11 WHEN 0 THEN 'false'
+ WHEN 1 THEN 'true'
+ END AS new_requests
+ FROM pg_stat_get_progress_info('CHECKPOINT') AS S;
+
CREATE VIEW pg_user_mappings AS
SELECT
U.oid AS umid,
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 5fc076fc14..21bf75b058 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -39,6 +39,7 @@
#include "access/xlog.h"
#include "access/xlog_internal.h"
#include "access/xlogrecovery.h"
+#include "commands/progress.h"
#include "libpq/pqsignal.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -163,7 +164,7 @@ static pg_time_t last_xlog_switch_time;
static void HandleCheckpointerInterrupts(void);
static void CheckArchiveTimeout(void);
static bool IsCheckpointOnSchedule(double progress);
-static bool ImmediateCheckpointRequested(void);
+static bool ImmediateCheckpointRequested(int flags);
static bool CompactCheckpointerRequestQueue(void);
static void UpdateSharedMemoryConfig(void);
@@ -667,16 +668,24 @@ CheckArchiveTimeout(void)
* there is one pending behind it.)
*/
static bool
-ImmediateCheckpointRequested(void)
+ImmediateCheckpointRequested(int flags)
{
volatile CheckpointerShmemStruct *cps = CheckpointerShmem;
+ if (cps->ckpt_flags & CHECKPOINT_REQUESTED)
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_NEW_REQUESTS, true);
+
/*
* We don't need to acquire the ckpt_lck in this case because we're only
* looking at a single flag bit.
*/
if (cps->ckpt_flags & CHECKPOINT_IMMEDIATE)
+ {
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_FLAGS,
+ (flags | CHECKPOINT_IMMEDIATE));
return true;
+ }
+
return false;
}
@@ -708,7 +717,7 @@ CheckpointWriteDelay(int flags, double progress)
*/
if (!(flags & CHECKPOINT_IMMEDIATE) &&
!ShutdownRequestPending &&
- !ImmediateCheckpointRequested() &&
+ !ImmediateCheckpointRequested(flags) &&
IsCheckpointOnSchedule(progress))
{
if (ConfigReloadPending)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 73d30bf619..6d69255667 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -39,6 +39,7 @@
#include "catalog/catalog.h"
#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
+#include "commands/progress.h"
#include "executor/instrument.h"
#include "lib/binaryheap.h"
#include "miscadmin.h"
@@ -2026,6 +2027,8 @@ BufferSync(int flags)
WritebackContextInit(&wb_context, &checkpoint_flush_after);
TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_scan);
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_BUFFERS_TOTAL,
+ num_to_scan);
/*
* Sort buffers that need to be written to reduce the likelihood of random
@@ -2143,6 +2146,8 @@ BufferSync(int flags)
bufHdr = GetBufferDescriptor(buf_id);
num_processed++;
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_BUFFERS_PROCESSED,
+ num_processed);
/*
* We don't need to acquire the lock here, because we're only looking
@@ -2163,6 +2168,8 @@ BufferSync(int flags)
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
PendingCheckpointerStats.buf_written_checkpoints++;
num_written++;
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_BUFFERS_WRITTEN,
+ num_written);
}
}
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 9d6a9e9109..aa25215910 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -23,6 +23,7 @@
#include "access/multixact.h"
#include "access/xlog.h"
#include "access/xlogutils.h"
+#include "commands/progress.h"
#include "commands/tablespace.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -368,6 +369,9 @@ ProcessSyncRequests(void)
/* Now scan the hashtable for fsync requests to process */
absorb_counter = FSYNCS_PER_ABSORB;
hash_seq_init(&hstat, pendingOps);
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_FILES_TOTAL,
+ hash_get_num_entries(pendingOps));
+
while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
{
int failures;
@@ -431,6 +435,8 @@ ProcessSyncRequests(void)
longest = elapsed;
total_elapsed += elapsed;
processed++;
+ pgstat_progress_update_param(PROGRESS_CHECKPOINT_FILES_SYNCED,
+ processed);
if (log_checkpoints)
elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f ms",
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 96bffc0f2a..8281d961bf 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -497,6 +497,8 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
cmdtype = PROGRESS_COMMAND_BASEBACKUP;
else if (pg_strcasecmp(cmd, "COPY") == 0)
cmdtype = PROGRESS_COMMAND_COPY;
+ else if (pg_strcasecmp(cmd, "CHECKPOINT") == 0)
+ cmdtype = PROGRESS_COMMAND_CHECKPOINT;
else
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index a28938caf4..33a64d2f0b 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -151,4 +151,42 @@
#define PROGRESS_COPY_TYPE_PIPE 3
#define PROGRESS_COPY_TYPE_CALLBACK 4
+/* Progress parameters for checkpoint */
+#define PROGRESS_CHECKPOINT_TYPE 0
+#define PROGRESS_CHECKPOINT_FLAGS 1
+#define PROGRESS_CHECKPOINT_LSN 2
+#define PROGRESS_CHECKPOINT_START_TIMESTAMP 3
+#define PROGRESS_CHECKPOINT_PHASE 4
+#define PROGRESS_CHECKPOINT_BUFFERS_TOTAL 5
+#define PROGRESS_CHECKPOINT_BUFFERS_PROCESSED 6
+#define PROGRESS_CHECKPOINT_BUFFERS_WRITTEN 7
+#define PROGRESS_CHECKPOINT_FILES_TOTAL 8
+#define PROGRESS_CHECKPOINT_FILES_SYNCED 9
+#define PROGRESS_CHECKPOINT_NEW_REQUESTS 10
+
+/* Types of checkpoint (as advertised via PROGRESS_CHECKPOINT_TYPE) */
+#define PROGRESS_CHECKPOINT_TYPE_CHECKPOINT 1
+#define PROGRESS_CHECKPOINT_TYPE_RESTARTPOINT 2
+
+/* Phases of checkpoint (as advertised via PROGRESS_CHECKPOINT_PHASE) */
+#define PROGRESS_CHECKPOINT_PHASE_INIT 1
+#define PROGRESS_CHECKPOINT_PHASE_GET_VIRTUAL_TRANSACTION_IDS 2
+#define PROGRESS_CHECKPOINT_PHASE_REPLI_SLOTS 3
+#define PROGRESS_CHECKPOINT_PHASE_SNAPSHOTS 4
+#define PROGRESS_CHECKPOINT_PHASE_LOGICAL_REWRITE_MAPPINGS 5
+#define PROGRESS_CHECKPOINT_PHASE_REPLI_ORIGIN 6
+#define PROGRESS_CHECKPOINT_PHASE_CLOG_PAGES 7
+#define PROGRESS_CHECKPOINT_PHASE_COMMITTS_PAGES 8
+#define PROGRESS_CHECKPOINT_PHASE_SUBTRANS_PAGES 9
+#define PROGRESS_CHECKPOINT_PHASE_MULTIXACT_PAGES 10
+#define PROGRESS_CHECKPOINT_PHASE_PREDICATE_LOCK_PAGES 11
+#define PROGRESS_CHECKPOINT_PHASE_BUFFERS 12
+#define PROGRESS_CHECKPOINT_PHASE_SYNC_FILES 13
+#define PROGRESS_CHECKPOINT_PHASE_TWO_PHASE 14
+#define PROGRESS_CHECKPOINT_PHASE_POST_CHECKPOINT_CLEANUP 15
+#define PROGRESS_CHECKPOINT_PHASE_INVALIDATE_REPLI_SLOTS 16
+#define PROGRESS_CHECKPOINT_PHASE_RECYCLE_OLD_XLOG 17
+#define PROGRESS_CHECKPOINT_PHASE_TRUNCATE_SUBTRANS 18
+#define PROGRESS_CHECKPOINT_PHASE_FINALIZE 19
+
#endif
diff --git a/src/include/utils/backend_progress.h b/src/include/utils/backend_progress.h
index 47bf8029b0..02d51fb948 100644
--- a/src/include/utils/backend_progress.h
+++ b/src/include/utils/backend_progress.h
@@ -27,7 +27,8 @@ typedef enum ProgressCommandType
PROGRESS_COMMAND_CLUSTER,
PROGRESS_COMMAND_CREATE_INDEX,
PROGRESS_COMMAND_BASEBACKUP,
- PROGRESS_COMMAND_COPY
+ PROGRESS_COMMAND_COPY,
+ PROGRESS_COMMAND_CHECKPOINT
} ProgressCommandType;
#define PGSTAT_NUM_PROGRESS_PARAM 20
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 624d0e5aae..eab68d7f14 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1913,6 +1913,76 @@ pg_stat_progress_basebackup| SELECT s.pid,
s.param4 AS tablespaces_total,
s.param5 AS tablespaces_streamed
FROM pg_stat_get_progress_info('BASEBACKUP'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20);
+pg_stat_progress_checkpoint| SELECT s.pid,
+ CASE s.param1
+ WHEN 1 THEN 'checkpoint'::text
+ WHEN 2 THEN 'restartpoint'::text
+ ELSE NULL::text
+ END AS type,
+ (((((
+ CASE
+ WHEN ((s.param2 & (4)::bigint) > 0) THEN 'immediate '::text
+ ELSE ''::text
+ END ||
+ CASE
+ WHEN ((s.param2 & (8)::bigint) > 0) THEN 'force '::text
+ ELSE ''::text
+ END) ||
+ CASE
+ WHEN ((s.param2 & (16)::bigint) > 0) THEN 'flush-all '::text
+ ELSE ''::text
+ END) ||
+ CASE
+ WHEN ((s.param2 & (32)::bigint) > 0) THEN 'wait '::text
+ ELSE ''::text
+ END) ||
+ CASE
+ WHEN ((s.param2 & (128)::bigint) > 0) THEN 'wal '::text
+ ELSE ''::text
+ END) ||
+ CASE
+ WHEN ((s.param2 & (256)::bigint) > 0) THEN 'time '::text
+ ELSE ''::text
+ END) AS flags,
+ ('0/0'::pg_lsn + (
+ CASE
+ WHEN (s.param3 < 0) THEN pow((2)::numeric, (64)::numeric)
+ ELSE (0)::numeric
+ END + (s.param3)::numeric)) AS start_lsn,
+ to_timestamp(((946684800)::double precision + ((s.param4)::double precision / (1000000)::double precision))) AS start_time,
+ CASE s.param5
+ WHEN 1 THEN 'initializing'::text
+ WHEN 2 THEN 'getting virtual transaction IDs'::text
+ WHEN 3 THEN 'checkpointing replication slots'::text
+ WHEN 4 THEN 'checkpointing logical replication snapshot files'::text
+ WHEN 5 THEN 'checkpointing logical rewrite mapping files'::text
+ WHEN 6 THEN 'checkpointing replication origin'::text
+ WHEN 7 THEN 'checkpointing commit log pages'::text
+ WHEN 8 THEN 'checkpointing commit time stamp pages'::text
+ WHEN 9 THEN 'checkpointing subtransaction pages'::text
+ WHEN 10 THEN 'checkpointing multixact pages'::text
+ WHEN 11 THEN 'checkpointing predicate lock pages'::text
+ WHEN 12 THEN 'checkpointing buffers'::text
+ WHEN 13 THEN 'processing file sync requests'::text
+ WHEN 14 THEN 'performing two phase checkpoint'::text
+ WHEN 15 THEN 'performing post checkpoint cleanup'::text
+ WHEN 16 THEN 'invalidating replication slots'::text
+ WHEN 17 THEN 'recycling old WAL files'::text
+ WHEN 18 THEN 'truncating subtransactions'::text
+ WHEN 19 THEN 'finalizing'::text
+ ELSE NULL::text
+ END AS phase,
+ s.param6 AS buffers_total,
+ s.param7 AS buffers_processed,
+ s.param8 AS buffers_written,
+ s.param9 AS files_total,
+ s.param10 AS files_synced,
+ CASE s.param11
+ WHEN 0 THEN 'false'::text
+ WHEN 1 THEN 'true'::text
+ ELSE NULL::text
+ END AS new_requests
+ FROM pg_stat_get_progress_info('CHECKPOINT'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20);
pg_stat_progress_cluster| SELECT s.pid,
s.datid,
d.datname,
v6 was not applying anymore, due to a change in
doc/src/sgml/ref/checkpoint.sgml done by b9eb0ff09e (Rename
pg_checkpointer predefined role to pg_checkpoint). Please find attached
a rebase in v7.

While working on this rebase, I also noticed that "pg_checkpointer" is
still mentioned in some translation files:

Thanks for rebasing the patch and sharing the information.
---
That said, back to this patch: I did not look closely but noticed that
the buffers_total reported by pg_stat_progress_checkpoint:

postgres=# select type,flags,start_lsn,phase,buffers_total,new_requests
           from pg_stat_progress_checkpoint;
    type    |        flags          | start_lsn  |         phase         | buffers_total | new_requests
------------+-----------------------+------------+-----------------------+---------------+--------------
 checkpoint | immediate force wait  | 1/E6C523A8 | checkpointing buffers |       1024275 | false
(1 row)

is a little bit different from what is logged once completed:
2022-11-04 08:18:50.806 UTC [3488442] LOG: checkpoint complete: wrote
1024278 buffers (97.7%);
This is because the count shown in the checkpoint complete message
includes the additional increments done during SlruInternalWritePage().
We cannot be sure of such an increment until it actually happens, hence
it was not considered in the patch. To make the view consistent with the
checkpoint complete message, we should increment all three counters
there: buffers_total, buffers_processed and buffers_written. As a
result, buffers_total may grow beyond the value calculated at the start
of the checkpoint. If this looks good, I will update this in the next
patch.
Thanks & Regards,
Nitin Jadhav
On Fri, Nov 4, 2022 at 1:57 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
Hi,
On 7/28/22 11:38 AM, Nitin Jadhav wrote:
To understand the performance effects of the above, I have taken the
average of five checkpoints with the patch and without the patch in my
environment. Here are the results.
With patch: 269.65 s
Without patch: 269.60 s

Those look like timed checkpoints - if the checkpoints are sleeping a
part of the time, you're not going to see any potential overhead.

Yes. The above data is collected from timed checkpoints.

create table t1(a int);
insert into t1 select * from generate_series(1,10000000);

I generated a lot of data by using the above queries, which would in
turn trigger the checkpoint (wal).
---

To see whether this has an effect you'd have to make sure there's a
certain number of dirty buffers (e.g. by doing CREATE TABLE AS
some_query) and then do a manual checkpoint and time how long that
takes.

For this case I have generated data by using the below queries.

create table t1(a int);
insert into t1 select * from generate_series(1,8000000);

This does not trigger the checkpoint automatically. I have issued
CHECKPOINT manually and measured the performance by taking an
average of 5 checkpoints. Here are the details.

With patch: 2.457 s
Without patch: 2.334 s

Please share your thoughts.
v6 was not applying anymore, due to a change in
doc/src/sgml/ref/checkpoint.sgml done by b9eb0ff09e (Rename
pg_checkpointer predefined role to pg_checkpoint). Please find attached
a rebase in v7.
While working on this rebase, I also noticed that "pg_checkpointer" is
still mentioned in some translation files:
"
$ git grep pg_checkpointer
src/backend/po/de.po:msgid "must be superuser or have privileges of
pg_checkpointer to do CHECKPOINT"
src/backend/po/ja.po:msgid "must be superuser or have privileges of
pg_checkpointer to do CHECKPOINT"
src/backend/po/ja.po:msgstr
"CHECKPOINTを実行するにはスーパーユーザーであるか、またはpg_checkpointerの権限を持つ必要があります"
src/backend/po/sv.po:msgid "must be superuser or have privileges of
pg_checkpointer to do CHECKPOINT"
"I'm not familiar with how the translation files are handled (looks like
they have their own set of commits, see 3c0bcdbc66 for example) but
wanted to mention that "pg_checkpointer" is still mentioned (even if
that may be expected as the last commit related to translation files
(aka 3c0bcdbc66) is older than the one that renamed pg_checkpointer to
pg_checkpoint (aka b9eb0ff09e)).

That said, back to this patch: I did not look closely but noticed that
the buffers_total reported by pg_stat_progress_checkpoint:

postgres=# select type,flags,start_lsn,phase,buffers_total,new_requests
           from pg_stat_progress_checkpoint;
    type    |        flags          | start_lsn  |         phase         | buffers_total | new_requests
------------+-----------------------+------------+-----------------------+---------------+--------------
 checkpoint | immediate force wait  | 1/E6C523A8 | checkpointing buffers |       1024275 | false
(1 row)

is a little bit different from what is logged once completed:

2022-11-04 08:18:50.806 UTC [3488442] LOG:  checkpoint complete: wrote
1024278 buffers (97.7%);

Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Fri, Nov 4, 2022 at 4:27 AM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
Please find attached a rebase in v7.
I don't think it's a good thing that this patch is using the
progress-reporting machinery. The point of that machinery is that we
want any backend to be able to report progress for any command it
happens to be running, and we don't know which command that will be at
any given point in time, or how many backends will be running any
given command at once. So we need some generic set of counters that
can be repurposed for whatever any particular backend happens to be
doing right at the moment.
But none of that applies to the checkpointer. Any information about
the checkpointer that we want to expose can just be advertised in a
dedicated chunk of shared memory, perhaps even by simply adding it to
CheckpointerShmemStruct. Then you can give the fields whatever names,
types, and sizes you like, and you don't have to do all of this stuff
with mapping down to integers and back. The only real disadvantage
that I can see is then you have to think a bit harder about what the
concurrency model is here, and maybe you end up reimplementing
something similar to what the progress-reporting stuff does for you,
and *maybe* that is a sufficient reason to do it this way.
But I'm doubtful. This feels like a square-peg-round-hole situation.
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
On 2022-11-04 09:25:52 +0100, Drouvot, Bertrand wrote:
@@ -7023,29 +7048,63 @@ static void
 CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 {
 	CheckPointRelationMap();
+
+	pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+								 PROGRESS_CHECKPOINT_PHASE_REPLI_SLOTS);
 	CheckPointReplicationSlots();
+
+	pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+								 PROGRESS_CHECKPOINT_PHASE_SNAPSHOTS);
 	CheckPointSnapBuild();
+
+	pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+								 PROGRESS_CHECKPOINT_PHASE_LOGICAL_REWRITE_MAPPINGS);
 	CheckPointLogicalRewriteHeap();
+
+	pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+								 PROGRESS_CHECKPOINT_PHASE_REPLI_ORIGIN);
 	CheckPointReplicationOrigin();

 	/* Write out all dirty data in SLRUs and the main buffer pool */
 	TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
 	CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
+
+	pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+								 PROGRESS_CHECKPOINT_PHASE_CLOG_PAGES);
 	CheckPointCLOG();
+
+	pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+								 PROGRESS_CHECKPOINT_PHASE_COMMITTS_PAGES);
 	CheckPointCommitTs();
+
+	pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+								 PROGRESS_CHECKPOINT_PHASE_SUBTRANS_PAGES);
 	CheckPointSUBTRANS();
+
+	pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+								 PROGRESS_CHECKPOINT_PHASE_MULTIXACT_PAGES);
 	CheckPointMultiXact();
+
+	pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+								 PROGRESS_CHECKPOINT_PHASE_PREDICATE_LOCK_PAGES);
 	CheckPointPredicate();
+
+	pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+								 PROGRESS_CHECKPOINT_PHASE_BUFFERS);
 	CheckPointBuffers(flags);

 	/* Perform all queued up fsyncs */
 	TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
 	CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
+	pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+								 PROGRESS_CHECKPOINT_PHASE_SYNC_FILES);
 	ProcessSyncRequests();
 	CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
 	TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();

 	/* We deliberately delay 2PC checkpointing as long as possible */
+	pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
+								 PROGRESS_CHECKPOINT_PHASE_TWO_PHASE);
 	CheckPointTwoPhase(checkPointRedo);
 }
This is quite the code bloat. Can we make this less duplicative?
+CREATE VIEW pg_stat_progress_checkpoint AS
+    SELECT
+        S.pid AS pid,
+        CASE S.param1 WHEN 1 THEN 'checkpoint'
+                      WHEN 2 THEN 'restartpoint'
+        END AS type,
+        ( CASE WHEN (S.param2 & 4) > 0 THEN 'immediate ' ELSE '' END ||
+          CASE WHEN (S.param2 & 8) > 0 THEN 'force ' ELSE '' END ||
+          CASE WHEN (S.param2 & 16) > 0 THEN 'flush-all ' ELSE '' END ||
+          CASE WHEN (S.param2 & 32) > 0 THEN 'wait ' ELSE '' END ||
+          CASE WHEN (S.param2 & 128) > 0 THEN 'wal ' ELSE '' END ||
+          CASE WHEN (S.param2 & 256) > 0 THEN 'time ' ELSE '' END
+        ) AS flags,
+        ( '0/0'::pg_lsn +
+          ((CASE
+              WHEN S.param3 < 0 THEN pow(2::numeric, 64::numeric)::numeric
+              ELSE 0::numeric
+            END) +
+           S.param3::numeric)
+        ) AS start_lsn,
I don't think we should embed this much complexity in the view
definitions. It's hard to read, bloats the catalog, and we can't fix them once
released. This stuff seems like it should be in a helper function.
I don't have any idea what that pow stuff is supposed to be doing.
+ to_timestamp(946684800 + (S.param4::float8 / 1000000)) AS start_time,
I don't think this is a reasonable path - embedding way too much low-level
details about the timestamp format in the view definition. Why do we need to
do this?
Greetings,
Andres Freund
On Wed, Nov 16, 2022 at 1:35 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Nov 4, 2022 at 4:27 AM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
Please find attached a rebase in v7.
I don't think it's a good thing that this patch is using the
progress-reporting machinery. The point of that machinery is that we
want any backend to be able to report progress for any command it
happens to be running, and we don't know which command that will be at
any given point in time, or how many backends will be running any
given command at once. So we need some generic set of counters that
can be repurposed for whatever any particular backend happens to be
doing right at the moment.
Hm.
But none of that applies to the checkpointer. Any information about
the checkpointer that we want to expose can just be advertised in a
dedicated chunk of shared memory, perhaps even by simply adding it to
CheckpointerShmemStruct. Then you can give the fields whatever names,
types, and sizes you like, and you don't have to do all of this stuff
with mapping down to integers and back. The only real disadvantage
that I can see is then you have to think a bit harder about what the
concurrency model is here, and maybe you end up reimplementing
something similar to what the progress-reporting stuff does for you,
and *maybe* that is a sufficient reason to do it this way.
-1 for CheckpointerShmemStruct as it is being used for running
checkpoints and I don't think adding stats to it is a great idea.
Instead, extending PgStat_CheckpointerStats and using shared memory
stats for reporting progress/last checkpoint related stats is a good
idea IMO. I also think that a new pg_stat_checkpoint view is needed
because, right now, the PgStat_CheckpointerStats stats are exposed via
the pg_stat_bgwriter view, having a separate view for checkpoint stats
is good here. Also, removing CheckpointStatsData and moving all of
those members to PgStat_CheckpointerStats, of course, by being careful
about the amount of shared memory required, is also a good idea IMO.
Going forward, PgStat_CheckpointerStats and pg_stat_checkpoint view
can be a single point of location for all the checkpoint related
stats.
Thoughts?
In fact, I was recently having an off-list chat with Bertrand Drouvot
about the above idea.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On 2022-11-16 16:01:55 +0530, Bharath Rupireddy wrote:
-1 for CheckpointerShmemStruct as it is being used for running
checkpoints and I don't think adding stats to it is a great idea.
Why? Imo the data needed for progress reporting aren't really "stats". We'd
not accumulate counters over time, just for the current checkpoint.
I think it might even be useful for other parts of the system to know what the
checkpointer is doing, e.g. bgwriter or autovacuum could adapt the behaviour
if checkpointer can't keep up. Somehow it'd feel wrong to use the stats system
as the source of such adjustments - but perhaps my gut feeling on that isn't
right.
The best argument for combining progress reporting with accumulating stats is
that we could likely share some of the code. Having accumulated stats for all
the checkpoint phases would e.g. be quite valuable.
Instead, extending PgStat_CheckpointerStats and using shared memory
stats for reporting progress/last checkpoint related stats is a good
idea IMO
There's certainly some potential for deduplicating state and to make stats
updated more frequently. But that doesn't necessarily mean that putting the
checkpoint progress into PgStat_CheckpointerStats is a good idea (nor the
opposite).
I also think that a new pg_stat_checkpoint view is needed
because, right now, the PgStat_CheckpointerStats stats are exposed via
the pg_stat_bgwriter view, having a separate view for checkpoint stats
is good here.
I agree that we should do that, but largely independent of the architectural
question at hand.
Greetings,
Andres Freund
On Wed, Nov 16, 2022 at 5:32 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
-1 for CheckpointerShmemStruct as it is being used for running
checkpoints and I don't think adding stats to it is a great idea.
Instead, extending PgStat_CheckpointerStats and using shared memory
stats for reporting progress/last checkpoint related stats is a good
idea IMO.
I agree with Andres: progress reporting isn't really quite the same
thing as stats, and either place seems like it could be reasonable. I
don't presently have an opinion on which is a better fit, but I don't
think the fact that CheckpointerShmemStruct is used for running
checkpoints rules anything out. Progress reporting is *also* about
running checkpoints. Any historical data you want to expose might not
be about running checkpoints, but, uh, so what? I don't really see
that as a strong argument against it fitting into this struct.
I also think that a new pg_stat_checkpoint view is needed
because, right now, the PgStat_CheckpointerStats stats are exposed via
the pg_stat_bgwriter view, having a separate view for checkpoint stats
is good here.
Yep.
Also, removing CheckpointStatsData and moving all of
those members to PgStat_CheckpointerStats, of course, by being careful
about the amount of shared memory required, is also a good idea IMO.
Going forward, PgStat_CheckpointerStats and pg_stat_checkpoint view
can be a single point of location for all the checkpoint related
stats.
I'm not sure that I completely follow this part, or that I agree with
it. I have never really understood why we drive background writer or
checkpointer statistics through the statistics collector. Here again,
for things like table statistics, there is no choice, because we could
have an unbounded number of tables and need to keep statistics about
all of them. The statistics collector can handle that by allocating
more memory as required. But there is only one background writer and
only one checkpointer, so that is not needed in those cases. Why not
just have them expose anything they want to expose through shared
memory directly?
If the statistics collector provides services that we care about, like
persisting data across restarts or making snapshots for transactional
behavior, then those might be reasons to go through it even for the
background writer or checkpointer. But if so, we should be explicit
about what the reasons are, both in the mailing list discussion and in
code comments. Otherwise I fear that we'll just end up doing something
in a more complicated way than is really necessary.
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
On 2022-11-16 14:19:32 -0500, Robert Haas wrote:
I have never really understood why we drive background writer or
checkpointer statistics through the statistics collector.
To some degree it is required for durability - the stats system needs to know
how to write out those stats. But that wasn't ever a good reason to send
messages to the stats collector - it could just read the stats from shared
memory after all.
There's also integration with snapshots of the stats, resetting them, etc.
There's also the complexity that some of the stats e.g. for checkpointer
aren't about work the checkpointer did, but just have ended up there for
historical raisins. E.g. the number of fsyncs and writes done by backends.
See below:
Here again, for things like table statistics, there is no choice, because we
could have an unbounded number of tables and need to keep statistics about
all of them. The statistics collector can handle that by allocating more
memory as required. But there is only one background writer and only one
checkpointer, so that is not needed in those cases. Why not just have them
expose anything they want to expose through shared memory directly?
That's how it is in 15+. The memory for "fixed-numbered" or "global"
statistics are maintained by the stats system, but in plain shared memory,
allocated at server start. Not via the hash table.
Right now stats updates for the checkpointer use the "changecount" approach to
updates. For now that makes sense, because we update the stats only
occasionally (after a checkpoint or when writing in CheckpointWriteDelay()) -
a stats viewer seeing the checkpoint count go up, without yet seeing the
corresponding buffers written would be misleading.
I don't think we'd want every buffer write or whatnot go through the
changecount mechanism, on some non-x86 platforms that could be noticeable. But
if we didn't stage the stats updates locally I think we could make most of the
stats changes without that overhead. For updates that just increment a single
counter there's simply no benefit in the changecount mechanism afaict.
I didn't want to do that change during the initial shared memory stats work,
it already was bigger than I could handle...
It's not quite clear to me what the best path forward is for
buf_written_backend / buf_fsync_backend, which currently are reported via the
checkpointer stats. I think the best path might be to stop counting them via
the CheckpointerShmem->num_backend_writes etc and just populate the fields in
the view (for backward compat) via the proposed [1] pg_stat_io patch. Doing
that accounting with CheckpointerCommLock held exclusively isn't free.
If the statistics collector provides services that we care about, like
persisting data across restarts or making snapshots for transactional
behavior, then those might be reasons to go through it even for the
background writer or checkpointer. But if so, we should be explicit
about what the reasons are, both in the mailing list discussion and in
code comments. Otherwise I fear that we'll just end up doing something
in a more complicated way than is really necessary.
I tried to provide at least some of that in the comments at the start of
pgstat.c in 15+. There's very likely more that should be added, but I think
it's a decent start.
Greetings,
Andres Freund
[1]: /messages/by-id/CAOtHd0ApHna7_p6mvHoO+gLZdxjaQPRemg3_o0a4ytCPijLytQ@mail.gmail.com
On Thu, Nov 17, 2022 at 12:49 AM Robert Haas <robertmhaas@gmail.com> wrote:
I also think that a new pg_stat_checkpoint view is needed
because, right now, the PgStat_CheckpointerStats stats are exposed via
the pg_stat_bgwriter view, having a separate view for checkpoint stats
is good here.

Yep.
On Wed, Nov 16, 2022 at 11:44 PM Andres Freund <andres@anarazel.de> wrote:
I also think that a new pg_stat_checkpoint view is needed
because, right now, the PgStat_CheckpointerStats stats are exposed via
the pg_stat_bgwriter view, having a separate view for checkpoint stats
is good here.

I agree that we should do that, but largely independent of the architectural
question at hand.
Thanks. I quickly prepared a patch introducing pg_stat_checkpointer
view and posted it here -
/messages/by-id/CALj2ACVxX2ii=66RypXRweZe2EsBRiPMj0aHfRfHUeXJcC7kHg@mail.gmail.com.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Wed, Nov 16, 2022 at 2:52 PM Andres Freund <andres@anarazel.de> wrote:
I don't think we'd want every buffer write or whatnot go through the
changecount mechanism, on some non-x86 platforms that could be noticeable. But
if we didn't stage the stats updates locally I think we could make most of the
stats changes without that overhead. For updates that just increment a single
counter there's simply no benefit in the changecount mechanism afaict.
You might be right, but I'm not sure whether it's worth stressing
about. The progress reporting mechanism uses the st_changecount
mechanism, too, and as far as I know nobody's complained about that
having too much overhead. Maybe they have, though, and I've just
missed it.
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
On 2022-11-17 09:03:32 -0500, Robert Haas wrote:
On Wed, Nov 16, 2022 at 2:52 PM Andres Freund <andres@anarazel.de> wrote:
I don't think we'd want every buffer write or whatnot go through the
changecount mechanism, on some non-x86 platforms that could be noticeable. But
if we didn't stage the stats updates locally I think we could make most of the
stats changes without that overhead. For updates that just increment a single
counter there's simply no benefit in the changecount mechanism afaict.

You might be right, but I'm not sure whether it's worth stressing
about. The progress reporting mechanism uses the st_changecount
mechanism, too, and as far as I know nobody's complained about that
having too much overhead. Maybe they have, though, and I've just
missed it.
I've seen it in profiles, although not as the major contributor. Most places
do a reasonable amount of work between calls though.
As an experiment, I added a progress report to BufferSync()'s first loop
(i.e. where it checks all buffers). On a 128GB shared_buffers cluster that
increases the time for a do-nothing checkpoint from ~235ms to ~280ms. If I
remove the changecount stuff and use a single write + write barrier, it ends
up as 250ms. Inlining brings it down a bit further, to 247ms.
Obviously this is a very extreme case - we only do very little work between
the progress report calls. But it does seem to show that the overhead is not
entirely negligible.
I think pgstat_progress_start_command() needs the changecount stuff, as does
pgstat_progress_update_multi_param(). But for anything updating a single
parameter at a time it really doesn't do anything useful on a platform that
doesn't tear 64bit writes (so it could be #ifdef
PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY).
Out of further curiosity I wanted to test the impact when the loop doesn't
even do a LockBufHdr() and added an unlocked pre-check. 109ms without
progress. 138ms with. 114ms with the simplified
pgstat_progress_update_param(). 108ms after inlining the simplified
pgstat_progress_update_param().
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
I think pgstat_progress_start_command() needs the changecount stuff, as does
pgstat_progress_update_multi_param(). But for anything updating a single
parameter at a time it really doesn't do anything useful on a platform that
doesn't tear 64bit writes (so it could be #ifdef
PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY).
Seems safe to restrict it to that case.
regards, tom lane
On Thu, Nov 17, 2022 at 11:24 AM Andres Freund <andres@anarazel.de> wrote:
As an experiment, I added a progress report to BufferSync()'s first loop
(i.e. where it checks all buffers). On a 128GB shared_buffers cluster that
increases the time for a do-nothing checkpoint from ~235ms to ~280ms. If I
remove the changecount stuff and use a single write + write barrier, it ends
up as 250ms. Inlining brings it down a bit further, to 247ms.
OK, I'd say that's pretty good evidence that we can't totally
disregard the issue.
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
On 2022-11-04 09:25:52 +0100, Drouvot, Bertrand wrote:
Please find attached a rebase in v7.
cfbot complains that the docs don't build:
https://cirrus-ci.com/task/6694349031866368?logs=docs_build#L296
[03:24:27.317] ref/checkpoint.sgml:66: element para: validity error : Element para is not declared in para list of possible children
I've marked the patch as waiting-on-author for now.
There's been a bunch of architectural feedback too, but tbh, I don't know if
we came to any conclusion on that front...
Greetings,
Andres Freund
On Thu, 8 Dec 2022 at 00:33, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-11-04 09:25:52 +0100, Drouvot, Bertrand wrote:
Please find attached a rebase in v7.
cfbot complains that the docs don't build:
https://cirrus-ci.com/task/6694349031866368?logs=docs_build#L296

[03:24:27.317] ref/checkpoint.sgml:66: element para: validity error : Element para is not declared in para list of possible children
I've marked the patch as waiting-on-author for now.
There's been a bunch of architectural feedback too, but tbh, I don't know if
we came to any conclusion on that front...
There have been no updates on this thread for some time, so this has
been switched to Returned with Feedback. Feel free to open it in the
next commitfest if you plan to continue on this.
Regards,
Vignesh
Hi,
It’s been a long gap in the activity of this thread, and I apologize
for the delay. However, I have now returned and reviewed the other
threads [1],[2] that have made changes in this area. I would like to
share a summary of the discussion that took place among Robert,
Andres, Bharath, and Tom on this thread, to make it easier to move
forward.
Robert was dissatisfied with the approach used in the patch to report
progress for the checkpointer process, as he felt the current
mechanism is not suitable. He proposed allocating a dedicated chunk of
shared memory in CheckpointerShmemStruct. Bharath opposed this,
suggesting instead to use PgStat_CheckpointerStats. Andres somewhat
supported Robert’s idea but noted that using PgStat_CheckpointerStats
would allow for more code reuse.
The discussion then shifted towards the statistics handling for the
checkpointer process. Robert expressed dissatisfaction with the
current statistics handling mechanism. Andres explained the rationale
behind the existing setup and the improvements made in pg_stat_io. He
also mentioned the overhead of the changecount mechanism when updating
for every buffer write. However, for updates involving a single
parameter at a time, performance can be improved on platforms that
support atomic 64-bit writes (indicated by #ifdef
PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY). He also shared performance
metrics demonstrating good results with this approach. Tom agreed it
seems safe to restrict the optimization to that case.
But I am not quite clear on the direction ahead. Let me summarise the
approaches based on the above discussion.
Approach-1: The approach used in the current patch which uses the
existing mechanism of progress reporting. The advantage of this
approach is that the machinery is already in place and ready to use.
However, it is not suitable for the checkpointer process because only
the checkpointer process runs the checkpoint, even if the command is
issued from a different backend. The current mechanism is designed for
any backend to report progress for any command it is running, and we
don’t know which command that will be at any given point in time, or
how many backends will be running any given command simultaneously.
Hence, this approach does not fit the checkpointer. Additionally,
there is complexity involved in mapping down to integers and back.
Approach-2: Allocate a dedicated chunk of shared memory in
CheckpointerShmemStruct with an appropriate name and size. This
approach eliminates the complexity involved in Approach-1 related to
mapping down to integers and back. However, it requires building the
necessary machinery to suit checkpointer progress reporting which
might be similar to the current progress reporting mechanism.
Approach-3: Using PgStat_CheckpointerStats to store the progress
information. Have we completely ruled out this approach?
Additionally all three approaches require improvements in the
changecount mechanism on platforms that support atomic 64-bit writes.
I’m inclined to favor Approach-2 because it provides a clearer method
for reporting progress for the checkpointer process, with the
additional work required to implement the necessary machinery.
However, I’m still uncertain about the best path forward. Please share
your thoughts.
[1]: /messages/by-id/CAOtHd0ApHna7_p6mvHoO+gLZdxjaQPRemg3_o0a4ytCPijLytQ@mail.gmail.com
[2]: /messages/by-id/CALj2ACVxX2ii=66RypXRweZe2EsBRiPMj0aHfRfHUeXJcC7kHg@mail.gmail.com
Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft