About to add WAL write/fsync statistics to pg_stat_wal view

Started by Masahiro Ikedaabout 5 years ago67 messages

ikedamsh@oss.nttdata.com

about 5 years ago

1 attachment(s)

Hi,

I propose to add wal write/fsync statistics to pg_stat_wal view.
It's useful not only for developing/improving source code related to WAL
but also for users to detect workload changes, HW failure, and so on.

I introduce "track_wal_io_timing" parameter and provide the following
information on pg_stat_wal view.
I separate the parameter from "track_io_timing" to "track_wal_io_timing"
because IIUC, WAL I/O activity may have a greater impact on query
performance than database I/O activity.

```
postgres=# SELECT wal_write, wal_write_time, wal_sync, wal_sync_time
FROM pg_stat_wal;
-[ RECORD 1 ]--+----
wal_write | 650 # Total number of times WAL data was written to
the disk

wal_write_time | 43 # Total amount of time that has been spent in the
portion of WAL data was written to disk
# if track-wal-io-timing is enabled, otherwise
zero

wal_sync | 78 # Total number of times WAL data was synced to the
disk

wal_sync_time | 104 # Total amount of time that has been spent in the
portion of WAL data was synced to disk
# if track-wal-io-timing is enabled, otherwise
zero
```

What do you think?
Please let me know your comments.

Regards
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachments:

0001_add_wal_io_activity_to_the_pg_stat_wal.patchtext/x-diff; name=0001_add_wal_io_activity_to_the_pg_stat_wal.patchDownload

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8cd3d690..ba923a2b 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7388,6 +7388,27 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-track-wal-io-timing" xreflabel="track_wal_io_timing">
+      <term><varname>track_wal_io_timing</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>track_wal_io_timing</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables timing of WAL I/O calls. This parameter is off by default,
+        because it will repeatedly query the operating system for
+        the current time, which may cause significant overhead on some
+        platforms.  You can use the <xref linkend="pgtesttiming"/> tool to
+        measure the overhead of timing on your system.
+        I/O timing information is
+        displayed in <link linkend="monitoring-pg-stat-wal-view">
+        <structname>pg_stat_wal</structname></link>.  Only superusers can
+        change this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-track-functions" xreflabel="track_functions">
       <term><varname>track_functions</varname> (<type>enum</type>)
       <indexterm>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 52a69a53..ce4f652d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3479,7 +3479,51 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
        <structfield>wal_buffers_full</structfield> <type>bigint</type>
       </para>
       <para>
-       Number of times WAL data was written to the disk because WAL buffers got full
+       Total number of times WAL data was written to the disk because WAL buffers got full
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Total number of times WAL data was written to the disk
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time that has been spent in the portion of
+       WAL data was written to disk, in milliseconds
+       (if <xref linkend="guc-track-wal-io-timing"/> is enabled, otherwise zero).
+       To avoid standby server's performance degradation, they don't collect
+       this statistics
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Total number of times WAL data was synced to the disk
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time that has been spent in the portion of
+       WAL data was synced to disk, in milliseconds
+       (if <xref linkend="guc-track-wal-io-timing"/> is enabled, otherwise zero).
+       To avoid standby server's performance degradation, they don't collect
+       this statistics
       </para></entry>
      </row>
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7e81ce4f..c9f33d23 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -109,6 +109,7 @@ int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
 int			max_slot_wal_keep_size_mb = -1;
+bool		track_wal_io_timing = false;
 
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
@@ -2528,6 +2529,8 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			int			written;
+			instr_time	start;
+			instr_time	duration;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
@@ -2536,9 +2539,24 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			do
 			{
 				errno = 0;
+
+				/* Measure i/o timing to write WAL data */
+				if (track_wal_io_timing)
+					INSTR_TIME_SET_CURRENT(start);
+
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 				written = pg_pwrite(openLogFile, from, nleft, startoffset);
 				pgstat_report_wait_end();
+
+				if (track_wal_io_timing)
+				{
+					INSTR_TIME_SET_CURRENT(duration);
+					INSTR_TIME_SUBTRACT(duration, start);
+					WalStats.m_wal_write_time += INSTR_TIME_GET_MILLISEC(duration);
+				}
+
+				WalStats.m_wal_write++;
+
 				if (written <= 0)
 				{
 					char		xlogfname[MAXFNAMELEN];
@@ -10433,6 +10451,27 @@ assign_xlog_sync_method(int new_sync_method, void *extra)
 	}
 }
 
+/*
+ * Check if fsync mothod is called.
+ */
+bool
+fsyncMethodCalled()
+{
+	if (!enableFsync)
+		return 0;
+
+	switch (sync_method)
+	{
+		case SYNC_METHOD_FSYNC:
+		case SYNC_METHOD_FSYNC_WRITETHROUGH:
+		case SYNC_METHOD_FDATASYNC:
+			return 1;
+		default:
+			/* others don't have a specific fsync method */
+			return 0;
+	}
+}
+
 
 /*
  * Issue appropriate kind of fsync (if any) for an XLOG output file.
@@ -10443,8 +10482,19 @@ assign_xlog_sync_method(int new_sync_method, void *extra)
 void
 issue_xlog_fsync(int fd, XLogSegNo segno)
 {
+	instr_time	start;
+	instr_time	duration;
 	char	   *msg = NULL;
 
+	/*
+	 * Measure i/o timing to fsync WAL data.
+	 *
+	 * The wal receiver skip to collect it to avoid performance degradation of standy servers.
+	 * If sync_method doesn't have its fsync method, to skip too.
+	 */
+	if (!AmWalReceiverProcess() && track_wal_io_timing && fsyncMethodCalled())
+		INSTR_TIME_SET_CURRENT(start);
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
 	switch (sync_method)
 	{
@@ -10488,6 +10538,19 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
 	}
 
 	pgstat_report_wait_end();
+
+	/* increment the i/o timing and the number of times to fsync WAL data */
+	if (fsyncMethodCalled())
+	{
+		if (!AmWalReceiverProcess() && track_wal_io_timing)
+		{
+			INSTR_TIME_SET_CURRENT(duration);
+			INSTR_TIME_SUBTRACT(duration, start);
+			WalStats.m_wal_sync_time += INSTR_TIME_GET_MILLISEC(duration);
+		}
+
+		WalStats.m_wal_sync++;
+	}
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index b140c210..da4e8139 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -997,6 +997,10 @@ CREATE VIEW pg_stat_wal AS
         w.wal_fpi,
         w.wal_bytes,
         w.wal_buffers_full,
+        w.wal_write,
+        w.wal_write_time,
+        w.wal_sync,
+        w.wal_sync_time,
         w.stats_reset
     FROM pg_stat_get_wal() w;
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 7c75a25d..4bf83e4c 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -6801,6 +6801,10 @@ pgstat_recv_wal(PgStat_MsgWal *msg, int len)
 	walStats.wal_fpi += msg->m_wal_fpi;
 	walStats.wal_bytes += msg->m_wal_bytes;
 	walStats.wal_buffers_full += msg->m_wal_buffers_full;
+	walStats.wal_write += msg->m_wal_write;
+	walStats.wal_write_time += msg->m_wal_write_time;
+	walStats.wal_sync += msg->m_wal_sync;
+	walStats.wal_sync_time += msg->m_wal_sync_time;
 }
 
 /* ----------
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index a52832fe..ce9f4b7c 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -253,6 +253,9 @@ WalWriterMain(void)
 		else if (left_till_hibernate > 0)
 			left_till_hibernate--;
 
+		/* Send WAL statistics */
+		pgstat_send_wal();
+
 		/*
 		 * Sleep until we are signaled or WalWriterDelay has elapsed.  If we
 		 * haven't done anything useful for quite some time, lengthen the
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 6afe1b6f..711a30ab 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1703,7 +1703,7 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_wal(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_COLS	5
+#define PG_STAT_GET_WAL_COLS	9
 	TupleDesc	tupdesc;
 	Datum		values[PG_STAT_GET_WAL_COLS];
 	bool		nulls[PG_STAT_GET_WAL_COLS];
@@ -1724,7 +1724,15 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 					   NUMERICOID, -1, 0);
 	TupleDescInitEntry(tupdesc, (AttrNumber) 4, "wal_buffers_full",
 					   INT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 5, "stats_reset",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 5, "wal_write",
+					   INT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 6, "wal_write_time",
+					   FLOAT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 7, "wal_sync",
+					   INT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 8, "wal_sync_time",
+					   FLOAT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 9, "stats_reset",
 					   TIMESTAMPTZOID, -1, 0);
 
 	BlessTupleDesc(tupdesc);
@@ -1744,7 +1752,17 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 									Int32GetDatum(-1));
 
 	values[3] = Int64GetDatum(wal_stats->wal_buffers_full);
-	values[4] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
+	values[4] = Int64GetDatum(wal_stats->wal_write);
+
+	/* time is already in msec, just convert to double for presentation */
+	values[5] = Float8GetDatum((double) wal_stats->wal_write_time);
+
+	values[6] = Int64GetDatum(wal_stats->wal_sync);
+
+	/* time is already in msec, just convert to double for presentation */
+	values[7] = Float8GetDatum((double) wal_stats->wal_sync_time);
+
+	values[8] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
 
 	/* Returns the record as Datum */
 	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 635d91d5..fc24745a 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1482,6 +1482,15 @@ static struct config_bool ConfigureNamesBool[] =
 		false,
 		NULL, NULL, NULL
 	},
+	{
+		{"track_wal_io_timing", PGC_SUSET, STATS_COLLECTOR,
+			gettext_noop("Collects timing statistics for WAL I/O activity."),
+			NULL
+		},
+		&track_wal_io_timing,
+		false,
+		NULL, NULL, NULL
+	},
 
 	{
 		{"update_process_title", PGC_SUSET, PROCESS_TITLE,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 9c9091e6..64da738b 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -583,6 +583,7 @@
 #track_activities = on
 #track_counts = on
 #track_io_timing = off
+#track_wal_io_timing = off
 #track_functions = none			# none, pl, all
 #track_activity_query_size = 1024	# (change requires restart)
 #stats_temp_directory = 'pg_stat_tmp'
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 221af87e..ce695708 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -131,6 +131,7 @@ extern int	recovery_min_apply_delay;
 extern char *PrimaryConnInfo;
 extern char *PrimarySlotName;
 extern bool wal_receiver_create_temp_slot;
+extern bool track_wal_io_timing;
 
 /* indirectly set via GUC system */
 extern TransactionId recoveryTargetXid;
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index fc2202b8..91673385 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5497,13 +5497,13 @@
   proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
   prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
 
-{ oid => '1136', descr => 'statistics: information about WAL activity',
-  proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
-  proparallel => 'r', prorettype => 'record', proargtypes => '',
-   proallargtypes => '{int8,int8,numeric,int8,timestamptz}',
-   proargmodes => '{o,o,o,o,o}',
-   proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,stats_reset}',
-  prosrc => 'pg_stat_get_wal' },
+ { oid => '1136', descr => 'statistics: information about WAL activity',
+   proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
+   proparallel => 'r', prorettype => 'record', proargtypes => '',
+   proallargtypes => '{int8,int8,numeric,int8,int8,float8,int8,float8,timestamptz}',
+   proargmodes => '{o,o,o,o,o,o,o,o,o}',
+   proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,wal_write,wal_write_time,wal_sync,wal_sync_time,stats_reset}',
+   prosrc => 'pg_stat_get_wal' },
 
 { oid => '2306', descr => 'statistics: information about SLRU caches',
   proname => 'pg_stat_get_slru', prorows => '100', proisstrict => 'f',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5954068d..a63619bc 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -463,6 +463,10 @@ typedef struct PgStat_MsgWal
 	PgStat_Counter m_wal_fpi;
 	uint64		m_wal_bytes;
 	PgStat_Counter m_wal_buffers_full;
+	PgStat_Counter m_wal_write;
+	PgStat_Counter m_wal_write_time;	/* accumulate times in milliseconds */
+	PgStat_Counter m_wal_sync;
+	PgStat_Counter m_wal_sync_time;	/* accumulate times in milliseconds */
 } PgStat_MsgWal;
 
 /* ----------
@@ -805,6 +809,10 @@ typedef struct PgStat_WalStats
 	PgStat_Counter wal_fpi;
 	uint64		wal_bytes;
 	PgStat_Counter wal_buffers_full;
+	PgStat_Counter wal_write;
+	PgStat_Counter wal_write_time;
+	PgStat_Counter wal_sync;
+	PgStat_Counter wal_sync_time;
 	TimestampTz stat_reset_timestamp;
 } PgStat_WalStats;
 
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 6293ab57..b3bf1216 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2142,8 +2142,12 @@ pg_stat_wal| SELECT w.wal_records,
     w.wal_fpi,
     w.wal_bytes,
     w.wal_buffers_full,
+    w.wal_write,
+    w.wal_write_time,
+    w.wal_sync,
+    w.wal_sync_time,
     w.stats_reset
-   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, stats_reset);
+   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, wal_write, wal_write_time, wal_sync, wal_sync_time, stats_reset);
 pg_stat_wal_receiver| SELECT s.pid,
     s.status,
     s.receive_start_lsn,

Li Japin

japinli@hotmail.com

about 5 years ago

In reply to: Masahiro Ikeda (#1)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

Hi,

On Dec 8, 2020, at 1:06 PM, Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:

Hi,

I propose to add wal write/fsync statistics to pg_stat_wal view.
It's useful not only for developing/improving source code related to WAL
but also for users to detect workload changes, HW failure, and so on.

I introduce "track_wal_io_timing" parameter and provide the following information on pg_stat_wal view.
I separate the parameter from "track_io_timing" to "track_wal_io_timing"
because IIUC, WAL I/O activity may have a greater impact on query performance than database I/O activity.

```
postgres=# SELECT wal_write, wal_write_time, wal_sync, wal_sync_time FROM pg_stat_wal;
-[ RECORD 1 ]--+----
wal_write | 650 # Total number of times WAL data was written to the disk

wal_write_time | 43 # Total amount of time that has been spent in the portion of WAL data was written to disk
# if track-wal-io-timing is enabled, otherwise zero

wal_sync | 78 # Total number of times WAL data was synced to the disk

wal_sync_time | 104 # Total amount of time that has been spent in the portion of WAL data was synced to disk
# if track-wal-io-timing is enabled, otherwise zero
```

What do you think?
Please let me know your comments.

Regards
--
Masahiro Ikeda
NTT DATA CORPORATION<0001_add_wal_io_activity_to_the_pg_stat_wal.patch>

There is a no previous prototype warning for ‘fsyncMethodCalled’, and it now only used in xlog.c,
should we declare with static? And this function wants a boolean as a return, should we use
true/false other than 0/1?

+/*
+ * Check if fsync mothod is called.
+ */
+bool
+fsyncMethodCalled()
+{
+       if (!enableFsync)
+               return 0;
+
+       switch (sync_method)
+       {
+               case SYNC_METHOD_FSYNC:
+               case SYNC_METHOD_FSYNC_WRITETHROUGH:
+               case SYNC_METHOD_FDATASYNC:
+                       return 1;
+               default:
+                       /* others don't have a specific fsync method */
+                       return 0;
+       }
+}
+

--
Best regards
ChengDu WenWu Information Technology Co.,Ltd.
Japin Li

Masahiro Ikeda

ikedamsh@oss.nttdata.com

about 5 years ago

In reply to: Li Japin (#2)

1 attachment(s)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2020-12-08 16:45, Li Japin wrote:

Hi,

On Dec 8, 2020, at 1:06 PM, Masahiro Ikeda <ikedamsh@oss.nttdata.com>
wrote:

Hi,

I propose to add wal write/fsync statistics to pg_stat_wal view.
It's useful not only for developing/improving source code related to
WAL
but also for users to detect workload changes, HW failure, and so on.

I introduce "track_wal_io_timing" parameter and provide the following
information on pg_stat_wal view.
I separate the parameter from "track_io_timing" to
"track_wal_io_timing"
because IIUC, WAL I/O activity may have a greater impact on query
performance than database I/O activity.

```
postgres=# SELECT wal_write, wal_write_time, wal_sync, wal_sync_time
FROM pg_stat_wal;
-[ RECORD 1 ]--+----
wal_write | 650 # Total number of times WAL data was written to
the disk

wal_write_time | 43 # Total amount of time that has been spent in
the portion of WAL data was written to disk
# if track-wal-io-timing is enabled, otherwise
zero

wal_sync | 78 # Total number of times WAL data was synced to
the disk

wal_sync_time | 104 # Total amount of time that has been spent in
the portion of WAL data was synced to disk
# if track-wal-io-timing is enabled, otherwise
zero
```

What do you think?
Please let me know your comments.

Regards
--
Masahiro Ikeda
NTT DATA
CORPORATION<0001_add_wal_io_activity_to_the_pg_stat_wal.patch>

There is a no previous prototype warning for ‘fsyncMethodCalled’, and
it now only used in xlog.c,
should we declare with static? And this function wants a boolean as a
return, should we use
true/false other than 0/1?
+/*
+ * Check if fsync mothod is called.
+ */
+bool
+fsyncMethodCalled()
+{
+       if (!enableFsync)
+               return 0;
+
+       switch (sync_method)
+       {
+               case SYNC_METHOD_FSYNC:
+               case SYNC_METHOD_FSYNC_WRITETHROUGH:
+               case SYNC_METHOD_FDATASYNC:
+                       return 1;
+               default:
+                       /* others don't have a specific fsync method */
+                       return 0;
+       }
+}
+

Thanks for your review.
I agree with your comments. I fixed them.

Regards
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachments:

0002_add_wal_io_activity_to_the_pg_stat_wal.patchtext/x-diff; name=0002_add_wal_io_activity_to_the_pg_stat_wal.patchDownload

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8cd3d690..ba923a2b 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7388,6 +7388,27 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-track-wal-io-timing" xreflabel="track_wal_io_timing">
+      <term><varname>track_wal_io_timing</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>track_wal_io_timing</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables timing of WAL I/O calls. This parameter is off by default,
+        because it will repeatedly query the operating system for
+        the current time, which may cause significant overhead on some
+        platforms.  You can use the <xref linkend="pgtesttiming"/> tool to
+        measure the overhead of timing on your system.
+        I/O timing information is
+        displayed in <link linkend="monitoring-pg-stat-wal-view">
+        <structname>pg_stat_wal</structname></link>.  Only superusers can
+        change this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-track-functions" xreflabel="track_functions">
       <term><varname>track_functions</varname> (<type>enum</type>)
       <indexterm>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 52a69a53..ce4f652d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3479,7 +3479,51 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
        <structfield>wal_buffers_full</structfield> <type>bigint</type>
       </para>
       <para>
-       Number of times WAL data was written to the disk because WAL buffers got full
+       Total number of times WAL data was written to the disk because WAL buffers got full
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Total number of times WAL data was written to the disk
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time that has been spent in the portion of
+       WAL data was written to disk, in milliseconds
+       (if <xref linkend="guc-track-wal-io-timing"/> is enabled, otherwise zero).
+       To avoid standby server's performance degradation, they don't collect
+       this statistics
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Total number of times WAL data was synced to the disk
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time that has been spent in the portion of
+       WAL data was synced to disk, in milliseconds
+       (if <xref linkend="guc-track-wal-io-timing"/> is enabled, otherwise zero).
+       To avoid standby server's performance degradation, they don't collect
+       this statistics
       </para></entry>
      </row>
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7e81ce4f..18d6ecc5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -109,6 +109,7 @@ int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
 int			max_slot_wal_keep_size_mb = -1;
+bool		track_wal_io_timing = false;
 
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
@@ -970,6 +971,8 @@ static void WALInsertLockAcquireExclusive(void);
 static void WALInsertLockRelease(void);
 static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
 
+static bool fsyncMethodCalled();
+
 /*
  * Insert an XLOG record represented by an already-constructed chain of data
  * chunks.  This is a low-level routine; to construct the WAL record header
@@ -2528,6 +2531,8 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			int			written;
+			instr_time	start;
+			instr_time	duration;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
@@ -2536,9 +2541,24 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			do
 			{
 				errno = 0;
+
+				/* Measure i/o timing to write WAL data */
+				if (track_wal_io_timing)
+					INSTR_TIME_SET_CURRENT(start);
+
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 				written = pg_pwrite(openLogFile, from, nleft, startoffset);
 				pgstat_report_wait_end();
+
+				if (track_wal_io_timing)
+				{
+					INSTR_TIME_SET_CURRENT(duration);
+					INSTR_TIME_SUBTRACT(duration, start);
+					WalStats.m_wal_write_time += INSTR_TIME_GET_MILLISEC(duration);
+				}
+
+				WalStats.m_wal_write++;
+
 				if (written <= 0)
 				{
 					char		xlogfname[MAXFNAMELEN];
@@ -10433,6 +10453,27 @@ assign_xlog_sync_method(int new_sync_method, void *extra)
 	}
 }
 
+/*
+ * Check if fsync mothod is called.
+ */
+static bool
+fsyncMethodCalled()
+{
+	if (!enableFsync)
+		return false;
+
+	switch (sync_method)
+	{
+		case SYNC_METHOD_FSYNC:
+		case SYNC_METHOD_FSYNC_WRITETHROUGH:
+		case SYNC_METHOD_FDATASYNC:
+			return true;
+		default:
+			/* others don't have a specific fsync method */
+			return false;
+	}
+}
+
 
 /*
  * Issue appropriate kind of fsync (if any) for an XLOG output file.
@@ -10443,8 +10484,19 @@ assign_xlog_sync_method(int new_sync_method, void *extra)
 void
 issue_xlog_fsync(int fd, XLogSegNo segno)
 {
+	instr_time	start;
+	instr_time	duration;
 	char	   *msg = NULL;
 
+	/*
+	 * Measure i/o timing to fsync WAL data.
+	 *
+	 * The wal receiver skip to collect it to avoid performance degradation of standy servers.
+	 * If sync_method doesn't have its fsync method, to skip too.
+	 */
+	if (!AmWalReceiverProcess() && track_wal_io_timing && fsyncMethodCalled())
+		INSTR_TIME_SET_CURRENT(start);
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
 	switch (sync_method)
 	{
@@ -10488,6 +10540,19 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
 	}
 
 	pgstat_report_wait_end();
+
+	/* increment the i/o timing and the number of times to fsync WAL data */
+	if (fsyncMethodCalled())
+	{
+		if (!AmWalReceiverProcess() && track_wal_io_timing)
+		{
+			INSTR_TIME_SET_CURRENT(duration);
+			INSTR_TIME_SUBTRACT(duration, start);
+			WalStats.m_wal_sync_time += INSTR_TIME_GET_MILLISEC(duration);
+		}
+
+		WalStats.m_wal_sync++;
+	}
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index b140c210..da4e8139 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -997,6 +997,10 @@ CREATE VIEW pg_stat_wal AS
         w.wal_fpi,
         w.wal_bytes,
         w.wal_buffers_full,
+        w.wal_write,
+        w.wal_write_time,
+        w.wal_sync,
+        w.wal_sync_time,
         w.stats_reset
     FROM pg_stat_get_wal() w;
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 7c75a25d..4bf83e4c 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -6801,6 +6801,10 @@ pgstat_recv_wal(PgStat_MsgWal *msg, int len)
 	walStats.wal_fpi += msg->m_wal_fpi;
 	walStats.wal_bytes += msg->m_wal_bytes;
 	walStats.wal_buffers_full += msg->m_wal_buffers_full;
+	walStats.wal_write += msg->m_wal_write;
+	walStats.wal_write_time += msg->m_wal_write_time;
+	walStats.wal_sync += msg->m_wal_sync;
+	walStats.wal_sync_time += msg->m_wal_sync_time;
 }
 
 /* ----------
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index a52832fe..ce9f4b7c 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -253,6 +253,9 @@ WalWriterMain(void)
 		else if (left_till_hibernate > 0)
 			left_till_hibernate--;
 
+		/* Send WAL statistics */
+		pgstat_send_wal();
+
 		/*
 		 * Sleep until we are signaled or WalWriterDelay has elapsed.  If we
 		 * haven't done anything useful for quite some time, lengthen the
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 6afe1b6f..711a30ab 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1703,7 +1703,7 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_wal(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_COLS	5
+#define PG_STAT_GET_WAL_COLS	9
 	TupleDesc	tupdesc;
 	Datum		values[PG_STAT_GET_WAL_COLS];
 	bool		nulls[PG_STAT_GET_WAL_COLS];
@@ -1724,7 +1724,15 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 					   NUMERICOID, -1, 0);
 	TupleDescInitEntry(tupdesc, (AttrNumber) 4, "wal_buffers_full",
 					   INT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 5, "stats_reset",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 5, "wal_write",
+					   INT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 6, "wal_write_time",
+					   FLOAT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 7, "wal_sync",
+					   INT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 8, "wal_sync_time",
+					   FLOAT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 9, "stats_reset",
 					   TIMESTAMPTZOID, -1, 0);
 
 	BlessTupleDesc(tupdesc);
@@ -1744,7 +1752,17 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 									Int32GetDatum(-1));
 
 	values[3] = Int64GetDatum(wal_stats->wal_buffers_full);
-	values[4] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
+	values[4] = Int64GetDatum(wal_stats->wal_write);
+
+	/* time is already in msec, just convert to double for presentation */
+	values[5] = Float8GetDatum((double) wal_stats->wal_write_time);
+
+	values[6] = Int64GetDatum(wal_stats->wal_sync);
+
+	/* time is already in msec, just convert to double for presentation */
+	values[7] = Float8GetDatum((double) wal_stats->wal_sync_time);
+
+	values[8] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
 
 	/* Returns the record as Datum */
 	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 635d91d5..fc24745a 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1482,6 +1482,15 @@ static struct config_bool ConfigureNamesBool[] =
 		false,
 		NULL, NULL, NULL
 	},
+	{
+		{"track_wal_io_timing", PGC_SUSET, STATS_COLLECTOR,
+			gettext_noop("Collects timing statistics for WAL I/O activity."),
+			NULL
+		},
+		&track_wal_io_timing,
+		false,
+		NULL, NULL, NULL
+	},
 
 	{
 		{"update_process_title", PGC_SUSET, PROCESS_TITLE,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 9c9091e6..64da738b 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -583,6 +583,7 @@
 #track_activities = on
 #track_counts = on
 #track_io_timing = off
+#track_wal_io_timing = off
 #track_functions = none			# none, pl, all
 #track_activity_query_size = 1024	# (change requires restart)
 #stats_temp_directory = 'pg_stat_tmp'
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 221af87e..ce695708 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -131,6 +131,7 @@ extern int	recovery_min_apply_delay;
 extern char *PrimaryConnInfo;
 extern char *PrimarySlotName;
 extern bool wal_receiver_create_temp_slot;
+extern bool track_wal_io_timing;
 
 /* indirectly set via GUC system */
 extern TransactionId recoveryTargetXid;
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index fc2202b8..91673385 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5497,13 +5497,13 @@
   proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
   prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
 
-{ oid => '1136', descr => 'statistics: information about WAL activity',
-  proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
-  proparallel => 'r', prorettype => 'record', proargtypes => '',
-   proallargtypes => '{int8,int8,numeric,int8,timestamptz}',
-   proargmodes => '{o,o,o,o,o}',
-   proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,stats_reset}',
-  prosrc => 'pg_stat_get_wal' },
+ { oid => '1136', descr => 'statistics: information about WAL activity',
+   proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
+   proparallel => 'r', prorettype => 'record', proargtypes => '',
+   proallargtypes => '{int8,int8,numeric,int8,int8,float8,int8,float8,timestamptz}',
+   proargmodes => '{o,o,o,o,o,o,o,o,o}',
+   proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,wal_write,wal_write_time,wal_sync,wal_sync_time,stats_reset}',
+   prosrc => 'pg_stat_get_wal' },
 
 { oid => '2306', descr => 'statistics: information about SLRU caches',
   proname => 'pg_stat_get_slru', prorows => '100', proisstrict => 'f',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5954068d..a63619bc 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -463,6 +463,10 @@ typedef struct PgStat_MsgWal
 	PgStat_Counter m_wal_fpi;
 	uint64		m_wal_bytes;
 	PgStat_Counter m_wal_buffers_full;
+	PgStat_Counter m_wal_write;
+	PgStat_Counter m_wal_write_time;	/* accumulate times in milliseconds */
+	PgStat_Counter m_wal_sync;
+	PgStat_Counter m_wal_sync_time;	/* accumulate times in milliseconds */
 } PgStat_MsgWal;
 
 /* ----------
@@ -805,6 +809,10 @@ typedef struct PgStat_WalStats
 	PgStat_Counter wal_fpi;
 	uint64		wal_bytes;
 	PgStat_Counter wal_buffers_full;
+	PgStat_Counter wal_write;
+	PgStat_Counter wal_write_time;
+	PgStat_Counter wal_sync;
+	PgStat_Counter wal_sync_time;
 	TimestampTz stat_reset_timestamp;
 } PgStat_WalStats;
 
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 6293ab57..b3bf1216 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2142,8 +2142,12 @@ pg_stat_wal| SELECT w.wal_records,
     w.wal_fpi,
     w.wal_bytes,
     w.wal_buffers_full,
+    w.wal_write,
+    w.wal_write_time,
+    w.wal_sync,
+    w.wal_sync_time,
     w.stats_reset
-   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, stats_reset);
+   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, wal_write, wal_write_time, wal_sync, wal_sync_time, stats_reset);
 pg_stat_wal_receiver| SELECT s.pid,
     s.status,
     s.receive_start_lsn,

Masahiro Ikeda

ikedamsh@oss.nttdata.com

about 5 years ago

In reply to: Masahiro Ikeda (#3)

1 attachment(s)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

Hi,

I rebased the patch to the master branch.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachments:

0003_add_wal_io_activity_to_the_pg_stat_wal.patchtext/x-diff; name=0003_add_wal_io_activity_to_the_pg_stat_wal.patchDownload

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 4b60382778..45d54cd394 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7388,6 +7388,27 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-track-wal-io-timing" xreflabel="track_wal_io_timing">
+      <term><varname>track_wal_io_timing</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>track_wal_io_timing</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables timing of WAL I/O calls. This parameter is off by default,
+        because it will repeatedly query the operating system for
+        the current time, which may cause significant overhead on some
+        platforms.  You can use the <xref linkend="pgtesttiming"/> tool to
+        measure the overhead of timing on your system.
+        I/O timing information is
+        displayed in <link linkend="monitoring-pg-stat-wal-view">
+        <structname>pg_stat_wal</structname></link>.  Only superusers can
+        change this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-track-functions" xreflabel="track_functions">
       <term><varname>track_functions</varname> (<type>enum</type>)
       <indexterm>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3d6c901306..1a5aad819f 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3479,7 +3479,51 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
        <structfield>wal_buffers_full</structfield> <type>bigint</type>
       </para>
       <para>
-       Number of times WAL data was written to disk because WAL buffers became full
+       Total number of times WAL data was written to disk because WAL buffers became full
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Total number of times WAL data was written to disk
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time that has been spent in the portion of
+       WAL data was written to disk, in milliseconds
+       (if <xref linkend="guc-track-wal-io-timing"/> is enabled, otherwise zero).
+       To avoid standby server's performance degradation, they don't collect
+       this statistics
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Total number of times WAL data was synced to disk
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time that has been spent in the portion of
+       WAL data was synced to disk, in milliseconds
+       (if <xref linkend="guc-track-wal-io-timing"/> is enabled, otherwise zero).
+       To avoid standby server's performance degradation, they don't collect
+       this statistics
       </para></entry>
      </row>
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 9867e1b403..19d101647e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -109,6 +109,7 @@ int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
 int			max_slot_wal_keep_size_mb = -1;
+bool		track_wal_io_timing = false;
 
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
@@ -980,6 +981,8 @@ static void WALInsertLockAcquireExclusive(void);
 static void WALInsertLockRelease(void);
 static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
 
+static bool fsyncMethodCalled();
+
 /*
  * Insert an XLOG record represented by an already-constructed chain of data
  * chunks.  This is a low-level routine; to construct the WAL record header
@@ -2538,6 +2541,8 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			int			written;
+			instr_time	start;
+			instr_time	duration;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
@@ -2546,9 +2551,24 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			do
 			{
 				errno = 0;
+
+				/* Measure i/o timing to write WAL data */
+				if (track_wal_io_timing)
+					INSTR_TIME_SET_CURRENT(start);
+
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 				written = pg_pwrite(openLogFile, from, nleft, startoffset);
 				pgstat_report_wait_end();
+
+				if (track_wal_io_timing)
+				{
+					INSTR_TIME_SET_CURRENT(duration);
+					INSTR_TIME_SUBTRACT(duration, start);
+					WalStats.m_wal_write_time += INSTR_TIME_GET_MILLISEC(duration);
+				}
+
+				WalStats.m_wal_write++;
+
 				if (written <= 0)
 				{
 					char		xlogfname[MAXFNAMELEN];
@@ -10489,6 +10509,27 @@ assign_xlog_sync_method(int new_sync_method, void *extra)
 	}
 }
 
+/*
+ * Check if fsync mothod is called.
+ */
+static bool
+fsyncMethodCalled()
+{
+	if (!enableFsync)
+		return false;
+
+	switch (sync_method)
+	{
+		case SYNC_METHOD_FSYNC:
+		case SYNC_METHOD_FSYNC_WRITETHROUGH:
+		case SYNC_METHOD_FDATASYNC:
+			return true;
+		default:
+			/* others don't have a specific fsync method */
+			return false;
+	}
+}
+
 
 /*
  * Issue appropriate kind of fsync (if any) for an XLOG output file.
@@ -10499,8 +10540,19 @@ assign_xlog_sync_method(int new_sync_method, void *extra)
 void
 issue_xlog_fsync(int fd, XLogSegNo segno)
 {
+	instr_time	start;
+	instr_time	duration;
 	char	   *msg = NULL;
 
+	/*
+	 * Measure i/o timing to fsync WAL data.
+	 *
+	 * The wal receiver skip to collect it to avoid performance degradation of standy servers.
+	 * If sync_method doesn't have its fsync method, to skip too.
+	 */
+	if (!AmWalReceiverProcess() && track_wal_io_timing && fsyncMethodCalled())
+		INSTR_TIME_SET_CURRENT(start);
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
 	switch (sync_method)
 	{
@@ -10544,6 +10596,19 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
 	}
 
 	pgstat_report_wait_end();
+
+	/* increment the i/o timing and the number of times to fsync WAL data */
+	if (fsyncMethodCalled())
+	{
+		if (!AmWalReceiverProcess() && track_wal_io_timing)
+		{
+			INSTR_TIME_SET_CURRENT(duration);
+			INSTR_TIME_SUBTRACT(duration, start);
+			WalStats.m_wal_sync_time += INSTR_TIME_GET_MILLISEC(duration);
+		}
+
+		WalStats.m_wal_sync++;
+	}
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index b140c210bc..da4e813915 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -997,6 +997,10 @@ CREATE VIEW pg_stat_wal AS
         w.wal_fpi,
         w.wal_bytes,
         w.wal_buffers_full,
+        w.wal_write,
+        w.wal_write_time,
+        w.wal_sync,
+        w.wal_sync_time,
         w.stats_reset
     FROM pg_stat_get_wal() w;
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index d87d9d06ee..656454fdcf 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -6795,6 +6795,10 @@ pgstat_recv_wal(PgStat_MsgWal *msg, int len)
 	walStats.wal_fpi += msg->m_wal_fpi;
 	walStats.wal_bytes += msg->m_wal_bytes;
 	walStats.wal_buffers_full += msg->m_wal_buffers_full;
+	walStats.wal_write += msg->m_wal_write;
+	walStats.wal_write_time += msg->m_wal_write_time;
+	walStats.wal_sync += msg->m_wal_sync;
+	walStats.wal_sync_time += msg->m_wal_sync_time;
 }
 
 /* ----------
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index a52832fe90..ce9f4b7cf7 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -253,6 +253,9 @@ WalWriterMain(void)
 		else if (left_till_hibernate > 0)
 			left_till_hibernate--;
 
+		/* Send WAL statistics */
+		pgstat_send_wal();
+
 		/*
 		 * Sleep until we are signaled or WalWriterDelay has elapsed.  If we
 		 * haven't done anything useful for quite some time, lengthen the
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 6afe1b6f56..711a30ab38 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1703,7 +1703,7 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_wal(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_COLS	5
+#define PG_STAT_GET_WAL_COLS	9
 	TupleDesc	tupdesc;
 	Datum		values[PG_STAT_GET_WAL_COLS];
 	bool		nulls[PG_STAT_GET_WAL_COLS];
@@ -1724,7 +1724,15 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 					   NUMERICOID, -1, 0);
 	TupleDescInitEntry(tupdesc, (AttrNumber) 4, "wal_buffers_full",
 					   INT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 5, "stats_reset",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 5, "wal_write",
+					   INT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 6, "wal_write_time",
+					   FLOAT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 7, "wal_sync",
+					   INT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 8, "wal_sync_time",
+					   FLOAT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 9, "stats_reset",
 					   TIMESTAMPTZOID, -1, 0);
 
 	BlessTupleDesc(tupdesc);
@@ -1744,7 +1752,17 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 									Int32GetDatum(-1));
 
 	values[3] = Int64GetDatum(wal_stats->wal_buffers_full);
-	values[4] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
+	values[4] = Int64GetDatum(wal_stats->wal_write);
+
+	/* time is already in msec, just convert to double for presentation */
+	values[5] = Float8GetDatum((double) wal_stats->wal_write_time);
+
+	values[6] = Int64GetDatum(wal_stats->wal_sync);
+
+	/* time is already in msec, just convert to double for presentation */
+	values[7] = Float8GetDatum((double) wal_stats->wal_sync_time);
+
+	values[8] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
 
 	/* Returns the record as Datum */
 	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 878fcc2236..ec7384c151 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1482,6 +1482,15 @@ static struct config_bool ConfigureNamesBool[] =
 		false,
 		NULL, NULL, NULL
 	},
+	{
+		{"track_wal_io_timing", PGC_SUSET, STATS_COLLECTOR,
+			gettext_noop("Collects timing statistics for WAL I/O activity."),
+			NULL
+		},
+		&track_wal_io_timing,
+		false,
+		NULL, NULL, NULL
+	},
 
 	{
 		{"update_process_title", PGC_SUSET, PROCESS_TITLE,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index b7fb2ec1fe..b430df8991 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -583,6 +583,7 @@
 #track_activities = on
 #track_counts = on
 #track_io_timing = off
+#track_wal_io_timing = off
 #track_functions = none			# none, pl, all
 #track_activity_query_size = 1024	# (change requires restart)
 #stats_temp_directory = 'pg_stat_tmp'
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 221af87e71..ce6957084a 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -131,6 +131,7 @@ extern int	recovery_min_apply_delay;
 extern char *PrimaryConnInfo;
 extern char *PrimarySlotName;
 extern bool wal_receiver_create_temp_slot;
+extern bool track_wal_io_timing;
 
 /* indirectly set via GUC system */
 extern TransactionId recoveryTargetXid;
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 22970f46cd..4ead818c58 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5497,13 +5497,13 @@
   proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
   prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
 
-{ oid => '1136', descr => 'statistics: information about WAL activity',
-  proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
-  proparallel => 'r', prorettype => 'record', proargtypes => '',
-   proallargtypes => '{int8,int8,numeric,int8,timestamptz}',
-   proargmodes => '{o,o,o,o,o}',
-   proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,stats_reset}',
-  prosrc => 'pg_stat_get_wal' },
+ { oid => '1136', descr => 'statistics: information about WAL activity',
+   proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
+   proparallel => 'r', prorettype => 'record', proargtypes => '',
+   proallargtypes => '{int8,int8,numeric,int8,int8,float8,int8,float8,timestamptz}',
+   proargmodes => '{o,o,o,o,o,o,o,o,o}',
+   proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,wal_write,wal_write_time,wal_sync,wal_sync_time,stats_reset}',
+   prosrc => 'pg_stat_get_wal' },
 
 { oid => '2306', descr => 'statistics: information about SLRU caches',
   proname => 'pg_stat_get_slru', prorows => '100', proisstrict => 'f',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5954068dec..a63619bc11 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -463,6 +463,10 @@ typedef struct PgStat_MsgWal
 	PgStat_Counter m_wal_fpi;
 	uint64		m_wal_bytes;
 	PgStat_Counter m_wal_buffers_full;
+	PgStat_Counter m_wal_write;
+	PgStat_Counter m_wal_write_time;	/* accumulate times in milliseconds */
+	PgStat_Counter m_wal_sync;
+	PgStat_Counter m_wal_sync_time;	/* accumulate times in milliseconds */
 } PgStat_MsgWal;
 
 /* ----------
@@ -805,6 +809,10 @@ typedef struct PgStat_WalStats
 	PgStat_Counter wal_fpi;
 	uint64		wal_bytes;
 	PgStat_Counter wal_buffers_full;
+	PgStat_Counter wal_write;
+	PgStat_Counter wal_write_time;
+	PgStat_Counter wal_sync;
+	PgStat_Counter wal_sync_time;
 	TimestampTz stat_reset_timestamp;
 } PgStat_WalStats;
 
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 6293ab57bc..b3bf121642 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2142,8 +2142,12 @@ pg_stat_wal| SELECT w.wal_records,
     w.wal_fpi,
     w.wal_bytes,
     w.wal_buffers_full,
+    w.wal_write,
+    w.wal_write_time,
+    w.wal_sync,
+    w.wal_sync_time,
     w.stats_reset
-   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, stats_reset);
+   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, wal_write, wal_write_time, wal_sync, wal_sync_time, stats_reset);
 pg_stat_wal_receiver| SELECT s.pid,
     s.status,
     s.receive_start_lsn,

kuroda.hayato@fujitsu.com

almost 5 years ago

In reply to: Masahiro Ikeda (#4)

RE: About to add WAL write/fsync statistics to pg_stat_wal view

Dear Ikeda-san,

This patch cannot be applied to the HEAD, but anyway I put a comment.

```
+	/*
+	 * Measure i/o timing to fsync WAL data.
+	 *
+	 * The wal receiver skip to collect it to avoid performance degradation of standy servers.
+	 * If sync_method doesn't have its fsync method, to skip too.
+	 */
+	if (!AmWalReceiverProcess() && track_wal_io_timing && fsyncMethodCalled())
+		INSTR_TIME_SET_CURRENT(start);
```

I think m_wal_sync_time should be collected even if the process is WalRecevier.
Because all wal_fsync should be recorded, and
some performance issues have been aleady occurred if track_wal_io_timing is turned on.
I think it's strange only to take care of the walrecevier case.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Masahiko Sawada

sawada.mshk@gmail.com

almost 5 years ago

In reply to: Masahiro Ikeda (#4)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:

Hi,

I rebased the patch to the master branch.

Thank you for working on this. I've read the latest patch. Here are comments:

---
+               if (track_wal_io_timing)
+               {
+                   INSTR_TIME_SET_CURRENT(duration);
+                   INSTR_TIME_SUBTRACT(duration, start);
+                   WalStats.m_wal_write_time +=
INSTR_TIME_GET_MILLISEC(duration);
+               }

* I think it should add the time in micro sec.

After running pgbench with track_wal_io_timing = on for 30 sec,
pg_stat_wal showed the following on my environment:

Since writes can complete less than a millisecond, wal_write_time
didn't increase. I think sync_time could also have the same problem.

---
+   /*
+    * Measure i/o timing to fsync WAL data.
+    *
+    * The wal receiver skip to collect it to avoid performance
degradation of standy servers.
+    * If sync_method doesn't have its fsync method, to skip too.
+    */
+   if (!AmWalReceiverProcess() && track_wal_io_timing && fsyncMethodCalled())
+       INSTR_TIME_SET_CURRENT(start);

* Why does only the wal receiver skip it even if track_wal_io_timinig
is true? I think the performance degradation is also true for backend
processes. If there is another reason for that, I think it's better to
mention in both the doc and comment.

* How about checking track_wal_io_timing first?

* s/standy/standby/

---
+   /* increment the i/o timing and the number of times to fsync WAL data */
+   if (fsyncMethodCalled())
+   {
+       if (!AmWalReceiverProcess() && track_wal_io_timing)
+       {
+           INSTR_TIME_SET_CURRENT(duration);
+           INSTR_TIME_SUBTRACT(duration, start);
+           WalStats.m_wal_sync_time += INSTR_TIME_GET_MILLISEC(duration);
+       }
+
+       WalStats.m_wal_sync++;
+   }

* I'd avoid always calling fsyncMethodCalled() in this path. How about
incrementing m_wal_sync after each sync operation?

---
+/*
+ * Check if fsync mothod is called.
+ */
+static bool
+fsyncMethodCalled()
+{
+   if (!enableFsync)
+       return false;
+
+   switch (sync_method)
+   {
+       case SYNC_METHOD_FSYNC:
+       case SYNC_METHOD_FSYNC_WRITETHROUGH:
+       case SYNC_METHOD_FDATASYNC:
+           return true;
+       default:
+           /* others don't have a specific fsync method */
+           return false;
+   }
+}

* I'm concerned that the function name could confuse the reader
because it's called even before the fsync method is called. As I
commented above, calling to fsyncMethodCalled() can be eliminated.
That way, this function is called at only once. So do we really need
this function?

* As far as I read the code, issue_xlog_fsync() seems to do fsync even
if enableFsync is false. Why does the function return false in that
case? I might be missing something.

* void is missing as argument?

* s/mothod/method/

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

Masahiro Ikeda

ikedamsh@oss.nttdata.com

almost 5 years ago

In reply to: kuroda.hayato@fujitsu.com (#5)

RE: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021-01-22 11:54, kuroda.hayato@fujitsu.com wrote:

Dear Ikeda-san,

This patch cannot be applied to the HEAD, but anyway I put a comment.
```
+	/*
+	 * Measure i/o timing to fsync WAL data.
+	 *
+	 * The wal receiver skip to collect it to avoid performance
degradation of standy servers.
+	 * If sync_method doesn't have its fsync method, to skip too.
+	 */
+	if (!AmWalReceiverProcess() && track_wal_io_timing && 
fsyncMethodCalled())
+		INSTR_TIME_SET_CURRENT(start);
```
I think m_wal_sync_time should be collected even if the process is
WalRecevier.
Because all wal_fsync should be recorded, and
some performance issues have been aleady occurred if
track_wal_io_timing is turned on.
I think it's strange only to take care of the walrecevier case.

Kuroda-san, Thanks for your comments.

Although I thought that the performance impact may be bigger in standby
servers
because WALReceiver didn't use wal buffers, it's no need to be
considered.
I agreed that if track_wal_io_timing is turned on, the primary server's
performance degradation occurs too.

I will make rebased and modified.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Masahiro Ikeda

ikedamsh@oss.nttdata.com

almost 5 years ago

In reply to: Masahiko Sawada (#6)

1 attachment(s)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021-01-22 14:50, Masahiko Sawada wrote:

On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

Hi,

I rebased the patch to the master branch.

Thank you for working on this. I've read the latest patch. Here are
comments:
---
+               if (track_wal_io_timing)
+               {
+                   INSTR_TIME_SET_CURRENT(duration);
+                   INSTR_TIME_SUBTRACT(duration, start);
+                   WalStats.m_wal_write_time +=
INSTR_TIME_GET_MILLISEC(duration);
+               }
* I think it should add the time in micro sec.
After running pgbench with track_wal_io_timing = on for 30 sec,
pg_stat_wal showed the following on my environment:

postgres(1:61569)=# select * from pg_stat_wal;
-[ RECORD 1 ]----+-----------------------------
wal_records | 285947
wal_fpi | 53285
wal_bytes | 442008213
wal_buffers_full | 0
wal_write | 25516
wal_write_time | 0
wal_sync | 25437
wal_sync_time | 14490
stats_reset | 2021-01-22 10:56:13.29464+09

Since writes can complete less than a millisecond, wal_write_time
didn't increase. I think sync_time could also have the same problem.

Thanks for your comments. I didn't notice that.
I changed the unit from milliseconds to microseconds.

---
+   /*
+    * Measure i/o timing to fsync WAL data.
+    *
+    * The wal receiver skip to collect it to avoid performance
degradation of standy servers.
+    * If sync_method doesn't have its fsync method, to skip too.
+    */
+   if (!AmWalReceiverProcess() && track_wal_io_timing && 
fsyncMethodCalled())
+       INSTR_TIME_SET_CURRENT(start);
* Why does only the wal receiver skip it even if track_wal_io_timinig
is true? I think the performance degradation is also true for backend
processes. If there is another reason for that, I think it's better to
mention in both the doc and comment.
* How about checking track_wal_io_timing first?
* s/standy/standby/

I fixed it.
As kuroda-san mentioned too, the skip is no need to be considered.

---
+   /* increment the i/o timing and the number of times to fsync WAL 
data */
+   if (fsyncMethodCalled())
+   {
+       if (!AmWalReceiverProcess() && track_wal_io_timing)
+       {
+           INSTR_TIME_SET_CURRENT(duration);
+           INSTR_TIME_SUBTRACT(duration, start);
+           WalStats.m_wal_sync_time += 
INSTR_TIME_GET_MILLISEC(duration);
+       }
+
+       WalStats.m_wal_sync++;
+   }

* I'd avoid always calling fsyncMethodCalled() in this path. How about
incrementing m_wal_sync after each sync operation?

I think if syncing the disk does not occur, m_wal_sync should not be
incremented.
It depends enableFsync and sync_method.

enableFsync is checked in each fsync method like
pg_fsync_no_writethrough(),
so if incrementing m_wal_sync after each sync operation, it should be
implemented
in each fsync method. It leads to many duplicated codes.

So, why don't you change the function to a flag whether to
sync data to the disk will be occurred or not in issue_xlog_fsync()?

---
+/*
+ * Check if fsync mothod is called.
+ */
+static bool
+fsyncMethodCalled()
+{
+   if (!enableFsync)
+       return false;
+
+   switch (sync_method)
+   {
+       case SYNC_METHOD_FSYNC:
+       case SYNC_METHOD_FSYNC_WRITETHROUGH:
+       case SYNC_METHOD_FDATASYNC:
+           return true;
+       default:
+           /* others don't have a specific fsync method */
+           return false;
+   }
+}
* I'm concerned that the function name could confuse the reader
because it's called even before the fsync method is called. As I
commented above, calling to fsyncMethodCalled() can be eliminated.
That way, this function is called at only once. So do we really need
this function?

Thanks to your comments, I removed them.

* As far as I read the code, issue_xlog_fsync() seems to do fsync even
if enableFsync is false. Why does the function return false in that
case? I might be missing something.

IIUC, the reason is that I thought that each fsync functions like
pg_fsync_no_writethrough() check enableFsync.

If this code doesn't check, m_wal_sync_time may be incremented
even though some sync methods like SYNC_METHOD_OPEN don't call to sync
some data to the disk at the time.

* void is missing as argument?

* s/mothod/method/

I removed them.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachments:

v4-0001-Add-statistics-related-to-write-sync-wal-records.patchtext/x-diff; name=v4-0001-Add-statistics-related-to-write-sync-wal-records.patchDownload

From 3db5e6fc1b68ae02b393a02b2d28f0d163d7bd03 Mon Sep 17 00:00:00 2001
From: Masahiro Ikeda <ikedamsh@oss.nttdata.com>
Date: Fri, 22 Jan 2021 21:38:31 +0900
Subject: [PATCH] Add statistics related to write/sync wal records.

This patch adds following statistics to pg_stat_wal view
to track WAL I/O activity.

- the total number of writing/syncing WAL data.
- the total amount of time that has been spent in
  writing/syncing WAL data.

Since to track I/O timing may leads significant overhead,
GUC parameter "track_wal_io_timing" is introduced.
Only if this is on, the I/O timing is measured.

Discussion: https://postgr.es/m/0509ad67b585a5b86a83d445dfa75392@oss.nttdata.com
Reviewed-By: Japin Li,Hayato Kuroda,Masahiko Sawada

(This requires a catversion bump, as well as an update to PGSTAT_FILE_FORMAT_ID)
---
 doc/src/sgml/config.sgml                      | 21 +++++++
 doc/src/sgml/monitoring.sgml                  | 42 ++++++++++++-
 src/backend/access/transam/xlog.c             | 59 ++++++++++++++++++-
 src/backend/catalog/system_views.sql          |  4 ++
 src/backend/postmaster/pgstat.c               |  4 ++
 src/backend/postmaster/walwriter.c            |  3 +
 src/backend/utils/adt/pgstatfuncs.c           | 24 +++++++-
 src/backend/utils/misc/guc.c                  |  9 +++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/access/xlog.h                     |  1 +
 src/include/catalog/pg_proc.dat               | 14 ++---
 src/include/pgstat.h                          |  8 +++
 src/test/regress/expected/rules.out           |  6 +-
 13 files changed, 183 insertions(+), 13 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 82864bbb24..43f3fbcaf8 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7416,6 +7416,27 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-track-wal-io-timing" xreflabel="track_wal_io_timing">
+      <term><varname>track_wal_io_timing</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>track_wal_io_timing</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables timing of WAL I/O calls. This parameter is off by default,
+        because it will repeatedly query the operating system for
+        the current time, which may cause significant overhead on some
+        platforms.  You can use the <xref linkend="pgtesttiming"/> tool to
+        measure the overhead of timing on your system.
+        I/O timing information is
+        displayed in <link linkend="monitoring-pg-stat-wal-view">
+        <structname>pg_stat_wal</structname></link>.  Only superusers can
+        change this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-track-functions" xreflabel="track_functions">
       <term><varname>track_functions</varname> (<type>enum</type>)
       <indexterm>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index f05140dd42..1c1342066f 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3485,7 +3485,47 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
        <structfield>wal_buffers_full</structfield> <type>bigint</type>
       </para>
       <para>
-       Number of times WAL data was written to disk because WAL buffers became full
+       Total number of times WAL data was written to disk because WAL buffers became full
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Total number of times WAL data was written to disk
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time that has been spent in the portion of
+       WAL data was written to disk, in milliseconds
+       (if <xref linkend="guc-track-wal-io-timing"/> is enabled, otherwise zero)
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Total number of times WAL data was synced to disk
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time that has been spent in the portion of
+       WAL data was synced to disk, in milliseconds
+       (if <xref linkend="guc-track-wal-io-timing"/> is enabled, otherwise zero)
       </para></entry>
      </row>
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 470e113b33..a3a4f969b7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -110,6 +110,7 @@ int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
 int			max_slot_wal_keep_size_mb = -1;
+bool		track_wal_io_timing = false;
 
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
@@ -2540,6 +2541,8 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			int			written;
+			instr_time	start;
+			instr_time	duration;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
@@ -2548,9 +2551,24 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			do
 			{
 				errno = 0;
+
+				/* Measure i/o timing to write WAL data */
+				if (track_wal_io_timing)
+					INSTR_TIME_SET_CURRENT(start);
+
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 				written = pg_pwrite(openLogFile, from, nleft, startoffset);
 				pgstat_report_wait_end();
+
+				if (track_wal_io_timing)
+				{
+					INSTR_TIME_SET_CURRENT(duration);
+					INSTR_TIME_SUBTRACT(duration, start);
+					WalStats.m_wal_write_time = INSTR_TIME_GET_MICROSEC(duration);
+				}
+
+				WalStats.m_wal_write++;
+
 				if (written <= 0)
 				{
 					char		xlogfname[MAXFNAMELEN];
@@ -10565,7 +10583,33 @@ assign_xlog_sync_method(int new_sync_method, void *extra)
 void
 issue_xlog_fsync(int fd, XLogSegNo segno)
 {
-	char	   *msg = NULL;
+	char		*msg = NULL;
+	bool		sync_called;		/* whether to sync data to the disk. */
+	instr_time	start;
+	instr_time	duration;
+
+	/* check whether to sync data to the disk is really occurred. */
+	sync_called = false;
+	if (enableFsync)
+	{
+		switch (sync_method)
+		{
+			case SYNC_METHOD_FSYNC:
+#ifdef HAVE_FSYNC_WRITETHROUGH
+			case SYNC_METHOD_FSYNC_WRITETHROUGH:
+#endif
+#ifdef HAVE_FDATASYNC
+			case SYNC_METHOD_FDATASYNC:
+#endif
+				sync_called = true;
+			default:
+				break;
+		}
+	}
+
+	/* Measure i/o timing to fsync WAL data.*/
+	if (track_wal_io_timing && sync_called)
+		INSTR_TIME_SET_CURRENT(start);
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
 	switch (sync_method)
@@ -10610,6 +10654,19 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
 	}
 
 	pgstat_report_wait_end();
+
+	if (sync_called)
+	{
+		/* increment the i/o timing and the number of times to fsync WAL data */
+		if (track_wal_io_timing)
+		{
+			INSTR_TIME_SET_CURRENT(duration);
+			INSTR_TIME_SUBTRACT(duration, start);
+			WalStats.m_wal_sync_time =  INSTR_TIME_GET_MICROSEC(duration);
+		}
+
+		WalStats.m_wal_sync++;
+	}
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fa58afd9d7..b8ace4fc41 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1004,6 +1004,10 @@ CREATE VIEW pg_stat_wal AS
         w.wal_fpi,
         w.wal_bytes,
         w.wal_buffers_full,
+        w.wal_write,
+        w.wal_write_time,
+        w.wal_sync,
+        w.wal_sync_time,
         w.stats_reset
     FROM pg_stat_get_wal() w;
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f75b52719d..987bbd058d 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -6892,6 +6892,10 @@ pgstat_recv_wal(PgStat_MsgWal *msg, int len)
 	walStats.wal_fpi += msg->m_wal_fpi;
 	walStats.wal_bytes += msg->m_wal_bytes;
 	walStats.wal_buffers_full += msg->m_wal_buffers_full;
+	walStats.wal_write += msg->m_wal_write;
+	walStats.wal_write_time += msg->m_wal_write_time;
+	walStats.wal_sync += msg->m_wal_sync;
+	walStats.wal_sync_time += msg->m_wal_sync_time;
 }
 
 /* ----------
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index 4f1a8e356b..104cba4581 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -253,6 +253,9 @@ WalWriterMain(void)
 		else if (left_till_hibernate > 0)
 			left_till_hibernate--;
 
+		/* Send WAL statistics */
+		pgstat_send_wal();
+
 		/*
 		 * Sleep until we are signaled or WalWriterDelay has elapsed.  If we
 		 * haven't done anything useful for quite some time, lengthen the
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 62bff52638..7296ef04ff 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1799,7 +1799,7 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_wal(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_COLS	5
+#define PG_STAT_GET_WAL_COLS	9
 	TupleDesc	tupdesc;
 	Datum		values[PG_STAT_GET_WAL_COLS];
 	bool		nulls[PG_STAT_GET_WAL_COLS];
@@ -1820,7 +1820,15 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 					   NUMERICOID, -1, 0);
 	TupleDescInitEntry(tupdesc, (AttrNumber) 4, "wal_buffers_full",
 					   INT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 5, "stats_reset",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 5, "wal_write",
+					   INT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 6, "wal_write_time",
+					   FLOAT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 7, "wal_sync",
+					   INT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 8, "wal_sync_time",
+					   FLOAT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 9, "stats_reset",
 					   TIMESTAMPTZOID, -1, 0);
 
 	BlessTupleDesc(tupdesc);
@@ -1840,7 +1848,17 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 									Int32GetDatum(-1));
 
 	values[3] = Int64GetDatum(wal_stats->wal_buffers_full);
-	values[4] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
+	values[4] = Int64GetDatum(wal_stats->wal_write);
+
+	/* convert counter from microsec to millisec for display */
+	values[5] = Float8GetDatum((double) wal_stats->wal_write_time / 1000.0);
+
+	values[6] = Int64GetDatum(wal_stats->wal_sync);
+
+	/* convert counter from microsec to millisec for display */
+	values[7] = Float8GetDatum((double) wal_stats->wal_sync_time / 1000.0);
+
+	values[8] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
 
 	/* Returns the record as Datum */
 	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 17579eeaca..ac6f0cd4ef 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1485,6 +1485,15 @@ static struct config_bool ConfigureNamesBool[] =
 		false,
 		NULL, NULL, NULL
 	},
+	{
+		{"track_wal_io_timing", PGC_SUSET, STATS_COLLECTOR,
+			gettext_noop("Collects timing statistics for WAL I/O activity."),
+			NULL
+		},
+		&track_wal_io_timing,
+		false,
+		NULL, NULL, NULL
+	},
 
 	{
 		{"update_process_title", PGC_SUSET, PROCESS_TITLE,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 8930a94fff..4dc79cf822 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -585,6 +585,7 @@
 #track_activities = on
 #track_counts = on
 #track_io_timing = off
+#track_wal_io_timing = off
 #track_functions = none			# none, pl, all
 #track_activity_query_size = 1024	# (change requires restart)
 #stats_temp_directory = 'pg_stat_tmp'
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 75ec1073bd..1e53d9d4ca 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -131,6 +131,7 @@ extern int	recovery_min_apply_delay;
 extern char *PrimaryConnInfo;
 extern char *PrimarySlotName;
 extern bool wal_receiver_create_temp_slot;
+extern bool track_wal_io_timing;
 
 /* indirectly set via GUC system */
 extern TransactionId recoveryTargetXid;
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index b5f52d4e4a..9fe8a72105 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5531,13 +5531,13 @@
   proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
   prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
 
-{ oid => '1136', descr => 'statistics: information about WAL activity',
-  proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
-  proparallel => 'r', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int8,int8,numeric,int8,timestamptz}',
-  proargmodes => '{o,o,o,o,o}',
-  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,stats_reset}',
-  prosrc => 'pg_stat_get_wal' },
+ { oid => '1136', descr => 'statistics: information about WAL activity',
+   proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
+   proparallel => 'r', prorettype => 'record', proargtypes => '',
+   proallargtypes => '{int8,int8,numeric,int8,int8,float8,int8,float8,timestamptz}',
+   proargmodes => '{o,o,o,o,o,o,o,o,o}',
+   proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,wal_write,wal_write_time,wal_sync,wal_sync_time,stats_reset}',
+   prosrc => 'pg_stat_get_wal' },
 
 { oid => '2306', descr => 'statistics: information about SLRU caches',
   proname => 'pg_stat_get_slru', prorows => '100', proisstrict => 'f',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 724068cf87..e689d27480 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -474,6 +474,10 @@ typedef struct PgStat_MsgWal
 	PgStat_Counter m_wal_fpi;
 	uint64		m_wal_bytes;
 	PgStat_Counter m_wal_buffers_full;
+	PgStat_Counter m_wal_write;
+	PgStat_Counter m_wal_write_time;	/* time spend writing wal records in micro seconds */
+	PgStat_Counter m_wal_sync;
+	PgStat_Counter m_wal_sync_time;		/* time spend syncing wal records in micro seconds */
 } PgStat_MsgWal;
 
 /* ----------
@@ -839,6 +843,10 @@ typedef struct PgStat_WalStats
 	PgStat_Counter wal_fpi;
 	uint64		wal_bytes;
 	PgStat_Counter wal_buffers_full;
+	PgStat_Counter wal_write;
+	PgStat_Counter wal_write_time;
+	PgStat_Counter wal_sync;
+	PgStat_Counter wal_sync_time;
 	TimestampTz stat_reset_timestamp;
 } PgStat_WalStats;
 
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 6173473de9..bc3909fd17 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2158,8 +2158,12 @@ pg_stat_wal| SELECT w.wal_records,
     w.wal_fpi,
     w.wal_bytes,
     w.wal_buffers_full,
+    w.wal_write,
+    w.wal_write_time,
+    w.wal_sync,
+    w.wal_sync_time,
     w.stats_reset
-   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, stats_reset);
+   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, wal_write, wal_write_time, wal_sync, wal_sync_time, stats_reset);
 pg_stat_wal_receiver| SELECT s.pid,
     s.status,
     s.receive_start_lsn,
-- 
2.25.1

japin

japinli@hotmail.com

almost 5 years ago

In reply to: Masahiro Ikeda (#8)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

Hi, Masahiro

Thanks for you update the v4 patch. Here are some comments:

(1)
+       char            *msg = NULL;
+       bool            sync_called;            /* whether to sync data to the disk. */
+       instr_time      start;
+       instr_time      duration;
+
+       /* check whether to sync data to the disk is really occurred. */
+       sync_called = false;

Maybe we can initialize the "sync_called" variable when declare it.

(2)
+       if (sync_called)
+       {
+               /* increment the i/o timing and the number of times to fsync WAL data */
+               if (track_wal_io_timing)
+               {
+                       INSTR_TIME_SET_CURRENT(duration);
+                       INSTR_TIME_SUBTRACT(duration, start);
+                       WalStats.m_wal_sync_time =  INSTR_TIME_GET_MICROSEC(duration);
+               }
+
+               WalStats.m_wal_sync++;
+       }

There is an extra space before INSTR_TIME_GET_MICROSEC(duration).

In the issue_xlog_fsync(), the comment says that if sync_method is
SYNC_METHOD_OPEN or SYNC_METHOD_OPEN_DSYNC, it already write synced.
Does that mean it synced when write the WAL data? And for those cases, we
cannot get accurate write/sync timing and number of write/sync times, right?

case SYNC_METHOD_OPEN:
case SYNC_METHOD_OPEN_DSYNC:
/* write synced it already */
break;

On Fri, 22 Jan 2021 at 21:05, Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:

On 2021-01-22 14:50, Masahiko Sawada wrote:
On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

Hi,

I rebased the patch to the master branch.

Thank you for working on this. I've read the latest patch. Here are
comments:
---
+               if (track_wal_io_timing)
+               {
+                   INSTR_TIME_SET_CURRENT(duration);
+                   INSTR_TIME_SUBTRACT(duration, start);
+                   WalStats.m_wal_write_time +=
INSTR_TIME_GET_MILLISEC(duration);
+               }
* I think it should add the time in micro sec.
After running pgbench with track_wal_io_timing = on for 30 sec,
pg_stat_wal showed the following on my environment:

postgres(1:61569)=# select * from pg_stat_wal;
-[ RECORD 1 ]----+-----------------------------
wal_records | 285947
wal_fpi | 53285
wal_bytes | 442008213
wal_buffers_full | 0
wal_write | 25516
wal_write_time | 0
wal_sync | 25437
wal_sync_time | 14490
stats_reset | 2021-01-22 10:56:13.29464+09

Since writes can complete less than a millisecond, wal_write_time
didn't increase. I think sync_time could also have the same problem.
Thanks for your comments. I didn't notice that.
I changed the unit from milliseconds to microseconds.
---
+   /*
+    * Measure i/o timing to fsync WAL data.
+    *
+    * The wal receiver skip to collect it to avoid performance
degradation of standy servers.
+    * If sync_method doesn't have its fsync method, to skip too.
+    */
+   if (!AmWalReceiverProcess() && track_wal_io_timing && 
fsyncMethodCalled())
+       INSTR_TIME_SET_CURRENT(start);
* Why does only the wal receiver skip it even if track_wal_io_timinig
is true? I think the performance degradation is also true for backend
processes. If there is another reason for that, I think it's better to
mention in both the doc and comment.
* How about checking track_wal_io_timing first?
* s/standy/standby/
I fixed it.
As kuroda-san mentioned too, the skip is no need to be considered.
---
+   /* increment the i/o timing and the number of times to fsync WAL 
data */
+   if (fsyncMethodCalled())
+   {
+       if (!AmWalReceiverProcess() && track_wal_io_timing)
+       {
+           INSTR_TIME_SET_CURRENT(duration);
+           INSTR_TIME_SUBTRACT(duration, start);
+           WalStats.m_wal_sync_time += 
INSTR_TIME_GET_MILLISEC(duration);
+       }
+
+       WalStats.m_wal_sync++;
+   }
* I'd avoid always calling fsyncMethodCalled() in this path. How about
incrementing m_wal_sync after each sync operation?
I think if syncing the disk does not occur, m_wal_sync should not be
incremented.
It depends enableFsync and sync_method.

enableFsync is checked in each fsync method like
pg_fsync_no_writethrough(),
so if incrementing m_wal_sync after each sync operation, it should be
implemented
in each fsync method. It leads to many duplicated codes.

So, why don't you change the function to a flag whether to
sync data to the disk will be occurred or not in issue_xlog_fsync()?
---
+/*
+ * Check if fsync mothod is called.
+ */
+static bool
+fsyncMethodCalled()
+{
+   if (!enableFsync)
+       return false;
+
+   switch (sync_method)
+   {
+       case SYNC_METHOD_FSYNC:
+       case SYNC_METHOD_FSYNC_WRITETHROUGH:
+       case SYNC_METHOD_FDATASYNC:
+           return true;
+       default:
+           /* others don't have a specific fsync method */
+           return false;
+   }
+}
* I'm concerned that the function name could confuse the reader
because it's called even before the fsync method is called. As I
commented above, calling to fsyncMethodCalled() can be eliminated.
That way, this function is called at only once. So do we really need
this function?
Thanks to your comments, I removed them.

* As far as I read the code, issue_xlog_fsync() seems to do fsync even
if enableFsync is false. Why does the function return false in that
case? I might be missing something.

IIUC, the reason is that I thought that each fsync functions like
pg_fsync_no_writethrough() check enableFsync.

If this code doesn't check, m_wal_sync_time may be incremented
even though some sync methods like SYNC_METHOD_OPEN don't call to sync
some data to the disk at the time.

* void is missing as argument?

* s/mothod/method/

I removed them.

Regards,

--
Regrads,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.

#10

Masahiro Ikeda

ikedamsh@oss.nttdata.com

almost 5 years ago

In reply to: japin (#9)

1 attachment(s)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

Hi, Japin

Thanks for your comments.

On 2021-01-23 01:46, japin wrote:

Hi, Masahiro

Thanks for you update the v4 patch. Here are some comments:
(1)
+       char            *msg = NULL;
+       bool            sync_called;            /* whether to sync
data to the disk. */
+       instr_time      start;
+       instr_time      duration;
+
+       /* check whether to sync data to the disk is really occurred. 
*/
+       sync_called = false;
Maybe we can initialize the "sync_called" variable when declare it.

Yes, I fixed it.

(2)
+       if (sync_called)
+       {
+               /* increment the i/o timing and the number of times to
fsync WAL data */
+               if (track_wal_io_timing)
+               {
+                       INSTR_TIME_SET_CURRENT(duration);
+                       INSTR_TIME_SUBTRACT(duration, start);
+                       WalStats.m_wal_sync_time =
INSTR_TIME_GET_MICROSEC(duration);
+               }
+
+               WalStats.m_wal_sync++;
+       }

There is an extra space before INSTR_TIME_GET_MICROSEC(duration).

Yes, I removed it.

In the issue_xlog_fsync(), the comment says that if sync_method is
SYNC_METHOD_OPEN or SYNC_METHOD_OPEN_DSYNC, it already write synced.
Does that mean it synced when write the WAL data? And for those cases,
we
cannot get accurate write/sync timing and number of write/sync times,
right?

case SYNC_METHOD_OPEN:
case SYNC_METHOD_OPEN_DSYNC:
/* write synced it already */
break;

Yes, I add the following comments in the document.

@@ -3515,6 +3515,9 @@ SELECT pid, wait_event_type, wait_event FROM 
pg_stat_activity WHERE wait_event i
        </para>
        <para>
         Total number of times WAL data was synced to disk
+       (if <xref linkend="guc-wal-sync-method"/> is 
<literal>open_datasync</literal> or
+       <literal>open_sync</literal>, this value is zero because WAL 
data is synced
+       when to write it).
        </para></entry>
       </row>

@@ -3525,7 +3528,10 @@ SELECT pid, wait_event_type, wait_event FROM 
pg_stat_activity WHERE wait_event i
        <para>
         Total amount of time that has been spent in the portion of
         WAL data was synced to disk, in milliseconds
-       (if <xref linkend="guc-track-wal-io-timing"/> is enabled, 
otherwise zero)
+       (if <xref linkend="guc-track-wal-io-timing"/> is enabled, 
otherwise zero.
+       if <xref linkend="guc-wal-sync-method"/> is 
<literal>open_datasync</literal> or
+       <literal>open_sync</literal>, this value is zero too because WAL 
data is synced
+       when to write it).
        </para></entry>
       </row>

I attached a modified patch.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachments:

v5-0001-Add-statistics-related-to-write-sync-wal-records.patchtext/x-diff; charset=us-ascii; name=v5-0001-Add-statistics-related-to-write-sync-wal-records.patchDownload

From ee1b7d17391b9d9619f709afeacdd118973471d6 Mon Sep 17 00:00:00 2001
From: Masahiro Ikeda <ikedamsh@oss.nttdata.com>
Date: Fri, 22 Jan 2021 21:38:31 +0900
Subject: [PATCH] Add statistics related to write/sync wal records.

This patch adds following statistics to pg_stat_wal view
to track WAL I/O activity.

- the total number of writing/syncing WAL data.
- the total amount of time that has been spent in
  writing/syncing WAL data.

Since to track I/O timing may leads significant overhead,
GUC parameter "track_wal_io_timing" is introduced.
Only if this is on, the I/O timing is measured.

The statistics related to sync are zero when "wal_sync_method"
is "open_datasync" or "open_sync", because it leads syncing
WAL data at same time when to write it.

Author: Masahiro Ikeda
Reviewed-By: Japin Li, Hayato Kuroda, Masahiko Sawada
Discussion: https://postgr.es/m/0509ad67b585a5b86a83d445dfa75392@oss.nttdata.com

(This requires a catversion bump, as well as an update to PGSTAT_FILE_FORMAT_ID)
---
 doc/src/sgml/config.sgml                      | 21 +++++++
 doc/src/sgml/monitoring.sgml                  | 48 ++++++++++++++-
 src/backend/access/transam/xlog.c             | 59 ++++++++++++++++++-
 src/backend/catalog/system_views.sql          |  4 ++
 src/backend/postmaster/pgstat.c               |  4 ++
 src/backend/postmaster/walwriter.c            |  3 +
 src/backend/utils/adt/pgstatfuncs.c           | 24 +++++++-
 src/backend/utils/misc/guc.c                  |  9 +++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/access/xlog.h                     |  1 +
 src/include/catalog/pg_proc.dat               | 14 ++---
 src/include/pgstat.h                          |  8 +++
 src/test/regress/expected/rules.out           |  6 +-
 13 files changed, 189 insertions(+), 13 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 82864bbb24..43f3fbcaf8 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7416,6 +7416,27 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-track-wal-io-timing" xreflabel="track_wal_io_timing">
+      <term><varname>track_wal_io_timing</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>track_wal_io_timing</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables timing of WAL I/O calls. This parameter is off by default,
+        because it will repeatedly query the operating system for
+        the current time, which may cause significant overhead on some
+        platforms.  You can use the <xref linkend="pgtesttiming"/> tool to
+        measure the overhead of timing on your system.
+        I/O timing information is
+        displayed in <link linkend="monitoring-pg-stat-wal-view">
+        <structname>pg_stat_wal</structname></link>.  Only superusers can
+        change this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-track-functions" xreflabel="track_functions">
       <term><varname>track_functions</varname> (<type>enum</type>)
       <indexterm>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index f05140dd42..36764dc64f 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3485,7 +3485,53 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
        <structfield>wal_buffers_full</structfield> <type>bigint</type>
       </para>
       <para>
-       Number of times WAL data was written to disk because WAL buffers became full
+       Total number of times WAL data was written to disk because WAL buffers became full
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Total number of times WAL data was written to disk
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time that has been spent in the portion of
+       WAL data was written to disk, in milliseconds
+       (if <xref linkend="guc-track-wal-io-timing"/> is enabled, otherwise zero).
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Total number of times WAL data was synced to disk
+       (if <xref linkend="guc-wal-sync-method"/> is <literal>open_datasync</literal> or 
+       <literal>open_sync</literal>, this value is zero because WAL data is synced 
+       when to write it).
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time that has been spent in the portion of
+       WAL data was synced to disk, in milliseconds
+       (if <xref linkend="guc-track-wal-io-timing"/> is enabled, otherwise zero.
+       if <xref linkend="guc-wal-sync-method"/> is <literal>open_datasync</literal> or 
+       <literal>open_sync</literal>, this value is zero too because WAL data is synced 
+       when to write it).
       </para></entry>
      </row>
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 470e113b33..b3890f97a2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -110,6 +110,7 @@ int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
 int			max_slot_wal_keep_size_mb = -1;
+bool		track_wal_io_timing = false;
 
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
@@ -2540,6 +2541,8 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			int			written;
+			instr_time	start;
+			instr_time	duration;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
@@ -2548,9 +2551,24 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			do
 			{
 				errno = 0;
+
+				/* Measure i/o timing to write WAL data */
+				if (track_wal_io_timing)
+					INSTR_TIME_SET_CURRENT(start);
+
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 				written = pg_pwrite(openLogFile, from, nleft, startoffset);
 				pgstat_report_wait_end();
+
+				if (track_wal_io_timing)
+				{
+					INSTR_TIME_SET_CURRENT(duration);
+					INSTR_TIME_SUBTRACT(duration, start);
+					WalStats.m_wal_write_time = INSTR_TIME_GET_MICROSEC(duration);
+				}
+
+				WalStats.m_wal_write++;
+
 				if (written <= 0)
 				{
 					char		xlogfname[MAXFNAMELEN];
@@ -10565,7 +10583,33 @@ assign_xlog_sync_method(int new_sync_method, void *extra)
 void
 issue_xlog_fsync(int fd, XLogSegNo segno)
 {
-	char	   *msg = NULL;
+	char		*msg = NULL;
+	bool		sync_called = false;	/* whether to sync data right now. */
+	instr_time	start;
+	instr_time	duration;
+
+	/* check whether to sync WAL data to the disk right now. */
+	if (enableFsync)
+	{
+		switch (sync_method)
+		{
+			case SYNC_METHOD_FSYNC:
+#ifdef HAVE_FSYNC_WRITETHROUGH
+			case SYNC_METHOD_FSYNC_WRITETHROUGH:
+#endif
+#ifdef HAVE_FDATASYNC
+			case SYNC_METHOD_FDATASYNC:
+#endif
+				sync_called = true;
+			default:
+				/* other method synced data already or it's not unrecognized. */
+				break;
+		}
+	}
+
+	/* Measure i/o timing to sync WAL data.*/
+	if (track_wal_io_timing && sync_called)
+		INSTR_TIME_SET_CURRENT(start);
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
 	switch (sync_method)
@@ -10610,6 +10654,19 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
 	}
 
 	pgstat_report_wait_end();
+
+	if (sync_called)
+	{
+		/* increment the i/o timing and the number of times to fsync WAL data */
+		if (track_wal_io_timing)
+		{
+			INSTR_TIME_SET_CURRENT(duration);
+			INSTR_TIME_SUBTRACT(duration, start);
+			WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
+		}
+
+		WalStats.m_wal_sync++;
+	}
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fa58afd9d7..b8ace4fc41 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1004,6 +1004,10 @@ CREATE VIEW pg_stat_wal AS
         w.wal_fpi,
         w.wal_bytes,
         w.wal_buffers_full,
+        w.wal_write,
+        w.wal_write_time,
+        w.wal_sync,
+        w.wal_sync_time,
         w.stats_reset
     FROM pg_stat_get_wal() w;
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f75b52719d..987bbd058d 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -6892,6 +6892,10 @@ pgstat_recv_wal(PgStat_MsgWal *msg, int len)
 	walStats.wal_fpi += msg->m_wal_fpi;
 	walStats.wal_bytes += msg->m_wal_bytes;
 	walStats.wal_buffers_full += msg->m_wal_buffers_full;
+	walStats.wal_write += msg->m_wal_write;
+	walStats.wal_write_time += msg->m_wal_write_time;
+	walStats.wal_sync += msg->m_wal_sync;
+	walStats.wal_sync_time += msg->m_wal_sync_time;
 }
 
 /* ----------
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index 4f1a8e356b..104cba4581 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -253,6 +253,9 @@ WalWriterMain(void)
 		else if (left_till_hibernate > 0)
 			left_till_hibernate--;
 
+		/* Send WAL statistics */
+		pgstat_send_wal();
+
 		/*
 		 * Sleep until we are signaled or WalWriterDelay has elapsed.  If we
 		 * haven't done anything useful for quite some time, lengthen the
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 62bff52638..7296ef04ff 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1799,7 +1799,7 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_wal(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_COLS	5
+#define PG_STAT_GET_WAL_COLS	9
 	TupleDesc	tupdesc;
 	Datum		values[PG_STAT_GET_WAL_COLS];
 	bool		nulls[PG_STAT_GET_WAL_COLS];
@@ -1820,7 +1820,15 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 					   NUMERICOID, -1, 0);
 	TupleDescInitEntry(tupdesc, (AttrNumber) 4, "wal_buffers_full",
 					   INT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 5, "stats_reset",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 5, "wal_write",
+					   INT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 6, "wal_write_time",
+					   FLOAT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 7, "wal_sync",
+					   INT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 8, "wal_sync_time",
+					   FLOAT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 9, "stats_reset",
 					   TIMESTAMPTZOID, -1, 0);
 
 	BlessTupleDesc(tupdesc);
@@ -1840,7 +1848,17 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 									Int32GetDatum(-1));
 
 	values[3] = Int64GetDatum(wal_stats->wal_buffers_full);
-	values[4] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
+	values[4] = Int64GetDatum(wal_stats->wal_write);
+
+	/* convert counter from microsec to millisec for display */
+	values[5] = Float8GetDatum((double) wal_stats->wal_write_time / 1000.0);
+
+	values[6] = Int64GetDatum(wal_stats->wal_sync);
+
+	/* convert counter from microsec to millisec for display */
+	values[7] = Float8GetDatum((double) wal_stats->wal_sync_time / 1000.0);
+
+	values[8] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
 
 	/* Returns the record as Datum */
 	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 17579eeaca..ac6f0cd4ef 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1485,6 +1485,15 @@ static struct config_bool ConfigureNamesBool[] =
 		false,
 		NULL, NULL, NULL
 	},
+	{
+		{"track_wal_io_timing", PGC_SUSET, STATS_COLLECTOR,
+			gettext_noop("Collects timing statistics for WAL I/O activity."),
+			NULL
+		},
+		&track_wal_io_timing,
+		false,
+		NULL, NULL, NULL
+	},
 
 	{
 		{"update_process_title", PGC_SUSET, PROCESS_TITLE,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 8930a94fff..4dc79cf822 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -585,6 +585,7 @@
 #track_activities = on
 #track_counts = on
 #track_io_timing = off
+#track_wal_io_timing = off
 #track_functions = none			# none, pl, all
 #track_activity_query_size = 1024	# (change requires restart)
 #stats_temp_directory = 'pg_stat_tmp'
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 75ec1073bd..1e53d9d4ca 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -131,6 +131,7 @@ extern int	recovery_min_apply_delay;
 extern char *PrimaryConnInfo;
 extern char *PrimarySlotName;
 extern bool wal_receiver_create_temp_slot;
+extern bool track_wal_io_timing;
 
 /* indirectly set via GUC system */
 extern TransactionId recoveryTargetXid;
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index b5f52d4e4a..9fe8a72105 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5531,13 +5531,13 @@
   proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
   prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
 
-{ oid => '1136', descr => 'statistics: information about WAL activity',
-  proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
-  proparallel => 'r', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int8,int8,numeric,int8,timestamptz}',
-  proargmodes => '{o,o,o,o,o}',
-  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,stats_reset}',
-  prosrc => 'pg_stat_get_wal' },
+ { oid => '1136', descr => 'statistics: information about WAL activity',
+   proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
+   proparallel => 'r', prorettype => 'record', proargtypes => '',
+   proallargtypes => '{int8,int8,numeric,int8,int8,float8,int8,float8,timestamptz}',
+   proargmodes => '{o,o,o,o,o,o,o,o,o}',
+   proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,wal_write,wal_write_time,wal_sync,wal_sync_time,stats_reset}',
+   prosrc => 'pg_stat_get_wal' },
 
 { oid => '2306', descr => 'statistics: information about SLRU caches',
   proname => 'pg_stat_get_slru', prorows => '100', proisstrict => 'f',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 724068cf87..e689d27480 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -474,6 +474,10 @@ typedef struct PgStat_MsgWal
 	PgStat_Counter m_wal_fpi;
 	uint64		m_wal_bytes;
 	PgStat_Counter m_wal_buffers_full;
+	PgStat_Counter m_wal_write;
+	PgStat_Counter m_wal_write_time;	/* time spend writing wal records in micro seconds */
+	PgStat_Counter m_wal_sync;
+	PgStat_Counter m_wal_sync_time;		/* time spend syncing wal records in micro seconds */
 } PgStat_MsgWal;
 
 /* ----------
@@ -839,6 +843,10 @@ typedef struct PgStat_WalStats
 	PgStat_Counter wal_fpi;
 	uint64		wal_bytes;
 	PgStat_Counter wal_buffers_full;
+	PgStat_Counter wal_write;
+	PgStat_Counter wal_write_time;
+	PgStat_Counter wal_sync;
+	PgStat_Counter wal_sync_time;
 	TimestampTz stat_reset_timestamp;
 } PgStat_WalStats;
 
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 6173473de9..bc3909fd17 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2158,8 +2158,12 @@ pg_stat_wal| SELECT w.wal_records,
     w.wal_fpi,
     w.wal_bytes,
     w.wal_buffers_full,
+    w.wal_write,
+    w.wal_write_time,
+    w.wal_sync,
+    w.wal_sync_time,
     w.stats_reset
-   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, stats_reset);
+   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, wal_write, wal_write_time, wal_sync, wal_sync_time, stats_reset);
 pg_stat_wal_receiver| SELECT s.pid,
     s.status,
     s.receive_start_lsn,
-- 
2.25.1

#11

kuroda.hayato@fujitsu.com

almost 5 years ago

In reply to: Masahiro Ikeda (#10)

RE: About to add WAL write/fsync statistics to pg_stat_wal view

Dear Ikeda-san,

Thank you for updating the patch. This can be applied to master, and
can be used on my RHEL7.
wal_write_time and wal_sync_time increase normally :-).

I put a further comment:

```
@@ -3485,7 +3485,53 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
        <structfield>wal_buffers_full</structfield> <type>bigint</type>
       </para>
       <para>
-       Number of times WAL data was written to disk because WAL buffers became full
+       Total number of times WAL data was written to disk because WAL buffers became full
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Total number of times WAL data was written to disk
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time that has been spent in the portion of
+       WAL data was written to disk, in milliseconds
+       (if <xref linkend="guc-track-wal-io-timing"/> is enabled, otherwise zero).
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Total number of times WAL data was synced to disk
+       (if <xref linkend="guc-wal-sync-method"/> is <literal>open_datasync</literal> or 
+       <literal>open_sync</literal>, this value is zero because WAL data is synced 
+       when to write it).
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time that has been spent in the portion of
+       WAL data was synced to disk, in milliseconds
+       (if <xref linkend="guc-track-wal-io-timing"/> is enabled, otherwise zero.
+       if <xref linkend="guc-wal-sync-method"/> is <literal>open_datasync</literal> or 
+       <literal>open_sync</literal>, this value is zero too because WAL data is synced 
+       when to write it).
       </para></entry>
      </row>
 ```

Maybe "Total amount of time" should be used, not "Total number of time."
Other views use "amount."

I have no comments anymore.

Hayato Kuroda
FUJITSU LIMITED

#12

Masahiko Sawada

sawada.mshk@gmail.com

almost 5 years ago

In reply to: Masahiro Ikeda (#8)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On Fri, Jan 22, 2021 at 10:05 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

On 2021-01-22 14:50, Masahiko Sawada wrote:
On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

Hi,

I rebased the patch to the master branch.

Thank you for working on this. I've read the latest patch. Here are
comments:
---
+               if (track_wal_io_timing)
+               {
+                   INSTR_TIME_SET_CURRENT(duration);
+                   INSTR_TIME_SUBTRACT(duration, start);
+                   WalStats.m_wal_write_time +=
INSTR_TIME_GET_MILLISEC(duration);
+               }
* I think it should add the time in micro sec.
After running pgbench with track_wal_io_timing = on for 30 sec,
pg_stat_wal showed the following on my environment:

postgres(1:61569)=# select * from pg_stat_wal;
-[ RECORD 1 ]----+-----------------------------
wal_records | 285947
wal_fpi | 53285
wal_bytes | 442008213
wal_buffers_full | 0
wal_write | 25516
wal_write_time | 0
wal_sync | 25437
wal_sync_time | 14490
stats_reset | 2021-01-22 10:56:13.29464+09

Since writes can complete less than a millisecond, wal_write_time
didn't increase. I think sync_time could also have the same problem.
Thanks for your comments. I didn't notice that.
I changed the unit from milliseconds to microseconds.
---
+   /*
+    * Measure i/o timing to fsync WAL data.
+    *
+    * The wal receiver skip to collect it to avoid performance
degradation of standy servers.
+    * If sync_method doesn't have its fsync method, to skip too.
+    */
+   if (!AmWalReceiverProcess() && track_wal_io_timing &&
fsyncMethodCalled())
+       INSTR_TIME_SET_CURRENT(start);
* Why does only the wal receiver skip it even if track_wal_io_timinig
is true? I think the performance degradation is also true for backend
processes. If there is another reason for that, I think it's better to
mention in both the doc and comment.
* How about checking track_wal_io_timing first?
* s/standy/standby/
I fixed it.
As kuroda-san mentioned too, the skip is no need to be considered.

I think you also removed the code to have the wal receiver report the
stats. So with the latest patch, the wal receiver tracks those
statistics but doesn't report.

And maybe XLogWalRcvWrite() also needs to track I/O?

---
+   /* increment the i/o timing and the number of times to fsync WAL
data */
+   if (fsyncMethodCalled())
+   {
+       if (!AmWalReceiverProcess() && track_wal_io_timing)
+       {
+           INSTR_TIME_SET_CURRENT(duration);
+           INSTR_TIME_SUBTRACT(duration, start);
+           WalStats.m_wal_sync_time +=
INSTR_TIME_GET_MILLISEC(duration);
+       }
+
+       WalStats.m_wal_sync++;
+   }
* I'd avoid always calling fsyncMethodCalled() in this path. How about
incrementing m_wal_sync after each sync operation?
I think if syncing the disk does not occur, m_wal_sync should not be
incremented.
It depends enableFsync and sync_method.

enableFsync is checked in each fsync method like
pg_fsync_no_writethrough(),
so if incrementing m_wal_sync after each sync operation, it should be
implemented
in each fsync method. It leads to many duplicated codes.

Right. I missed that each fsync function checks enableFsync.

So, why don't you change the function to a flag whether to
sync data to the disk will be occurred or not in issue_xlog_fsync()?

Looks better. Since we don't necessarily need to increment m_wal_sync
after doing fsync we can write the code without an additional variable
as follows:

if (enableFsync)
{
switch (sync_method)
{
case SYNC_METHOD_FSYNC:
#ifdef HAVE_FSYNC_WRITETHROUGH
case SYNC_METHOD_FSYNC_WRITETHROUGH:
#endif
#ifdef HAVE_FDATASYNC
case SYNC_METHOD_FDATASYNC:
#endif
WalStats.m_wal_sync++;
if (track_wal_io_timing)
INSTR_TIME_SET_CURRENT(start);
break;
default:
break;
}
}

(do fsync and error handling here)

/* increment the i/o timing and the number of times to fsync WAL data */
if (track_wal_io_timing)
{
INSTR_TIME_SET_CURRENT(duration);
INSTR_TIME_SUBTRACT(duration, start);
WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
}

I think we can change the first switch-case to an if statement.

* As far as I read the code, issue_xlog_fsync() seems to do fsync even
if enableFsync is false. Why does the function return false in that
case? I might be missing something.

IIUC, the reason is that I thought that each fsync functions like
pg_fsync_no_writethrough() check enableFsync.

If this code doesn't check, m_wal_sync_time may be incremented
even though some sync methods like SYNC_METHOD_OPEN don't call to sync
some data to the disk at the time.

Right.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#13

japin

japinli@hotmail.com

almost 5 years ago

In reply to: Masahiko Sawada (#12)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On Mon, 25 Jan 2021 at 09:36, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Jan 22, 2021 at 10:05 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:
On 2021-01-22 14:50, Masahiko Sawada wrote:
On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

Hi,

I rebased the patch to the master branch.

Thank you for working on this. I've read the latest patch. Here are
comments:
---
+               if (track_wal_io_timing)
+               {
+                   INSTR_TIME_SET_CURRENT(duration);
+                   INSTR_TIME_SUBTRACT(duration, start);
+                   WalStats.m_wal_write_time +=
INSTR_TIME_GET_MILLISEC(duration);
+               }
* I think it should add the time in micro sec.
After running pgbench with track_wal_io_timing = on for 30 sec,
pg_stat_wal showed the following on my environment:

postgres(1:61569)=# select * from pg_stat_wal;
-[ RECORD 1 ]----+-----------------------------
wal_records | 285947
wal_fpi | 53285
wal_bytes | 442008213
wal_buffers_full | 0
wal_write | 25516
wal_write_time | 0
wal_sync | 25437
wal_sync_time | 14490
stats_reset | 2021-01-22 10:56:13.29464+09

Since writes can complete less than a millisecond, wal_write_time
didn't increase. I think sync_time could also have the same problem.
Thanks for your comments. I didn't notice that.
I changed the unit from milliseconds to microseconds.
---
+   /*
+    * Measure i/o timing to fsync WAL data.
+    *
+    * The wal receiver skip to collect it to avoid performance
degradation of standy servers.
+    * If sync_method doesn't have its fsync method, to skip too.
+    */
+   if (!AmWalReceiverProcess() && track_wal_io_timing &&
fsyncMethodCalled())
+       INSTR_TIME_SET_CURRENT(start);
* Why does only the wal receiver skip it even if track_wal_io_timinig
is true? I think the performance degradation is also true for backend
processes. If there is another reason for that, I think it's better to
mention in both the doc and comment.
* How about checking track_wal_io_timing first?
* s/standy/standby/
I fixed it.
As kuroda-san mentioned too, the skip is no need to be considered.
I think you also removed the code to have the wal receiver report the
stats. So with the latest patch, the wal receiver tracks those
statistics but doesn't report.

And maybe XLogWalRcvWrite() also needs to track I/O?
---
+   /* increment the i/o timing and the number of times to fsync WAL
data */
+   if (fsyncMethodCalled())
+   {
+       if (!AmWalReceiverProcess() && track_wal_io_timing)
+       {
+           INSTR_TIME_SET_CURRENT(duration);
+           INSTR_TIME_SUBTRACT(duration, start);
+           WalStats.m_wal_sync_time +=
INSTR_TIME_GET_MILLISEC(duration);
+       }
+
+       WalStats.m_wal_sync++;
+   }
* I'd avoid always calling fsyncMethodCalled() in this path. How about
incrementing m_wal_sync after each sync operation?
I think if syncing the disk does not occur, m_wal_sync should not be
incremented.
It depends enableFsync and sync_method.

enableFsync is checked in each fsync method like
pg_fsync_no_writethrough(),
so if incrementing m_wal_sync after each sync operation, it should be
implemented
in each fsync method. It leads to many duplicated codes.
Right. I missed that each fsync function checks enableFsync.

So, why don't you change the function to a flag whether to
sync data to the disk will be occurred or not in issue_xlog_fsync()?

Looks better. Since we don't necessarily need to increment m_wal_sync
after doing fsync we can write the code without an additional variable
as follows:

if (enableFsync)
{
switch (sync_method)
{
case SYNC_METHOD_FSYNC:
#ifdef HAVE_FSYNC_WRITETHROUGH
case SYNC_METHOD_FSYNC_WRITETHROUGH:
#endif
#ifdef HAVE_FDATASYNC
case SYNC_METHOD_FDATASYNC:
#endif
WalStats.m_wal_sync++;
if (track_wal_io_timing)
INSTR_TIME_SET_CURRENT(start);
break;
default:
break;
}
}

(do fsync and error handling here)

/* increment the i/o timing and the number of times to fsync WAL data */
if (track_wal_io_timing)
{
INSTR_TIME_SET_CURRENT(duration);
INSTR_TIME_SUBTRACT(duration, start);
WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
}

I think we can change the first switch-case to an if statement.

+1. We can also narrow the scope of "duration" into "if (track_wal_io_timing)" branch.

--
Regrads,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.

#14

Masahiro Ikeda

ikedamsh@oss.nttdata.com

almost 5 years ago

In reply to: kuroda.hayato@fujitsu.com (#11)

RE: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021-01-25 10:34, kuroda.hayato@fujitsu.com wrote:

Dear Ikeda-san,

Thank you for updating the patch. This can be applied to master, and
can be used on my RHEL7.
wal_write_time and wal_sync_time increase normally :-).

```
postgres=# select * from pg_stat_wal;
-[ RECORD 1 ]----+------------------------------
wal_records | 121781
wal_fpi | 2287
wal_bytes | 36055146
wal_buffers_full | 799
wal_write | 12770
wal_write_time | 4.469
wal_sync | 11962
wal_sync_time | 132.352
stats_reset | 2021-01-25 00:51:40.674412+00
```

Thanks for checking.

I put a further comment:

```
@@ -3485,7 +3485,53 @@ SELECT pid, wait_event_type, wait_event FROM
pg_stat_activity WHERE wait_event i
<structfield>wal_buffers_full</structfield> <type>bigint</type>
</para>
<para>
-       Number of times WAL data was written to disk because WAL
buffers became full
+       Total number of times WAL data was written to disk because WAL
buffers became full
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para 
role="column_definition">
+       <structfield>wal_write</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Total number of times WAL data was written to disk
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para 
role="column_definition">
+       <structfield>wal_write_time</structfield> <type>double 
precision</type>
+      </para>
+      <para>
+       Total amount of time that has been spent in the portion of
+       WAL data was written to disk, in milliseconds
+       (if <xref linkend="guc-track-wal-io-timing"/> is enabled,
otherwise zero).
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para 
role="column_definition">
+       <structfield>wal_sync</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Total number of times WAL data was synced to disk
+       (if <xref linkend="guc-wal-sync-method"/> is
<literal>open_datasync</literal> or
+       <literal>open_sync</literal>, this value is zero because WAL
data is synced
+       when to write it).
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para 
role="column_definition">
+       <structfield>wal_sync_time</structfield> <type>double 
precision</type>
+      </para>
+      <para>
+       Total amount of time that has been spent in the portion of
+       WAL data was synced to disk, in milliseconds
+       (if <xref linkend="guc-track-wal-io-timing"/> is enabled,
otherwise zero.
+       if <xref linkend="guc-wal-sync-method"/> is
<literal>open_datasync</literal> or
+       <literal>open_sync</literal>, this value is zero too because
WAL data is synced
+       when to write it).
</para></entry>
</row>
```

Maybe "Total amount of time" should be used, not "Total number of
time."
Other views use "amount."

Thanks.

I checked columns' descriptions of other views.
There are "Number of xxx", "Total number of xxx", "Total amount of time
that xxx" and "Total time spent xxx".

Since the "time" is used for showing spending time, not count,
I'll change it to "Total number of WAL data written/synced to disk".
Thought?

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

#15

kuroda.hayato@fujitsu.com

almost 5 years ago

In reply to: Masahiro Ikeda (#14)

RE: About to add WAL write/fsync statistics to pg_stat_wal view

Dear Ikeda-san,

I checked columns' descriptions of other views.
There are "Number of xxx", "Total number of xxx", "Total amount of time
that xxx" and "Total time spent xxx".

Right.

Since the "time" is used for showing spending time, not count,
I'll change it to "Total number of WAL data written/synced to disk".
Thought?

I misread your patch, sorry. I prefer your suggestion.
Please fix like that way with others.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

#16

Masahiro Ikeda

ikedamsh@oss.nttdata.com

almost 5 years ago

In reply to: Masahiko Sawada (#12)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021-01-25 10:36, Masahiko Sawada wrote:

On Fri, Jan 22, 2021 at 10:05 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:
On 2021-01-22 14:50, Masahiko Sawada wrote:
On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

Hi,

I rebased the patch to the master branch.

Thank you for working on this. I've read the latest patch. Here are
comments:
---
+               if (track_wal_io_timing)
+               {
+                   INSTR_TIME_SET_CURRENT(duration);
+                   INSTR_TIME_SUBTRACT(duration, start);
+                   WalStats.m_wal_write_time +=
INSTR_TIME_GET_MILLISEC(duration);
+               }
* I think it should add the time in micro sec.
After running pgbench with track_wal_io_timing = on for 30 sec,
pg_stat_wal showed the following on my environment:

postgres(1:61569)=# select * from pg_stat_wal;
-[ RECORD 1 ]----+-----------------------------
wal_records | 285947
wal_fpi | 53285
wal_bytes | 442008213
wal_buffers_full | 0
wal_write | 25516
wal_write_time | 0
wal_sync | 25437
wal_sync_time | 14490
stats_reset | 2021-01-22 10:56:13.29464+09

Since writes can complete less than a millisecond, wal_write_time
didn't increase. I think sync_time could also have the same problem.
Thanks for your comments. I didn't notice that.
I changed the unit from milliseconds to microseconds.
---
+   /*
+    * Measure i/o timing to fsync WAL data.
+    *
+    * The wal receiver skip to collect it to avoid performance
degradation of standy servers.
+    * If sync_method doesn't have its fsync method, to skip too.
+    */
+   if (!AmWalReceiverProcess() && track_wal_io_timing &&
fsyncMethodCalled())
+       INSTR_TIME_SET_CURRENT(start);
* Why does only the wal receiver skip it even if track_wal_io_timinig
is true? I think the performance degradation is also true for backend
processes. If there is another reason for that, I think it's better to
mention in both the doc and comment.
* How about checking track_wal_io_timing first?
* s/standy/standby/
I fixed it.
As kuroda-san mentioned too, the skip is no need to be considered.
I think you also removed the code to have the wal receiver report the
stats. So with the latest patch, the wal receiver tracks those
statistics but doesn't report.
And maybe XLogWalRcvWrite() also needs to track I/O?

Thanks, I forgot to add them.
I'll fix it.

---
+   /* increment the i/o timing and the number of times to fsync WAL
data */
+   if (fsyncMethodCalled())
+   {
+       if (!AmWalReceiverProcess() && track_wal_io_timing)
+       {
+           INSTR_TIME_SET_CURRENT(duration);
+           INSTR_TIME_SUBTRACT(duration, start);
+           WalStats.m_wal_sync_time +=
INSTR_TIME_GET_MILLISEC(duration);
+       }
+
+       WalStats.m_wal_sync++;
+   }
* I'd avoid always calling fsyncMethodCalled() in this path. How about
incrementing m_wal_sync after each sync operation?
I think if syncing the disk does not occur, m_wal_sync should not be
incremented.
It depends enableFsync and sync_method.

enableFsync is checked in each fsync method like
pg_fsync_no_writethrough(),
so if incrementing m_wal_sync after each sync operation, it should be
implemented
in each fsync method. It leads to many duplicated codes.
Right. I missed that each fsync function checks enableFsync.

So, why don't you change the function to a flag whether to
sync data to the disk will be occurred or not in issue_xlog_fsync()?

Looks better. Since we don't necessarily need to increment m_wal_sync
after doing fsync we can write the code without an additional variable
as follows:

if (enableFsync)
{
switch (sync_method)
{
case SYNC_METHOD_FSYNC:
#ifdef HAVE_FSYNC_WRITETHROUGH
case SYNC_METHOD_FSYNC_WRITETHROUGH:
#endif
#ifdef HAVE_FDATASYNC
case SYNC_METHOD_FDATASYNC:
#endif
WalStats.m_wal_sync++;
if (track_wal_io_timing)
INSTR_TIME_SET_CURRENT(start);
break;
default:
break;
}
}

(do fsync and error handling here)

/* increment the i/o timing and the number of times to fsync WAL
data */
if (track_wal_io_timing)
{
INSTR_TIME_SET_CURRENT(duration);
INSTR_TIME_SUBTRACT(duration, start);
WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
}

IIUC, I think we can't handle the following case.

When "sync_method" is SYNC_METHOD_OPEN or SYNC_METHOD_OPEN_DSYNC and
"track_wal_io_timing" is enabled, "start" doesn't be initialized.

My understanding is something wrong, isn't it?

I think we can change the first switch-case to an if statement.

Yes, I'll change it.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

#17

Masahiro Ikeda

ikedamsh@oss.nttdata.com

almost 5 years ago

In reply to: japin (#13)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021-01-25 11:47, japin wrote:

On Mon, 25 Jan 2021 at 09:36, Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Fri, Jan 22, 2021 at 10:05 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:
On 2021-01-22 14:50, Masahiko Sawada wrote:
On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

Hi,

I rebased the patch to the master branch.

Thank you for working on this. I've read the latest patch. Here are
comments:
---
+               if (track_wal_io_timing)
+               {
+                   INSTR_TIME_SET_CURRENT(duration);
+                   INSTR_TIME_SUBTRACT(duration, start);
+                   WalStats.m_wal_write_time +=
INSTR_TIME_GET_MILLISEC(duration);
+               }
* I think it should add the time in micro sec.
After running pgbench with track_wal_io_timing = on for 30 sec,
pg_stat_wal showed the following on my environment:

postgres(1:61569)=# select * from pg_stat_wal;
-[ RECORD 1 ]----+-----------------------------
wal_records | 285947
wal_fpi | 53285
wal_bytes | 442008213
wal_buffers_full | 0
wal_write | 25516
wal_write_time | 0
wal_sync | 25437
wal_sync_time | 14490
stats_reset | 2021-01-22 10:56:13.29464+09

Since writes can complete less than a millisecond, wal_write_time
didn't increase. I think sync_time could also have the same problem.
Thanks for your comments. I didn't notice that.
I changed the unit from milliseconds to microseconds.
---
+   /*
+    * Measure i/o timing to fsync WAL data.
+    *
+    * The wal receiver skip to collect it to avoid performance
degradation of standy servers.
+    * If sync_method doesn't have its fsync method, to skip too.
+    */
+   if (!AmWalReceiverProcess() && track_wal_io_timing &&
fsyncMethodCalled())
+       INSTR_TIME_SET_CURRENT(start);
* Why does only the wal receiver skip it even if track_wal_io_timinig
is true? I think the performance degradation is also true for backend
processes. If there is another reason for that, I think it's better to
mention in both the doc and comment.
* How about checking track_wal_io_timing first?
* s/standy/standby/
I fixed it.
As kuroda-san mentioned too, the skip is no need to be considered.
I think you also removed the code to have the wal receiver report the
stats. So with the latest patch, the wal receiver tracks those
statistics but doesn't report.

And maybe XLogWalRcvWrite() also needs to track I/O?
---
+   /* increment the i/o timing and the number of times to fsync WAL
data */
+   if (fsyncMethodCalled())
+   {
+       if (!AmWalReceiverProcess() && track_wal_io_timing)
+       {
+           INSTR_TIME_SET_CURRENT(duration);
+           INSTR_TIME_SUBTRACT(duration, start);
+           WalStats.m_wal_sync_time +=
INSTR_TIME_GET_MILLISEC(duration);
+       }
+
+       WalStats.m_wal_sync++;
+   }
* I'd avoid always calling fsyncMethodCalled() in this path. How about
incrementing m_wal_sync after each sync operation?
I think if syncing the disk does not occur, m_wal_sync should not be
incremented.
It depends enableFsync and sync_method.

enableFsync is checked in each fsync method like
pg_fsync_no_writethrough(),
so if incrementing m_wal_sync after each sync operation, it should be
implemented
in each fsync method. It leads to many duplicated codes.
Right. I missed that each fsync function checks enableFsync.

So, why don't you change the function to a flag whether to
sync data to the disk will be occurred or not in issue_xlog_fsync()?

Looks better. Since we don't necessarily need to increment m_wal_sync
after doing fsync we can write the code without an additional variable
as follows:

if (enableFsync)
{
switch (sync_method)
{
case SYNC_METHOD_FSYNC:
#ifdef HAVE_FSYNC_WRITETHROUGH
case SYNC_METHOD_FSYNC_WRITETHROUGH:
#endif
#ifdef HAVE_FDATASYNC
case SYNC_METHOD_FDATASYNC:
#endif
WalStats.m_wal_sync++;
if (track_wal_io_timing)
INSTR_TIME_SET_CURRENT(start);
break;
default:
break;
}
}

(do fsync and error handling here)

/* increment the i/o timing and the number of times to fsync WAL
data */
if (track_wal_io_timing)
{
INSTR_TIME_SET_CURRENT(duration);
INSTR_TIME_SUBTRACT(duration, start);
WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
}

I think we can change the first switch-case to an if statement.
+1. We can also narrow the scope of "duration" into "if
(track_wal_io_timing)" branch.

Thanks, I'll change it.

--
Masahiro Ikeda
NTT DATA CORPORATION

#18

Masahiro Ikeda

ikedamsh@oss.nttdata.com

almost 5 years ago

In reply to: Masahiro Ikeda (#16)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021-01-25 13:15, Masahiro Ikeda wrote:

On 2021-01-25 10:36, Masahiko Sawada wrote:
On Fri, Jan 22, 2021 at 10:05 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:
On 2021-01-22 14:50, Masahiko Sawada wrote:
On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

Hi,

I rebased the patch to the master branch.

Thank you for working on this. I've read the latest patch. Here are
comments:
---
+               if (track_wal_io_timing)
+               {
+                   INSTR_TIME_SET_CURRENT(duration);
+                   INSTR_TIME_SUBTRACT(duration, start);
+                   WalStats.m_wal_write_time +=
INSTR_TIME_GET_MILLISEC(duration);
+               }
* I think it should add the time in micro sec.
After running pgbench with track_wal_io_timing = on for 30 sec,
pg_stat_wal showed the following on my environment:

postgres(1:61569)=# select * from pg_stat_wal;
-[ RECORD 1 ]----+-----------------------------
wal_records | 285947
wal_fpi | 53285
wal_bytes | 442008213
wal_buffers_full | 0
wal_write | 25516
wal_write_time | 0
wal_sync | 25437
wal_sync_time | 14490
stats_reset | 2021-01-22 10:56:13.29464+09

Since writes can complete less than a millisecond, wal_write_time
didn't increase. I think sync_time could also have the same problem.
Thanks for your comments. I didn't notice that.
I changed the unit from milliseconds to microseconds.
---
+   /*
+    * Measure i/o timing to fsync WAL data.
+    *
+    * The wal receiver skip to collect it to avoid performance
degradation of standy servers.
+    * If sync_method doesn't have its fsync method, to skip too.
+    */
+   if (!AmWalReceiverProcess() && track_wal_io_timing &&
fsyncMethodCalled())
+       INSTR_TIME_SET_CURRENT(start);
* Why does only the wal receiver skip it even if track_wal_io_timinig
is true? I think the performance degradation is also true for backend
processes. If there is another reason for that, I think it's better to
mention in both the doc and comment.
* How about checking track_wal_io_timing first?
* s/standy/standby/
I fixed it.
As kuroda-san mentioned too, the skip is no need to be considered.
I think you also removed the code to have the wal receiver report the
stats. So with the latest patch, the wal receiver tracks those
statistics but doesn't report.
And maybe XLogWalRcvWrite() also needs to track I/O?
Thanks, I forgot to add them.
I'll fix it.
---
+   /* increment the i/o timing and the number of times to fsync WAL
data */
+   if (fsyncMethodCalled())
+   {
+       if (!AmWalReceiverProcess() && track_wal_io_timing)
+       {
+           INSTR_TIME_SET_CURRENT(duration);
+           INSTR_TIME_SUBTRACT(duration, start);
+           WalStats.m_wal_sync_time +=
INSTR_TIME_GET_MILLISEC(duration);
+       }
+
+       WalStats.m_wal_sync++;
+   }
* I'd avoid always calling fsyncMethodCalled() in this path. How about
incrementing m_wal_sync after each sync operation?
I think if syncing the disk does not occur, m_wal_sync should not be
incremented.
It depends enableFsync and sync_method.

enableFsync is checked in each fsync method like
pg_fsync_no_writethrough(),
so if incrementing m_wal_sync after each sync operation, it should be
implemented
in each fsync method. It leads to many duplicated codes.
Right. I missed that each fsync function checks enableFsync.

So, why don't you change the function to a flag whether to
sync data to the disk will be occurred or not in issue_xlog_fsync()?

Looks better. Since we don't necessarily need to increment m_wal_sync
after doing fsync we can write the code without an additional variable
as follows:

if (enableFsync)
{
switch (sync_method)
{
case SYNC_METHOD_FSYNC:
#ifdef HAVE_FSYNC_WRITETHROUGH
case SYNC_METHOD_FSYNC_WRITETHROUGH:
#endif
#ifdef HAVE_FDATASYNC
case SYNC_METHOD_FDATASYNC:
#endif
WalStats.m_wal_sync++;
if (track_wal_io_timing)
INSTR_TIME_SET_CURRENT(start);
break;
default:
break;
}
}

(do fsync and error handling here)

/* increment the i/o timing and the number of times to fsync WAL
data */
if (track_wal_io_timing)
{
INSTR_TIME_SET_CURRENT(duration);
INSTR_TIME_SUBTRACT(duration, start);
WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
}
IIUC, I think we can't handle the following case.

When "sync_method" is SYNC_METHOD_OPEN or SYNC_METHOD_OPEN_DSYNC and
"track_wal_io_timing" is enabled, "start" doesn't be initialized.

My understanding is something wrong, isn't it?

I thought the following is better.

```
/* Measure i/o timing to sync WAL data.*/
if (track_wal_io_timing)
INSTR_TIME_SET_CURRENT(start);

(do fsync and error handling here)

/* check whether to sync WAL data to the disk right now. */
if (enableFsync)
{
if ((sync_method == SYNC_METHOD_FSYNC) ||
(sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH) ||
(sync_method == SYNC_METHOD_FDATASYNC))
{
/* increment the i/o timing and the number of times to fsync WAL data
*/
if (track_wal_io_timing)
{
instr_time duration;

INSTR_TIME_SET_CURRENT(duration);
INSTR_TIME_SUBTRACT(duration, start);
WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
}
WalStats.m_wal_sync++;
}
}
```

Although INSTR_TIME_SET_CURRENT(start) is called everytime regardless
of the "sync_method" and "enableFsync", we don't make additional
variables.
But it's ok because "track_wal_io_timing" leads already performance
degradation.

What do you think?

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

#19

Masahiko Sawada

sawada.mshk@gmail.com

almost 5 years ago

In reply to: Masahiro Ikeda (#18)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On Mon, Jan 25, 2021 at 1:28 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:

On 2021-01-25 13:15, Masahiro Ikeda wrote:
On 2021-01-25 10:36, Masahiko Sawada wrote:
On Fri, Jan 22, 2021 at 10:05 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:
On 2021-01-22 14:50, Masahiko Sawada wrote:
On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

Hi,

I rebased the patch to the master branch.

Thank you for working on this. I've read the latest patch. Here are
comments:
---
+               if (track_wal_io_timing)
+               {
+                   INSTR_TIME_SET_CURRENT(duration);
+                   INSTR_TIME_SUBTRACT(duration, start);
+                   WalStats.m_wal_write_time +=
INSTR_TIME_GET_MILLISEC(duration);
+               }
* I think it should add the time in micro sec.
After running pgbench with track_wal_io_timing = on for 30 sec,
pg_stat_wal showed the following on my environment:

postgres(1:61569)=# select * from pg_stat_wal;
-[ RECORD 1 ]----+-----------------------------
wal_records | 285947
wal_fpi | 53285
wal_bytes | 442008213
wal_buffers_full | 0
wal_write | 25516
wal_write_time | 0
wal_sync | 25437
wal_sync_time | 14490
stats_reset | 2021-01-22 10:56:13.29464+09

Since writes can complete less than a millisecond, wal_write_time
didn't increase. I think sync_time could also have the same problem.
Thanks for your comments. I didn't notice that.
I changed the unit from milliseconds to microseconds.
---
+   /*
+    * Measure i/o timing to fsync WAL data.
+    *
+    * The wal receiver skip to collect it to avoid performance
degradation of standy servers.
+    * If sync_method doesn't have its fsync method, to skip too.
+    */
+   if (!AmWalReceiverProcess() && track_wal_io_timing &&
fsyncMethodCalled())
+       INSTR_TIME_SET_CURRENT(start);
* Why does only the wal receiver skip it even if track_wal_io_timinig
is true? I think the performance degradation is also true for backend
processes. If there is another reason for that, I think it's better to
mention in both the doc and comment.
* How about checking track_wal_io_timing first?
* s/standy/standby/
I fixed it.
As kuroda-san mentioned too, the skip is no need to be considered.
I think you also removed the code to have the wal receiver report the
stats. So with the latest patch, the wal receiver tracks those
statistics but doesn't report.
And maybe XLogWalRcvWrite() also needs to track I/O?
Thanks, I forgot to add them.
I'll fix it.
---
+   /* increment the i/o timing and the number of times to fsync WAL
data */
+   if (fsyncMethodCalled())
+   {
+       if (!AmWalReceiverProcess() && track_wal_io_timing)
+       {
+           INSTR_TIME_SET_CURRENT(duration);
+           INSTR_TIME_SUBTRACT(duration, start);
+           WalStats.m_wal_sync_time +=
INSTR_TIME_GET_MILLISEC(duration);
+       }
+
+       WalStats.m_wal_sync++;
+   }
* I'd avoid always calling fsyncMethodCalled() in this path. How about
incrementing m_wal_sync after each sync operation?
I think if syncing the disk does not occur, m_wal_sync should not be
incremented.
It depends enableFsync and sync_method.

enableFsync is checked in each fsync method like
pg_fsync_no_writethrough(),
so if incrementing m_wal_sync after each sync operation, it should be
implemented
in each fsync method. It leads to many duplicated codes.
Right. I missed that each fsync function checks enableFsync.

So, why don't you change the function to a flag whether to
sync data to the disk will be occurred or not in issue_xlog_fsync()?

Looks better. Since we don't necessarily need to increment m_wal_sync
after doing fsync we can write the code without an additional variable
as follows:

if (enableFsync)
{
switch (sync_method)
{
case SYNC_METHOD_FSYNC:
#ifdef HAVE_FSYNC_WRITETHROUGH
case SYNC_METHOD_FSYNC_WRITETHROUGH:
#endif
#ifdef HAVE_FDATASYNC
case SYNC_METHOD_FDATASYNC:
#endif
WalStats.m_wal_sync++;
if (track_wal_io_timing)
INSTR_TIME_SET_CURRENT(start);
break;
default:
break;
}
}

(do fsync and error handling here)

/* increment the i/o timing and the number of times to fsync WAL
data */
if (track_wal_io_timing)
{
INSTR_TIME_SET_CURRENT(duration);
INSTR_TIME_SUBTRACT(duration, start);
WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
}
IIUC, I think we can't handle the following case.

When "sync_method" is SYNC_METHOD_OPEN or SYNC_METHOD_OPEN_DSYNC and
"track_wal_io_timing" is enabled, "start" doesn't be initialized.

My understanding is something wrong, isn't it?

You're right. We might want to initialize 'start' with 0 in those two
cases and check if INSTR_TIME_IS_ZERO() later when accumulating the
I/O time.

I thought the following is better.

```
/* Measure i/o timing to sync WAL data.*/
if (track_wal_io_timing)
INSTR_TIME_SET_CURRENT(start);

(do fsync and error handling here)

/* check whether to sync WAL data to the disk right now. */
if (enableFsync)
{
if ((sync_method == SYNC_METHOD_FSYNC) ||
(sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH) ||
(sync_method == SYNC_METHOD_FDATASYNC))
{
/* increment the i/o timing and the number of times to fsync WAL data
*/
if (track_wal_io_timing)
{
instr_time duration;

INSTR_TIME_SET_CURRENT(duration);
INSTR_TIME_SUBTRACT(duration, start);
WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
}
WalStats.m_wal_sync++;
}
}
```

Although INSTR_TIME_SET_CURRENT(start) is called everytime regardless
of the "sync_method" and "enableFsync", we don't make additional
variables.
But it's ok because "track_wal_io_timing" leads already performance
degradation.

What do you think?

That also fine with me.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#20

Masahiro Ikeda

ikedamsh@oss.nttdata.com

almost 5 years ago

In reply to: Masahiko Sawada (#19)

2 attachment(s)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

Hi, thanks for the reviews.

I updated the attached patch.
The summary of the changes is following.

1. fix document

I followed another view's comments.

2. refactor issue_xlog_fsync()

I removed "sync_called" variables, narrowed the "duration" scope and
change the switch statement to if statement.

3. make wal-receiver report WAL statistics

I add the code to collect the statistics for a written operation
in XLogWalRcvWrite() and to report stats in WalReceiverMain().

Since WalReceiverMain() can loop fast, to avoid loading stats collector,
I add "force" argument to the pgstat_send_wal function. If "force" is
false, it can skip reporting until at least 500 msec since it last
reported. WalReceiverMain() almost calls pgstat_send_wal() with "force"
as false.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachments:

v5_v6.difftext/x-diff; name=v5_v6.diffDownload

--- v5-0001-Add-statistics-related-to-write-sync-wal-records.patch	2021-01-23 09:26:01.919248712 +0900
+++ v6-0001-Add-statistics-related-to-write-sync-wal-records.patch	2021-01-25 16:27:50.749429666 +0900
@@ -1,6 +1,6 @@
-From ee1b7d17391b9d9619f709afeacdd118973471d6 Mon Sep 17 00:00:00 2001
+From e9aad92097c5cff5565b67ce1a8ec6d7b4c8a4d9 Mon Sep 17 00:00:00 2001
 From: Masahiro Ikeda <ikedamsh@oss.nttdata.com>
-Date: Fri, 22 Jan 2021 21:38:31 +0900
+Date: Mon, 25 Jan 2021 16:26:04 +0900
 Subject: [PATCH] Add statistics related to write/sync wal records.
 
 This patch adds following statistics to pg_stat_wal view
@@ -24,20 +24,22 @@
 
 (This requires a catversion bump, as well as an update to PGSTAT_FILE_FORMAT_ID)
 ---
- doc/src/sgml/config.sgml                      | 21 +++++++
- doc/src/sgml/monitoring.sgml                  | 48 ++++++++++++++-
- src/backend/access/transam/xlog.c             | 59 ++++++++++++++++++-
+ doc/src/sgml/config.sgml                      | 21 ++++++++
+ doc/src/sgml/monitoring.sgml                  | 48 ++++++++++++++++-
+ src/backend/access/transam/xlog.c             | 51 ++++++++++++++++++-
  src/backend/catalog/system_views.sql          |  4 ++
- src/backend/postmaster/pgstat.c               |  4 ++
- src/backend/postmaster/walwriter.c            |  3 +
- src/backend/utils/adt/pgstatfuncs.c           | 24 +++++++-
- src/backend/utils/misc/guc.c                  |  9 +++
+ src/backend/postmaster/checkpointer.c         |  2 +-
+ src/backend/postmaster/pgstat.c               | 26 ++++++++--
+ src/backend/postmaster/walwriter.c            |  3 ++
+ src/backend/replication/walreceiver.c         | 30 +++++++++++
+ src/backend/utils/adt/pgstatfuncs.c           | 24 +++++++--
+ src/backend/utils/misc/guc.c                  |  9 ++++
  src/backend/utils/misc/postgresql.conf.sample |  1 +
  src/include/access/xlog.h                     |  1 +
  src/include/catalog/pg_proc.dat               | 14 ++---
- src/include/pgstat.h                          |  8 +++
- src/test/regress/expected/rules.out           |  6 +-
- 13 files changed, 189 insertions(+), 13 deletions(-)
+ src/include/pgstat.h                          | 10 +++-
+ src/test/regress/expected/rules.out           |  6 ++-
+ 15 files changed, 232 insertions(+), 18 deletions(-)
 
 diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
 index 82864bbb24..43f3fbcaf8 100644
@@ -72,7 +74,7 @@
        <term><varname>track_functions</varname> (<type>enum</type>)
        <indexterm>
 diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
-index f05140dd42..36764dc64f 100644
+index f05140dd42..5a8fc4eb0c 100644
 --- a/doc/src/sgml/monitoring.sgml
 +++ b/doc/src/sgml/monitoring.sgml
 @@ -3485,7 +3485,53 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
@@ -80,7 +82,7 @@
        </para>
        <para>
 -       Number of times WAL data was written to disk because WAL buffers became full
-+       Total number of times WAL data was written to disk because WAL buffers became full
++       Total number of WAL data written to disk because WAL buffers became full
 +      </para></entry>
 +     </row>
 +
@@ -89,7 +91,7 @@
 +       <structfield>wal_write</structfield> <type>bigint</type>
 +      </para>
 +      <para>
-+       Total number of times WAL data was written to disk
++       Total number of WAL data written to disk
 +      </para></entry>
 +     </row>
 +
@@ -109,7 +111,7 @@
 +       <structfield>wal_sync</structfield> <type>bigint</type>
 +      </para>
 +      <para>
-+       Total number of times WAL data was synced to disk
++       Total number of WAL data synced to disk
 +       (if <xref linkend="guc-wal-sync-method"/> is <literal>open_datasync</literal> or 
 +       <literal>open_sync</literal>, this value is zero because WAL data is synced 
 +       when to write it).
@@ -131,7 +133,7 @@
       </row>
  
 diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
-index 470e113b33..b3890f97a2 100644
+index 470e113b33..1c4860bee7 100644
 --- a/src/backend/access/transam/xlog.c
 +++ b/src/backend/access/transam/xlog.c
 @@ -110,6 +110,7 @@ int			CommitDelay = 0;	/* precommit delay in microseconds */
@@ -142,16 +144,15 @@
  
  #ifdef WAL_DEBUG
  bool		XLOG_DEBUG = false;
-@@ -2540,6 +2541,8 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
+@@ -2540,6 +2541,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
  			Size		nbytes;
  			Size		nleft;
  			int			written;
 +			instr_time	start;
-+			instr_time	duration;
  
  			/* OK to write the page(s) */
  			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
-@@ -2548,9 +2551,24 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
+@@ -2548,9 +2550,27 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
  			do
  			{
  				errno = 0;
@@ -164,8 +165,11 @@
  				written = pg_pwrite(openLogFile, from, nleft, startoffset);
  				pgstat_report_wait_end();
 +
++				/* increment the i/o timing and the number of WAL data written */
 +				if (track_wal_io_timing)
 +				{
++					instr_time	duration;
++
 +					INSTR_TIME_SET_CURRENT(duration);
 +					INSTR_TIME_SUBTRACT(duration, start);
 +					WalStats.m_wal_write_time = INSTR_TIME_GET_MICROSEC(duration);
@@ -176,57 +180,47 @@
  				if (written <= 0)
  				{
  					char		xlogfname[MAXFNAMELEN];
-@@ -10565,7 +10583,33 @@ assign_xlog_sync_method(int new_sync_method, void *extra)
+@@ -10565,7 +10585,12 @@ assign_xlog_sync_method(int new_sync_method, void *extra)
  void
  issue_xlog_fsync(int fd, XLogSegNo segno)
  {
 -	char	   *msg = NULL;
 +	char		*msg = NULL;
-+	bool		sync_called = false;	/* whether to sync data right now. */
 +	instr_time	start;
-+	instr_time	duration;
-+
-+	/* check whether to sync WAL data to the disk right now. */
-+	if (enableFsync)
-+	{
-+		switch (sync_method)
-+		{
-+			case SYNC_METHOD_FSYNC:
-+#ifdef HAVE_FSYNC_WRITETHROUGH
-+			case SYNC_METHOD_FSYNC_WRITETHROUGH:
-+#endif
-+#ifdef HAVE_FDATASYNC
-+			case SYNC_METHOD_FDATASYNC:
-+#endif
-+				sync_called = true;
-+			default:
-+				/* other method synced data already or it's not unrecognized. */
-+				break;
-+		}
-+	}
 +
 +	/* Measure i/o timing to sync WAL data.*/
-+	if (track_wal_io_timing && sync_called)
++	if (track_wal_io_timing)
 +		INSTR_TIME_SET_CURRENT(start);
  
  	pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
  	switch (sync_method)
-@@ -10610,6 +10654,19 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
+@@ -10610,6 +10635,30 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
  	}
  
  	pgstat_report_wait_end();
 +
-+	if (sync_called)
++	/* 
++	 * check whether to sync WAL data to the disk right now because 
++	 * statistics must be incremented when syncing really occurred.
++	 */
++	if (enableFsync)
 +	{
-+		/* increment the i/o timing and the number of times to fsync WAL data */
-+		if (track_wal_io_timing)
++		if ((sync_method == SYNC_METHOD_FSYNC) ||
++			(sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH) ||
++			(sync_method == SYNC_METHOD_FDATASYNC))
 +		{
-+			INSTR_TIME_SET_CURRENT(duration);
-+			INSTR_TIME_SUBTRACT(duration, start);
-+			WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
-+		}
++			/* increment the i/o timing and the number of WAL data synced */
++			if (track_wal_io_timing)
++			{
++				instr_time	duration;
++
++				INSTR_TIME_SET_CURRENT(duration);
++				INSTR_TIME_SUBTRACT(duration, start);
++				WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
++			}
 +
-+		WalStats.m_wal_sync++;
++			WalStats.m_wal_sync++;
++		}
 +	}
  }
  
@@ -246,11 +240,69 @@
          w.stats_reset
      FROM pg_stat_get_wal() w;
  
+diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
+index 54a818bf61..5d14a97e56 100644
+--- a/src/backend/postmaster/checkpointer.c
++++ b/src/backend/postmaster/checkpointer.c
+@@ -505,7 +505,7 @@ CheckpointerMain(void)
+ 		pgstat_send_bgwriter();
+ 
+ 		/* Send WAL statistics to the stats collector. */
+-		pgstat_send_wal();
++		pgstat_send_wal(true);
+ 
+ 		/*
+ 		 * If any checkpoint flags have been set, redo the loop to handle the
 diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
-index f75b52719d..987bbd058d 100644
+index f75b52719d..256d8706ca 100644
 --- a/src/backend/postmaster/pgstat.c
 +++ b/src/backend/postmaster/pgstat.c
-@@ -6892,6 +6892,10 @@ pgstat_recv_wal(PgStat_MsgWal *msg, int len)
+@@ -975,7 +975,7 @@ pgstat_report_stat(bool disconnect)
+ 	pgstat_send_funcstats();
+ 
+ 	/* Send WAL statistics */
+-	pgstat_send_wal();
++	pgstat_send_wal(true);
+ 
+ 	/* Finally send SLRU statistics */
+ 	pgstat_send_slru();
+@@ -4669,17 +4669,33 @@ pgstat_send_bgwriter(void)
+ /* ----------
+  * pgstat_send_wal() -
+  *
+- *		Send WAL statistics to the collector
++ *		Send WAL statistics to the collector.
++ *
++ *		If force is false, don't send a message unless it's been at 
++ *		least PGSTAT_STAT_INTERVAL msec since we last sent one.
+  * ----------
+  */
+ void
+-pgstat_send_wal(void)
++pgstat_send_wal(bool force)
+ {
+ 	/* We assume this initializes to zeroes */
+ 	static const PgStat_MsgWal all_zeroes;
++	static TimestampTz last_report = 0;
+ 
++	TimestampTz	now;
+ 	WalUsage	walusage;
+ 
++	/*
++	 * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
++	 * msec since we last sent one or specified "force".
++	 */
++	now = GetCurrentTimestamp();
++	if (!force &&
++		!TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
++		return;
++
++	last_report = now;
++
+ 	/*
+ 	 * Calculate how much WAL usage counters are increased by substracting the
+ 	 * previous counters from the current ones. Fill the results in WAL stats
+@@ -6892,6 +6908,10 @@ pgstat_recv_wal(PgStat_MsgWal *msg, int len)
  	walStats.wal_fpi += msg->m_wal_fpi;
  	walStats.wal_bytes += msg->m_wal_bytes;
  	walStats.wal_buffers_full += msg->m_wal_buffers_full;
@@ -262,7 +314,7 @@
  
  /* ----------
 diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
-index 4f1a8e356b..104cba4581 100644
+index 4f1a8e356b..7fd56d1497 100644
 --- a/src/backend/postmaster/walwriter.c
 +++ b/src/backend/postmaster/walwriter.c
 @@ -253,6 +253,9 @@ WalWriterMain(void)
@@ -270,11 +322,68 @@
  			left_till_hibernate--;
  
 +		/* Send WAL statistics */
-+		pgstat_send_wal();
++		pgstat_send_wal(true);
 +
  		/*
  		 * Sleep until we are signaled or WalWriterDelay has elapsed.  If we
  		 * haven't done anything useful for quite some time, lengthen the
+diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
+index 723f513d8b..901b194773 100644
+--- a/src/backend/replication/walreceiver.c
++++ b/src/backend/replication/walreceiver.c
+@@ -485,7 +485,18 @@ WalReceiverMain(void)
+ 
+ 				/* Check if we need to exit the streaming loop. */
+ 				if (endofwal)
++				{
++					/* Send WAL statistics to the stats collector. */
++					pgstat_send_wal(true);
+ 					break;
++				}
++
++				/* 
++				 * Send WAL statistics to the stats collector.
++				 * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
++				 * msec since we last sent one.
++				 */
++				pgstat_send_wal(false);
+ 
+ 				/*
+ 				 * Ideally we would reuse a WaitEventSet object repeatedly
+@@ -874,6 +885,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
+ 	while (nbytes > 0)
+ 	{
+ 		int			segbytes;
++		instr_time	start;
+ 
+ 		if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo, wal_segment_size))
+ 		{
+@@ -931,7 +943,25 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
+ 		/* OK to write the logs */
+ 		errno = 0;
+ 
++
++		/* Measure i/o timing to write WAL data */
++		if (track_wal_io_timing)
++			INSTR_TIME_SET_CURRENT(start);
++
+ 		byteswritten = pg_pwrite(recvFile, buf, segbytes, (off_t) startoff);
++
++		/* increment the i/o timing and the number of WAL data written */
++		if (track_wal_io_timing)
++		{
++			instr_time	duration;
++
++			INSTR_TIME_SET_CURRENT(duration);
++			INSTR_TIME_SUBTRACT(duration, start);
++			WalStats.m_wal_write_time = INSTR_TIME_GET_MICROSEC(duration);
++		}
++
++		WalStats.m_wal_write++;
++
+ 		if (byteswritten <= 0)
+ 		{
+ 			char		xlogfname[MAXFNAMELEN];
 diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
 index 62bff52638..7296ef04ff 100644
 --- a/src/backend/utils/adt/pgstatfuncs.c
@@ -394,7 +503,7 @@
  { oid => '2306', descr => 'statistics: information about SLRU caches',
    proname => 'pg_stat_get_slru', prorows => '100', proisstrict => 'f',
 diff --git a/src/include/pgstat.h b/src/include/pgstat.h
-index 724068cf87..e689d27480 100644
+index 724068cf87..8ef959c0cc 100644
 --- a/src/include/pgstat.h
 +++ b/src/include/pgstat.h
 @@ -474,6 +474,10 @@ typedef struct PgStat_MsgWal
@@ -419,6 +528,15 @@
  	TimestampTz stat_reset_timestamp;
  } PgStat_WalStats;
  
+@@ -1590,7 +1598,7 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
+ 
+ extern void pgstat_send_archiver(const char *xlog, bool failed);
+ extern void pgstat_send_bgwriter(void);
+-extern void pgstat_send_wal(void);
++extern void pgstat_send_wal(bool force);
+ 
+ /* ----------
+  * Support functions for the SQL-callable functions to
 diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
 index 6173473de9..bc3909fd17 100644
 --- a/src/test/regress/expected/rules.out

v6-0001-Add-statistics-related-to-write-sync-wal-records.patchtext/x-diff; name=v6-0001-Add-statistics-related-to-write-sync-wal-records.patchDownload

From e9aad92097c5cff5565b67ce1a8ec6d7b4c8a4d9 Mon Sep 17 00:00:00 2001
From: Masahiro Ikeda <ikedamsh@oss.nttdata.com>
Date: Mon, 25 Jan 2021 16:26:04 +0900
Subject: [PATCH] Add statistics related to write/sync wal records.

This patch adds following statistics to pg_stat_wal view
to track WAL I/O activity.

- the total number of writing/syncing WAL data.
- the total amount of time that has been spent in
  writing/syncing WAL data.

Since to track I/O timing may leads significant overhead,
GUC parameter "track_wal_io_timing" is introduced.
Only if this is on, the I/O timing is measured.

The statistics related to sync are zero when "wal_sync_method"
is "open_datasync" or "open_sync", because it leads syncing
WAL data at same time when to write it.

Author: Masahiro Ikeda
Reviewed-By: Japin Li, Hayato Kuroda, Masahiko Sawada
Discussion: https://postgr.es/m/0509ad67b585a5b86a83d445dfa75392@oss.nttdata.com

(This requires a catversion bump, as well as an update to PGSTAT_FILE_FORMAT_ID)
---
 doc/src/sgml/config.sgml                      | 21 ++++++++
 doc/src/sgml/monitoring.sgml                  | 48 ++++++++++++++++-
 src/backend/access/transam/xlog.c             | 51 ++++++++++++++++++-
 src/backend/catalog/system_views.sql          |  4 ++
 src/backend/postmaster/checkpointer.c         |  2 +-
 src/backend/postmaster/pgstat.c               | 26 ++++++++--
 src/backend/postmaster/walwriter.c            |  3 ++
 src/backend/replication/walreceiver.c         | 30 +++++++++++
 src/backend/utils/adt/pgstatfuncs.c           | 24 +++++++--
 src/backend/utils/misc/guc.c                  |  9 ++++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/access/xlog.h                     |  1 +
 src/include/catalog/pg_proc.dat               | 14 ++---
 src/include/pgstat.h                          | 10 +++-
 src/test/regress/expected/rules.out           |  6 ++-
 15 files changed, 232 insertions(+), 18 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 82864bbb24..43f3fbcaf8 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7416,6 +7416,27 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-track-wal-io-timing" xreflabel="track_wal_io_timing">
+      <term><varname>track_wal_io_timing</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>track_wal_io_timing</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables timing of WAL I/O calls. This parameter is off by default,
+        because it will repeatedly query the operating system for
+        the current time, which may cause significant overhead on some
+        platforms.  You can use the <xref linkend="pgtesttiming"/> tool to
+        measure the overhead of timing on your system.
+        I/O timing information is
+        displayed in <link linkend="monitoring-pg-stat-wal-view">
+        <structname>pg_stat_wal</structname></link>.  Only superusers can
+        change this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-track-functions" xreflabel="track_functions">
       <term><varname>track_functions</varname> (<type>enum</type>)
       <indexterm>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index f05140dd42..5a8fc4eb0c 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3485,7 +3485,53 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
        <structfield>wal_buffers_full</structfield> <type>bigint</type>
       </para>
       <para>
-       Number of times WAL data was written to disk because WAL buffers became full
+       Total number of WAL data written to disk because WAL buffers became full
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Total number of WAL data written to disk
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time that has been spent in the portion of
+       WAL data was written to disk, in milliseconds
+       (if <xref linkend="guc-track-wal-io-timing"/> is enabled, otherwise zero).
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Total number of WAL data synced to disk
+       (if <xref linkend="guc-wal-sync-method"/> is <literal>open_datasync</literal> or 
+       <literal>open_sync</literal>, this value is zero because WAL data is synced 
+       when to write it).
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time that has been spent in the portion of
+       WAL data was synced to disk, in milliseconds
+       (if <xref linkend="guc-track-wal-io-timing"/> is enabled, otherwise zero.
+       if <xref linkend="guc-wal-sync-method"/> is <literal>open_datasync</literal> or 
+       <literal>open_sync</literal>, this value is zero too because WAL data is synced 
+       when to write it).
       </para></entry>
      </row>
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 470e113b33..1c4860bee7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -110,6 +110,7 @@ int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
 int			max_slot_wal_keep_size_mb = -1;
+bool		track_wal_io_timing = false;
 
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
@@ -2540,6 +2541,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			int			written;
+			instr_time	start;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
@@ -2548,9 +2550,27 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			do
 			{
 				errno = 0;
+
+				/* Measure i/o timing to write WAL data */
+				if (track_wal_io_timing)
+					INSTR_TIME_SET_CURRENT(start);
+
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 				written = pg_pwrite(openLogFile, from, nleft, startoffset);
 				pgstat_report_wait_end();
+
+				/* increment the i/o timing and the number of WAL data written */
+				if (track_wal_io_timing)
+				{
+					instr_time	duration;
+
+					INSTR_TIME_SET_CURRENT(duration);
+					INSTR_TIME_SUBTRACT(duration, start);
+					WalStats.m_wal_write_time = INSTR_TIME_GET_MICROSEC(duration);
+				}
+
+				WalStats.m_wal_write++;
+
 				if (written <= 0)
 				{
 					char		xlogfname[MAXFNAMELEN];
@@ -10565,7 +10585,12 @@ assign_xlog_sync_method(int new_sync_method, void *extra)
 void
 issue_xlog_fsync(int fd, XLogSegNo segno)
 {
-	char	   *msg = NULL;
+	char		*msg = NULL;
+	instr_time	start;
+
+	/* Measure i/o timing to sync WAL data.*/
+	if (track_wal_io_timing)
+		INSTR_TIME_SET_CURRENT(start);
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
 	switch (sync_method)
@@ -10610,6 +10635,30 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
 	}
 
 	pgstat_report_wait_end();
+
+	/* 
+	 * check whether to sync WAL data to the disk right now because 
+	 * statistics must be incremented when syncing really occurred.
+	 */
+	if (enableFsync)
+	{
+		if ((sync_method == SYNC_METHOD_FSYNC) ||
+			(sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH) ||
+			(sync_method == SYNC_METHOD_FDATASYNC))
+		{
+			/* increment the i/o timing and the number of WAL data synced */
+			if (track_wal_io_timing)
+			{
+				instr_time	duration;
+
+				INSTR_TIME_SET_CURRENT(duration);
+				INSTR_TIME_SUBTRACT(duration, start);
+				WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
+			}
+
+			WalStats.m_wal_sync++;
+		}
+	}
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fa58afd9d7..b8ace4fc41 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1004,6 +1004,10 @@ CREATE VIEW pg_stat_wal AS
         w.wal_fpi,
         w.wal_bytes,
         w.wal_buffers_full,
+        w.wal_write,
+        w.wal_write_time,
+        w.wal_sync,
+        w.wal_sync_time,
         w.stats_reset
     FROM pg_stat_get_wal() w;
 
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 54a818bf61..5d14a97e56 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -505,7 +505,7 @@ CheckpointerMain(void)
 		pgstat_send_bgwriter();
 
 		/* Send WAL statistics to the stats collector. */
-		pgstat_send_wal();
+		pgstat_send_wal(true);
 
 		/*
 		 * If any checkpoint flags have been set, redo the loop to handle the
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f75b52719d..256d8706ca 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -975,7 +975,7 @@ pgstat_report_stat(bool disconnect)
 	pgstat_send_funcstats();
 
 	/* Send WAL statistics */
-	pgstat_send_wal();
+	pgstat_send_wal(true);
 
 	/* Finally send SLRU statistics */
 	pgstat_send_slru();
@@ -4669,17 +4669,33 @@ pgstat_send_bgwriter(void)
 /* ----------
  * pgstat_send_wal() -
  *
- *		Send WAL statistics to the collector
+ *		Send WAL statistics to the collector.
+ *
+ *		If force is false, don't send a message unless it's been at 
+ *		least PGSTAT_STAT_INTERVAL msec since we last sent one.
  * ----------
  */
 void
-pgstat_send_wal(void)
+pgstat_send_wal(bool force)
 {
 	/* We assume this initializes to zeroes */
 	static const PgStat_MsgWal all_zeroes;
+	static TimestampTz last_report = 0;
 
+	TimestampTz	now;
 	WalUsage	walusage;
 
+	/*
+	 * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
+	 * msec since we last sent one or specified "force".
+	 */
+	now = GetCurrentTimestamp();
+	if (!force &&
+		!TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
+		return;
+
+	last_report = now;
+
 	/*
 	 * Calculate how much WAL usage counters are increased by substracting the
 	 * previous counters from the current ones. Fill the results in WAL stats
@@ -6892,6 +6908,10 @@ pgstat_recv_wal(PgStat_MsgWal *msg, int len)
 	walStats.wal_fpi += msg->m_wal_fpi;
 	walStats.wal_bytes += msg->m_wal_bytes;
 	walStats.wal_buffers_full += msg->m_wal_buffers_full;
+	walStats.wal_write += msg->m_wal_write;
+	walStats.wal_write_time += msg->m_wal_write_time;
+	walStats.wal_sync += msg->m_wal_sync;
+	walStats.wal_sync_time += msg->m_wal_sync_time;
 }
 
 /* ----------
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index 4f1a8e356b..7fd56d1497 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -253,6 +253,9 @@ WalWriterMain(void)
 		else if (left_till_hibernate > 0)
 			left_till_hibernate--;
 
+		/* Send WAL statistics */
+		pgstat_send_wal(true);
+
 		/*
 		 * Sleep until we are signaled or WalWriterDelay has elapsed.  If we
 		 * haven't done anything useful for quite some time, lengthen the
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 723f513d8b..901b194773 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -485,7 +485,18 @@ WalReceiverMain(void)
 
 				/* Check if we need to exit the streaming loop. */
 				if (endofwal)
+				{
+					/* Send WAL statistics to the stats collector. */
+					pgstat_send_wal(true);
 					break;
+				}
+
+				/* 
+				 * Send WAL statistics to the stats collector.
+				 * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
+				 * msec since we last sent one.
+				 */
+				pgstat_send_wal(false);
 
 				/*
 				 * Ideally we would reuse a WaitEventSet object repeatedly
@@ -874,6 +885,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 	while (nbytes > 0)
 	{
 		int			segbytes;
+		instr_time	start;
 
 		if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo, wal_segment_size))
 		{
@@ -931,7 +943,25 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 		/* OK to write the logs */
 		errno = 0;
 
+
+		/* Measure i/o timing to write WAL data */
+		if (track_wal_io_timing)
+			INSTR_TIME_SET_CURRENT(start);
+
 		byteswritten = pg_pwrite(recvFile, buf, segbytes, (off_t) startoff);
+
+		/* increment the i/o timing and the number of WAL data written */
+		if (track_wal_io_timing)
+		{
+			instr_time	duration;
+
+			INSTR_TIME_SET_CURRENT(duration);
+			INSTR_TIME_SUBTRACT(duration, start);
+			WalStats.m_wal_write_time = INSTR_TIME_GET_MICROSEC(duration);
+		}
+
+		WalStats.m_wal_write++;
+
 		if (byteswritten <= 0)
 		{
 			char		xlogfname[MAXFNAMELEN];
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 62bff52638..7296ef04ff 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1799,7 +1799,7 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_wal(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_COLS	5
+#define PG_STAT_GET_WAL_COLS	9
 	TupleDesc	tupdesc;
 	Datum		values[PG_STAT_GET_WAL_COLS];
 	bool		nulls[PG_STAT_GET_WAL_COLS];
@@ -1820,7 +1820,15 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 					   NUMERICOID, -1, 0);
 	TupleDescInitEntry(tupdesc, (AttrNumber) 4, "wal_buffers_full",
 					   INT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 5, "stats_reset",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 5, "wal_write",
+					   INT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 6, "wal_write_time",
+					   FLOAT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 7, "wal_sync",
+					   INT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 8, "wal_sync_time",
+					   FLOAT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 9, "stats_reset",
 					   TIMESTAMPTZOID, -1, 0);
 
 	BlessTupleDesc(tupdesc);
@@ -1840,7 +1848,17 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 									Int32GetDatum(-1));
 
 	values[3] = Int64GetDatum(wal_stats->wal_buffers_full);
-	values[4] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
+	values[4] = Int64GetDatum(wal_stats->wal_write);
+
+	/* convert counter from microsec to millisec for display */
+	values[5] = Float8GetDatum((double) wal_stats->wal_write_time / 1000.0);
+
+	values[6] = Int64GetDatum(wal_stats->wal_sync);
+
+	/* convert counter from microsec to millisec for display */
+	values[7] = Float8GetDatum((double) wal_stats->wal_sync_time / 1000.0);
+
+	values[8] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
 
 	/* Returns the record as Datum */
 	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 17579eeaca..ac6f0cd4ef 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1485,6 +1485,15 @@ static struct config_bool ConfigureNamesBool[] =
 		false,
 		NULL, NULL, NULL
 	},
+	{
+		{"track_wal_io_timing", PGC_SUSET, STATS_COLLECTOR,
+			gettext_noop("Collects timing statistics for WAL I/O activity."),
+			NULL
+		},
+		&track_wal_io_timing,
+		false,
+		NULL, NULL, NULL
+	},
 
 	{
 		{"update_process_title", PGC_SUSET, PROCESS_TITLE,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 8930a94fff..4dc79cf822 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -585,6 +585,7 @@
 #track_activities = on
 #track_counts = on
 #track_io_timing = off
+#track_wal_io_timing = off
 #track_functions = none			# none, pl, all
 #track_activity_query_size = 1024	# (change requires restart)
 #stats_temp_directory = 'pg_stat_tmp'
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 75ec1073bd..1e53d9d4ca 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -131,6 +131,7 @@ extern int	recovery_min_apply_delay;
 extern char *PrimaryConnInfo;
 extern char *PrimarySlotName;
 extern bool wal_receiver_create_temp_slot;
+extern bool track_wal_io_timing;
 
 /* indirectly set via GUC system */
 extern TransactionId recoveryTargetXid;
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index b5f52d4e4a..9fe8a72105 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5531,13 +5531,13 @@
   proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
   prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
 
-{ oid => '1136', descr => 'statistics: information about WAL activity',
-  proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
-  proparallel => 'r', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int8,int8,numeric,int8,timestamptz}',
-  proargmodes => '{o,o,o,o,o}',
-  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,stats_reset}',
-  prosrc => 'pg_stat_get_wal' },
+ { oid => '1136', descr => 'statistics: information about WAL activity',
+   proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
+   proparallel => 'r', prorettype => 'record', proargtypes => '',
+   proallargtypes => '{int8,int8,numeric,int8,int8,float8,int8,float8,timestamptz}',
+   proargmodes => '{o,o,o,o,o,o,o,o,o}',
+   proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,wal_write,wal_write_time,wal_sync,wal_sync_time,stats_reset}',
+   prosrc => 'pg_stat_get_wal' },
 
 { oid => '2306', descr => 'statistics: information about SLRU caches',
   proname => 'pg_stat_get_slru', prorows => '100', proisstrict => 'f',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 724068cf87..8ef959c0cc 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -474,6 +474,10 @@ typedef struct PgStat_MsgWal
 	PgStat_Counter m_wal_fpi;
 	uint64		m_wal_bytes;
 	PgStat_Counter m_wal_buffers_full;
+	PgStat_Counter m_wal_write;
+	PgStat_Counter m_wal_write_time;	/* time spend writing wal records in micro seconds */
+	PgStat_Counter m_wal_sync;
+	PgStat_Counter m_wal_sync_time;		/* time spend syncing wal records in micro seconds */
 } PgStat_MsgWal;
 
 /* ----------
@@ -839,6 +843,10 @@ typedef struct PgStat_WalStats
 	PgStat_Counter wal_fpi;
 	uint64		wal_bytes;
 	PgStat_Counter wal_buffers_full;
+	PgStat_Counter wal_write;
+	PgStat_Counter wal_write_time;
+	PgStat_Counter wal_sync;
+	PgStat_Counter wal_sync_time;
 	TimestampTz stat_reset_timestamp;
 } PgStat_WalStats;
 
@@ -1590,7 +1598,7 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
 
 extern void pgstat_send_archiver(const char *xlog, bool failed);
 extern void pgstat_send_bgwriter(void);
-extern void pgstat_send_wal(void);
+extern void pgstat_send_wal(bool force);
 
 /* ----------
  * Support functions for the SQL-callable functions to
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 6173473de9..bc3909fd17 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2158,8 +2158,12 @@ pg_stat_wal| SELECT w.wal_records,
     w.wal_fpi,
     w.wal_bytes,
     w.wal_buffers_full,
+    w.wal_write,
+    w.wal_write_time,
+    w.wal_sync,
+    w.wal_sync_time,
     w.stats_reset
-   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, stats_reset);
+   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, wal_write, wal_write_time, wal_sync, wal_sync_time, stats_reset);
 pg_stat_wal_receiver| SELECT s.pid,
     s.status,
     s.receive_start_lsn,
-- 
2.25.1

#21

Masahiko Sawada

sawada.mshk@gmail.com

almost 5 years ago

In reply to: Masahiro Ikeda (#20)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On Mon, Jan 25, 2021 at 4:51 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:

Hi, thanks for the reviews.

I updated the attached patch.

Thank you for updating the patch!

The summary of the changes is following.

1. fix document

I followed another view's comments.

2. refactor issue_xlog_fsync()

I removed "sync_called" variables, narrowed the "duration" scope and
change the switch statement to if statement.

Looking at the code again, I think if we check if an fsync was really
called when calculating the I/O time, it's better to check that before
starting the measurement.

bool issue_fsync = false;

if (enableFsync &&
(sync_method == SYNC_METHOD_FSYNC ||
sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
sync_method == SYNC_METHOD_FDATASYNC))
{
if (track_wal_io_timing)
INSTR_TIME_SET_CURRENT(start);
issue_fsync = true;
}
(snip)
if (issue_fsync)
{
if (track_wal_io_timing)
{
instr_time duration;

INSTR_TIME_SET_CURRENT(duration);
INSTR_TIME_SUBTRACT(duration, start);
WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
}
WalStats.m_wal_sync++;
}

So I prefer either the above which is a modified version of the
original approach or my idea that doesn’t introduce a new local
variable I proposed before. But I'm not going to insist on that.

3. make wal-receiver report WAL statistics

I add the code to collect the statistics for a written operation
in XLogWalRcvWrite() and to report stats in WalReceiverMain().

Since WalReceiverMain() can loop fast, to avoid loading stats collector,
I add "force" argument to the pgstat_send_wal function. If "force" is
false, it can skip reporting until at least 500 msec since it last
reported. WalReceiverMain() almost calls pgstat_send_wal() with "force"
as false.

void
-pgstat_send_wal(void)
+pgstat_send_wal(bool force)
{
/* We assume this initializes to zeroes */
static const PgStat_MsgWal all_zeroes;
+ static TimestampTz last_report = 0;

+ TimestampTz now;
WalUsage walusage;

+   /*
+    * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
+    * msec since we last sent one or specified "force".
+    */
+   now = GetCurrentTimestamp();
+   if (!force &&
+       !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
+       return;
+
+   last_report = now;

Hmm, I don’t think it's good to use PGSTAT_STAT_INTERVAL for this
purpose since it is used as a minimum time for stats file updates. If
we want an interval, I think we should define another one Also, with
the patch, pgstat_send_wal() calls GetCurrentTimestamp() every time
even if track_wal_io_timing is off, which is not good. On the other
hand, I agree that your concern that the wal receiver should not send
the stats for whenever receiving wal records. So an idea could be to
send the wal stats when finishing the current WAL segment file and
when timeout in the main loop. That way we can guarantee that the wal
stats on a replica is updated at least every time finishing a WAL
segment file when actively receiving WAL records and every
NAPTIME_PER_CYCLE in other cases.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#22

Masahiro Ikeda

ikedamsh@oss.nttdata.com

almost 5 years ago

In reply to: Masahiko Sawada (#21)

2 attachment(s)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021-01-26 00:03, Masahiko Sawada wrote:

On Mon, Jan 25, 2021 at 4:51 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

Hi, thanks for the reviews.

I updated the attached patch.

Thank you for updating the patch!

The summary of the changes is following.

1. fix document

I followed another view's comments.

2. refactor issue_xlog_fsync()

I removed "sync_called" variables, narrowed the "duration" scope and
change the switch statement to if statement.

Looking at the code again, I think if we check if an fsync was really
called when calculating the I/O time, it's better to check that before
starting the measurement.

bool issue_fsync = false;

if (enableFsync &&
(sync_method == SYNC_METHOD_FSYNC ||
sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
sync_method == SYNC_METHOD_FDATASYNC))
{
if (track_wal_io_timing)
INSTR_TIME_SET_CURRENT(start);
issue_fsync = true;
}
(snip)
if (issue_fsync)
{
if (track_wal_io_timing)
{
instr_time duration;

INSTR_TIME_SET_CURRENT(duration);
INSTR_TIME_SUBTRACT(duration, start);
WalStats.m_wal_sync_time =
INSTR_TIME_GET_MICROSEC(duration);
}
WalStats.m_wal_sync++;
}

So I prefer either the above which is a modified version of the
original approach or my idea that doesn’t introduce a new local
variable I proposed before. But I'm not going to insist on that.

Thanks for the comments.
I change the code to the above.

3. make wal-receiver report WAL statistics

I add the code to collect the statistics for a written operation
in XLogWalRcvWrite() and to report stats in WalReceiverMain().

Since WalReceiverMain() can loop fast, to avoid loading stats
collector,
I add "force" argument to the pgstat_send_wal function. If "force" is
false, it can skip reporting until at least 500 msec since it last
reported. WalReceiverMain() almost calls pgstat_send_wal() with
"force"
as false.

void
-pgstat_send_wal(void)
+pgstat_send_wal(bool force)
{
/* We assume this initializes to zeroes */
static const PgStat_MsgWal all_zeroes;
+ static TimestampTz last_report = 0;

+ TimestampTz now;
WalUsage walusage;
+   /*
+    * Don't send a message unless it's been at least 
PGSTAT_STAT_INTERVAL
+    * msec since we last sent one or specified "force".
+    */
+   now = GetCurrentTimestamp();
+   if (!force &&
+       !TimestampDifferenceExceeds(last_report, now, 
PGSTAT_STAT_INTERVAL))
+       return;
+
+   last_report = now;
Hmm, I don’t think it's good to use PGSTAT_STAT_INTERVAL for this
purpose since it is used as a minimum time for stats file updates. If
we want an interval, I think we should define another one Also, with
the patch, pgstat_send_wal() calls GetCurrentTimestamp() every time
even if track_wal_io_timing is off, which is not good. On the other
hand, I agree that your concern that the wal receiver should not send
the stats for whenever receiving wal records. So an idea could be to
send the wal stats when finishing the current WAL segment file and
when timeout in the main loop. That way we can guarantee that the wal
stats on a replica is updated at least every time finishing a WAL
segment file when actively receiving WAL records and every
NAPTIME_PER_CYCLE in other cases.

I agree with your comments. I think it should report when
reaching the end of WAL too. I add the code to report the stats
when finishing the current WAL segment file when timeout in the
main loop and when reaching the end of WAL.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachments:

v6_v7.difftext/x-diff; name=v6_v7.diffDownload

--- v6-0001-Add-statistics-related-to-write-sync-wal-records.patch	2021-01-25 16:27:50.749429666 +0900
+++ v7-0001-Add-statistics-related-to-write-sync-wal-records.patch	2021-01-26 08:19:48.269760642 +0900
@@ -1,6 +1,6 @@
-From e9aad92097c5cff5565b67ce1a8ec6d7b4c8a4d9 Mon Sep 17 00:00:00 2001
+From 02f0888efeb09ae641d9ef905788d995d687c56f Mon Sep 17 00:00:00 2001
 From: Masahiro Ikeda <ikedamsh@oss.nttdata.com>
-Date: Mon, 25 Jan 2021 16:26:04 +0900
+Date: Tue, 26 Jan 2021 08:18:37 +0900
 Subject: [PATCH] Add statistics related to write/sync wal records.
 
 This patch adds following statistics to pg_stat_wal view
@@ -22,24 +22,24 @@
 Reviewed-By: Japin Li, Hayato Kuroda, Masahiko Sawada
 Discussion: https://postgr.es/m/0509ad67b585a5b86a83d445dfa75392@oss.nttdata.com
 
-(This requires a catversion bump, as well as an update to PGSTAT_FILE_FORMAT_ID)
+(This requires a catversion bump, as well as an update to PGSTAT_FILE_FORMAT_ID
 ---
- doc/src/sgml/config.sgml                      | 21 ++++++++
- doc/src/sgml/monitoring.sgml                  | 48 ++++++++++++++++-
- src/backend/access/transam/xlog.c             | 51 ++++++++++++++++++-
+ doc/src/sgml/config.sgml                      | 21 +++++++
+ doc/src/sgml/monitoring.sgml                  | 48 +++++++++++++++-
+ src/backend/access/transam/xlog.c             | 56 ++++++++++++++++++-
  src/backend/catalog/system_views.sql          |  4 ++
  src/backend/postmaster/checkpointer.c         |  2 +-
- src/backend/postmaster/pgstat.c               | 26 ++++++++--
- src/backend/postmaster/walwriter.c            |  3 ++
- src/backend/replication/walreceiver.c         | 30 +++++++++++
- src/backend/utils/adt/pgstatfuncs.c           | 24 +++++++--
- src/backend/utils/misc/guc.c                  |  9 ++++
+ src/backend/postmaster/pgstat.c               |  4 ++
+ src/backend/postmaster/walwriter.c            |  3 +
+ src/backend/replication/walreceiver.c         | 34 +++++++++++
+ src/backend/utils/adt/pgstatfuncs.c           | 24 +++++++-
+ src/backend/utils/misc/guc.c                  |  9 +++
  src/backend/utils/misc/postgresql.conf.sample |  1 +
  src/include/access/xlog.h                     |  1 +
  src/include/catalog/pg_proc.dat               | 14 ++---
- src/include/pgstat.h                          | 10 +++-
- src/test/regress/expected/rules.out           |  6 ++-
- 15 files changed, 232 insertions(+), 18 deletions(-)
+ src/include/pgstat.h                          |  8 +++
+ src/test/regress/expected/rules.out           |  6 +-
+ 15 files changed, 221 insertions(+), 14 deletions(-)
 
 diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
 index 82864bbb24..43f3fbcaf8 100644
@@ -133,7 +133,7 @@
       </row>
  
 diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
-index 470e113b33..1c4860bee7 100644
+index 470e113b33..f780a2eb4f 100644
 --- a/src/backend/access/transam/xlog.c
 +++ b/src/backend/access/transam/xlog.c
 @@ -110,6 +110,7 @@ int			CommitDelay = 0;	/* precommit delay in microseconds */
@@ -165,7 +165,7 @@
  				written = pg_pwrite(openLogFile, from, nleft, startoffset);
  				pgstat_report_wait_end();
 +
-+				/* increment the i/o timing and the number of WAL data written */
++				/* Increment the i/o timing and the number of WAL data written */
 +				if (track_wal_io_timing)
 +				{
 +					instr_time	duration;
@@ -180,47 +180,52 @@
  				if (written <= 0)
  				{
  					char		xlogfname[MAXFNAMELEN];
-@@ -10565,7 +10585,12 @@ assign_xlog_sync_method(int new_sync_method, void *extra)
+@@ -10565,7 +10585,22 @@ assign_xlog_sync_method(int new_sync_method, void *extra)
  void
  issue_xlog_fsync(int fd, XLogSegNo segno)
  {
 -	char	   *msg = NULL;
 +	char		*msg = NULL;
++	bool		issue_fsync = false;
 +	instr_time	start;
 +
-+	/* Measure i/o timing to sync WAL data.*/
-+	if (track_wal_io_timing)
-+		INSTR_TIME_SET_CURRENT(start);
++	/* Check whether to sync WAL data to the disk right now */
++	if (enableFsync &&
++		(sync_method == SYNC_METHOD_FSYNC ||
++		 sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
++		 sync_method == SYNC_METHOD_FDATASYNC))
++	{
++		/* Measure i/o timing to sync WAL data */
++		if (track_wal_io_timing)
++			INSTR_TIME_SET_CURRENT(start);
++
++		issue_fsync = true;
++	}
  
  	pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
  	switch (sync_method)
-@@ -10610,6 +10635,30 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
+@@ -10610,6 +10645,25 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
  	}
  
  	pgstat_report_wait_end();
 +
-+	/* 
-+	 * check whether to sync WAL data to the disk right now because 
++	/*
++	 * Increment the i/o timing and the number of WAL data synced.
++	 *
++	 * Check whether to sync WAL data to the disk right now because
 +	 * statistics must be incremented when syncing really occurred.
 +	 */
-+	if (enableFsync)
++	if (issue_fsync)
 +	{
-+		if ((sync_method == SYNC_METHOD_FSYNC) ||
-+			(sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH) ||
-+			(sync_method == SYNC_METHOD_FDATASYNC))
++		if (track_wal_io_timing)
 +		{
-+			/* increment the i/o timing and the number of WAL data synced */
-+			if (track_wal_io_timing)
-+			{
-+				instr_time	duration;
-+
-+				INSTR_TIME_SET_CURRENT(duration);
-+				INSTR_TIME_SUBTRACT(duration, start);
-+				WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
-+			}
++			instr_time duration;
 +
-+			WalStats.m_wal_sync++;
++			INSTR_TIME_SET_CURRENT(duration);
++			INSTR_TIME_SUBTRACT(duration, start);
++			WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
 +		}
++		WalStats.m_wal_sync++;
 +	}
  }
  
@@ -241,68 +246,23 @@
      FROM pg_stat_get_wal() w;
  
 diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
-index 54a818bf61..5d14a97e56 100644
+index 54a818bf61..80da8acaa4 100644
 --- a/src/backend/postmaster/checkpointer.c
 +++ b/src/backend/postmaster/checkpointer.c
-@@ -505,7 +505,7 @@ CheckpointerMain(void)
+@@ -504,7 +504,7 @@ CheckpointerMain(void)
+ 		 */
  		pgstat_send_bgwriter();
  
- 		/* Send WAL statistics to the stats collector. */
--		pgstat_send_wal();
-+		pgstat_send_wal(true);
+-		/* Send WAL statistics to the stats collector. */
++		/* Send WAL statistics to stats collector */
+ 		pgstat_send_wal();
  
  		/*
- 		 * If any checkpoint flags have been set, redo the loop to handle the
 diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
-index f75b52719d..256d8706ca 100644
+index f75b52719d..987bbd058d 100644
 --- a/src/backend/postmaster/pgstat.c
 +++ b/src/backend/postmaster/pgstat.c
-@@ -975,7 +975,7 @@ pgstat_report_stat(bool disconnect)
- 	pgstat_send_funcstats();
- 
- 	/* Send WAL statistics */
--	pgstat_send_wal();
-+	pgstat_send_wal(true);
- 
- 	/* Finally send SLRU statistics */
- 	pgstat_send_slru();
-@@ -4669,17 +4669,33 @@ pgstat_send_bgwriter(void)
- /* ----------
-  * pgstat_send_wal() -
-  *
-- *		Send WAL statistics to the collector
-+ *		Send WAL statistics to the collector.
-+ *
-+ *		If force is false, don't send a message unless it's been at 
-+ *		least PGSTAT_STAT_INTERVAL msec since we last sent one.
-  * ----------
-  */
- void
--pgstat_send_wal(void)
-+pgstat_send_wal(bool force)
- {
- 	/* We assume this initializes to zeroes */
- 	static const PgStat_MsgWal all_zeroes;
-+	static TimestampTz last_report = 0;
- 
-+	TimestampTz	now;
- 	WalUsage	walusage;
- 
-+	/*
-+	 * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-+	 * msec since we last sent one or specified "force".
-+	 */
-+	now = GetCurrentTimestamp();
-+	if (!force &&
-+		!TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-+		return;
-+
-+	last_report = now;
-+
- 	/*
- 	 * Calculate how much WAL usage counters are increased by substracting the
- 	 * previous counters from the current ones. Fill the results in WAL stats
-@@ -6892,6 +6908,10 @@ pgstat_recv_wal(PgStat_MsgWal *msg, int len)
+@@ -6892,6 +6892,10 @@ pgstat_recv_wal(PgStat_MsgWal *msg, int len)
  	walStats.wal_fpi += msg->m_wal_fpi;
  	walStats.wal_bytes += msg->m_wal_bytes;
  	walStats.wal_buffers_full += msg->m_wal_buffers_full;
@@ -314,7 +274,7 @@
  
  /* ----------
 diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
-index 4f1a8e356b..7fd56d1497 100644
+index 4f1a8e356b..104cba4581 100644
 --- a/src/backend/postmaster/walwriter.c
 +++ b/src/backend/postmaster/walwriter.c
 @@ -253,6 +253,9 @@ WalWriterMain(void)
@@ -322,35 +282,42 @@
  			left_till_hibernate--;
  
 +		/* Send WAL statistics */
-+		pgstat_send_wal(true);
++		pgstat_send_wal();
 +
  		/*
  		 * Sleep until we are signaled or WalWriterDelay has elapsed.  If we
  		 * haven't done anything useful for quite some time, lengthen the
 diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
-index 723f513d8b..901b194773 100644
+index 723f513d8b..61e98c6eca 100644
 --- a/src/backend/replication/walreceiver.c
 +++ b/src/backend/replication/walreceiver.c
-@@ -485,7 +485,18 @@ WalReceiverMain(void)
+@@ -485,7 +485,11 @@ WalReceiverMain(void)
  
  				/* Check if we need to exit the streaming loop. */
  				if (endofwal)
 +				{
-+					/* Send WAL statistics to the stats collector. */
-+					pgstat_send_wal(true);
++					/* Send WAL statistics to stats collector */
++					pgstat_send_wal();
  					break;
 +				}
-+
-+				/* 
-+				 * Send WAL statistics to the stats collector.
-+				 * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-+				 * msec since we last sent one.
-+				 */
-+				pgstat_send_wal(false);
  
  				/*
  				 * Ideally we would reuse a WaitEventSet object repeatedly
-@@ -874,6 +885,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
+@@ -550,8 +554,13 @@ WalReceiverMain(void)
+ 														wal_receiver_timeout);
+ 
+ 						if (now >= timeout)
++						{
++							/* Send WAL statistics to stats collector before terminating */
++							pgstat_send_wal();
++
+ 							ereport(ERROR,
+ 									(errmsg("terminating walreceiver due to timeout")));
++						}
+ 
+ 						/*
+ 						 * We didn't receive anything new, for half of
+@@ -874,6 +883,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
  	while (nbytes > 0)
  	{
  		int			segbytes;
@@ -358,11 +325,24 @@
  
  		if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo, wal_segment_size))
  		{
-@@ -931,7 +943,25 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
+@@ -910,6 +920,13 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
+ 					XLogArchiveForceDone(xlogfname);
+ 				else
+ 					XLogArchiveNotify(xlogfname);
++
++				/*
++				 * Send WAL statistics to stats collector when finishing the
++				 * current WAL segment file to avoid loading stats collector.
++				 */
++				pgstat_send_wal();
++
+ 			}
+ 			recvFile = -1;
+ 
+@@ -931,7 +948,24 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
  		/* OK to write the logs */
  		errno = 0;
  
-+
 +		/* Measure i/o timing to write WAL data */
 +		if (track_wal_io_timing)
 +			INSTR_TIME_SET_CURRENT(start);
@@ -503,7 +483,7 @@
  { oid => '2306', descr => 'statistics: information about SLRU caches',
    proname => 'pg_stat_get_slru', prorows => '100', proisstrict => 'f',
 diff --git a/src/include/pgstat.h b/src/include/pgstat.h
-index 724068cf87..8ef959c0cc 100644
+index 724068cf87..e689d27480 100644
 --- a/src/include/pgstat.h
 +++ b/src/include/pgstat.h
 @@ -474,6 +474,10 @@ typedef struct PgStat_MsgWal
@@ -528,15 +508,6 @@
  	TimestampTz stat_reset_timestamp;
  } PgStat_WalStats;
  
-@@ -1590,7 +1598,7 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
- 
- extern void pgstat_send_archiver(const char *xlog, bool failed);
- extern void pgstat_send_bgwriter(void);
--extern void pgstat_send_wal(void);
-+extern void pgstat_send_wal(bool force);
- 
- /* ----------
-  * Support functions for the SQL-callable functions to
 diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
 index 6173473de9..bc3909fd17 100644
 --- a/src/test/regress/expected/rules.out

v7-0001-Add-statistics-related-to-write-sync-wal-records.patchtext/x-diff; name=v7-0001-Add-statistics-related-to-write-sync-wal-records.patchDownload

From 02f0888efeb09ae641d9ef905788d995d687c56f Mon Sep 17 00:00:00 2001
From: Masahiro Ikeda <ikedamsh@oss.nttdata.com>
Date: Tue, 26 Jan 2021 08:18:37 +0900
Subject: [PATCH] Add statistics related to write/sync wal records.

This patch adds following statistics to pg_stat_wal view
to track WAL I/O activity.

- the total number of writing/syncing WAL data.
- the total amount of time that has been spent in
  writing/syncing WAL data.

Since to track I/O timing may leads significant overhead,
GUC parameter "track_wal_io_timing" is introduced.
Only if this is on, the I/O timing is measured.

The statistics related to sync are zero when "wal_sync_method"
is "open_datasync" or "open_sync", because it leads syncing
WAL data at same time when to write it.

Author: Masahiro Ikeda
Reviewed-By: Japin Li, Hayato Kuroda, Masahiko Sawada
Discussion: https://postgr.es/m/0509ad67b585a5b86a83d445dfa75392@oss.nttdata.com

(This requires a catversion bump, as well as an update to PGSTAT_FILE_FORMAT_ID
---
 doc/src/sgml/config.sgml                      | 21 +++++++
 doc/src/sgml/monitoring.sgml                  | 48 +++++++++++++++-
 src/backend/access/transam/xlog.c             | 56 ++++++++++++++++++-
 src/backend/catalog/system_views.sql          |  4 ++
 src/backend/postmaster/checkpointer.c         |  2 +-
 src/backend/postmaster/pgstat.c               |  4 ++
 src/backend/postmaster/walwriter.c            |  3 +
 src/backend/replication/walreceiver.c         | 34 +++++++++++
 src/backend/utils/adt/pgstatfuncs.c           | 24 +++++++-
 src/backend/utils/misc/guc.c                  |  9 +++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/access/xlog.h                     |  1 +
 src/include/catalog/pg_proc.dat               | 14 ++---
 src/include/pgstat.h                          |  8 +++
 src/test/regress/expected/rules.out           |  6 +-
 15 files changed, 221 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 82864bbb24..43f3fbcaf8 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7416,6 +7416,27 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-track-wal-io-timing" xreflabel="track_wal_io_timing">
+      <term><varname>track_wal_io_timing</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>track_wal_io_timing</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables timing of WAL I/O calls. This parameter is off by default,
+        because it will repeatedly query the operating system for
+        the current time, which may cause significant overhead on some
+        platforms.  You can use the <xref linkend="pgtesttiming"/> tool to
+        measure the overhead of timing on your system.
+        I/O timing information is
+        displayed in <link linkend="monitoring-pg-stat-wal-view">
+        <structname>pg_stat_wal</structname></link>.  Only superusers can
+        change this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-track-functions" xreflabel="track_functions">
       <term><varname>track_functions</varname> (<type>enum</type>)
       <indexterm>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index f05140dd42..5a8fc4eb0c 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3485,7 +3485,53 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
        <structfield>wal_buffers_full</structfield> <type>bigint</type>
       </para>
       <para>
-       Number of times WAL data was written to disk because WAL buffers became full
+       Total number of WAL data written to disk because WAL buffers became full
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Total number of WAL data written to disk
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time that has been spent in the portion of
+       WAL data was written to disk, in milliseconds
+       (if <xref linkend="guc-track-wal-io-timing"/> is enabled, otherwise zero).
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Total number of WAL data synced to disk
+       (if <xref linkend="guc-wal-sync-method"/> is <literal>open_datasync</literal> or 
+       <literal>open_sync</literal>, this value is zero because WAL data is synced 
+       when to write it).
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time that has been spent in the portion of
+       WAL data was synced to disk, in milliseconds
+       (if <xref linkend="guc-track-wal-io-timing"/> is enabled, otherwise zero.
+       if <xref linkend="guc-wal-sync-method"/> is <literal>open_datasync</literal> or 
+       <literal>open_sync</literal>, this value is zero too because WAL data is synced 
+       when to write it).
       </para></entry>
      </row>
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 470e113b33..f780a2eb4f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -110,6 +110,7 @@ int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
 int			max_slot_wal_keep_size_mb = -1;
+bool		track_wal_io_timing = false;
 
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
@@ -2540,6 +2541,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			int			written;
+			instr_time	start;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
@@ -2548,9 +2550,27 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			do
 			{
 				errno = 0;
+
+				/* Measure i/o timing to write WAL data */
+				if (track_wal_io_timing)
+					INSTR_TIME_SET_CURRENT(start);
+
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 				written = pg_pwrite(openLogFile, from, nleft, startoffset);
 				pgstat_report_wait_end();
+
+				/* Increment the i/o timing and the number of WAL data written */
+				if (track_wal_io_timing)
+				{
+					instr_time	duration;
+
+					INSTR_TIME_SET_CURRENT(duration);
+					INSTR_TIME_SUBTRACT(duration, start);
+					WalStats.m_wal_write_time = INSTR_TIME_GET_MICROSEC(duration);
+				}
+
+				WalStats.m_wal_write++;
+
 				if (written <= 0)
 				{
 					char		xlogfname[MAXFNAMELEN];
@@ -10565,7 +10585,22 @@ assign_xlog_sync_method(int new_sync_method, void *extra)
 void
 issue_xlog_fsync(int fd, XLogSegNo segno)
 {
-	char	   *msg = NULL;
+	char		*msg = NULL;
+	bool		issue_fsync = false;
+	instr_time	start;
+
+	/* Check whether to sync WAL data to the disk right now */
+	if (enableFsync &&
+		(sync_method == SYNC_METHOD_FSYNC ||
+		 sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
+		 sync_method == SYNC_METHOD_FDATASYNC))
+	{
+		/* Measure i/o timing to sync WAL data */
+		if (track_wal_io_timing)
+			INSTR_TIME_SET_CURRENT(start);
+
+		issue_fsync = true;
+	}
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
 	switch (sync_method)
@@ -10610,6 +10645,25 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
 	}
 
 	pgstat_report_wait_end();
+
+	/*
+	 * Increment the i/o timing and the number of WAL data synced.
+	 *
+	 * Check whether to sync WAL data to the disk right now because
+	 * statistics must be incremented when syncing really occurred.
+	 */
+	if (issue_fsync)
+	{
+		if (track_wal_io_timing)
+		{
+			instr_time duration;
+
+			INSTR_TIME_SET_CURRENT(duration);
+			INSTR_TIME_SUBTRACT(duration, start);
+			WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
+		}
+		WalStats.m_wal_sync++;
+	}
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fa58afd9d7..b8ace4fc41 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1004,6 +1004,10 @@ CREATE VIEW pg_stat_wal AS
         w.wal_fpi,
         w.wal_bytes,
         w.wal_buffers_full,
+        w.wal_write,
+        w.wal_write_time,
+        w.wal_sync,
+        w.wal_sync_time,
         w.stats_reset
     FROM pg_stat_get_wal() w;
 
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 54a818bf61..80da8acaa4 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -504,7 +504,7 @@ CheckpointerMain(void)
 		 */
 		pgstat_send_bgwriter();
 
-		/* Send WAL statistics to the stats collector. */
+		/* Send WAL statistics to stats collector */
 		pgstat_send_wal();
 
 		/*
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f75b52719d..987bbd058d 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -6892,6 +6892,10 @@ pgstat_recv_wal(PgStat_MsgWal *msg, int len)
 	walStats.wal_fpi += msg->m_wal_fpi;
 	walStats.wal_bytes += msg->m_wal_bytes;
 	walStats.wal_buffers_full += msg->m_wal_buffers_full;
+	walStats.wal_write += msg->m_wal_write;
+	walStats.wal_write_time += msg->m_wal_write_time;
+	walStats.wal_sync += msg->m_wal_sync;
+	walStats.wal_sync_time += msg->m_wal_sync_time;
 }
 
 /* ----------
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index 4f1a8e356b..104cba4581 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -253,6 +253,9 @@ WalWriterMain(void)
 		else if (left_till_hibernate > 0)
 			left_till_hibernate--;
 
+		/* Send WAL statistics */
+		pgstat_send_wal();
+
 		/*
 		 * Sleep until we are signaled or WalWriterDelay has elapsed.  If we
 		 * haven't done anything useful for quite some time, lengthen the
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 723f513d8b..61e98c6eca 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -485,7 +485,11 @@ WalReceiverMain(void)
 
 				/* Check if we need to exit the streaming loop. */
 				if (endofwal)
+				{
+					/* Send WAL statistics to stats collector */
+					pgstat_send_wal();
 					break;
+				}
 
 				/*
 				 * Ideally we would reuse a WaitEventSet object repeatedly
@@ -550,8 +554,13 @@ WalReceiverMain(void)
 														wal_receiver_timeout);
 
 						if (now >= timeout)
+						{
+							/* Send WAL statistics to stats collector before terminating */
+							pgstat_send_wal();
+
 							ereport(ERROR,
 									(errmsg("terminating walreceiver due to timeout")));
+						}
 
 						/*
 						 * We didn't receive anything new, for half of
@@ -874,6 +883,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 	while (nbytes > 0)
 	{
 		int			segbytes;
+		instr_time	start;
 
 		if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo, wal_segment_size))
 		{
@@ -910,6 +920,13 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 					XLogArchiveForceDone(xlogfname);
 				else
 					XLogArchiveNotify(xlogfname);
+
+				/*
+				 * Send WAL statistics to stats collector when finishing the
+				 * current WAL segment file to avoid loading stats collector.
+				 */
+				pgstat_send_wal();
+
 			}
 			recvFile = -1;
 
@@ -931,7 +948,24 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 		/* OK to write the logs */
 		errno = 0;
 
+		/* Measure i/o timing to write WAL data */
+		if (track_wal_io_timing)
+			INSTR_TIME_SET_CURRENT(start);
+
 		byteswritten = pg_pwrite(recvFile, buf, segbytes, (off_t) startoff);
+
+		/* increment the i/o timing and the number of WAL data written */
+		if (track_wal_io_timing)
+		{
+			instr_time	duration;
+
+			INSTR_TIME_SET_CURRENT(duration);
+			INSTR_TIME_SUBTRACT(duration, start);
+			WalStats.m_wal_write_time = INSTR_TIME_GET_MICROSEC(duration);
+		}
+
+		WalStats.m_wal_write++;
+
 		if (byteswritten <= 0)
 		{
 			char		xlogfname[MAXFNAMELEN];
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 62bff52638..7296ef04ff 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1799,7 +1799,7 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_wal(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_COLS	5
+#define PG_STAT_GET_WAL_COLS	9
 	TupleDesc	tupdesc;
 	Datum		values[PG_STAT_GET_WAL_COLS];
 	bool		nulls[PG_STAT_GET_WAL_COLS];
@@ -1820,7 +1820,15 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 					   NUMERICOID, -1, 0);
 	TupleDescInitEntry(tupdesc, (AttrNumber) 4, "wal_buffers_full",
 					   INT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 5, "stats_reset",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 5, "wal_write",
+					   INT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 6, "wal_write_time",
+					   FLOAT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 7, "wal_sync",
+					   INT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 8, "wal_sync_time",
+					   FLOAT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 9, "stats_reset",
 					   TIMESTAMPTZOID, -1, 0);
 
 	BlessTupleDesc(tupdesc);
@@ -1840,7 +1848,17 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 									Int32GetDatum(-1));
 
 	values[3] = Int64GetDatum(wal_stats->wal_buffers_full);
-	values[4] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
+	values[4] = Int64GetDatum(wal_stats->wal_write);
+
+	/* convert counter from microsec to millisec for display */
+	values[5] = Float8GetDatum((double) wal_stats->wal_write_time / 1000.0);
+
+	values[6] = Int64GetDatum(wal_stats->wal_sync);
+
+	/* convert counter from microsec to millisec for display */
+	values[7] = Float8GetDatum((double) wal_stats->wal_sync_time / 1000.0);
+
+	values[8] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
 
 	/* Returns the record as Datum */
 	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 17579eeaca..ac6f0cd4ef 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1485,6 +1485,15 @@ static struct config_bool ConfigureNamesBool[] =
 		false,
 		NULL, NULL, NULL
 	},
+	{
+		{"track_wal_io_timing", PGC_SUSET, STATS_COLLECTOR,
+			gettext_noop("Collects timing statistics for WAL I/O activity."),
+			NULL
+		},
+		&track_wal_io_timing,
+		false,
+		NULL, NULL, NULL
+	},
 
 	{
 		{"update_process_title", PGC_SUSET, PROCESS_TITLE,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 8930a94fff..4dc79cf822 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -585,6 +585,7 @@
 #track_activities = on
 #track_counts = on
 #track_io_timing = off
+#track_wal_io_timing = off
 #track_functions = none			# none, pl, all
 #track_activity_query_size = 1024	# (change requires restart)
 #stats_temp_directory = 'pg_stat_tmp'
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 75ec1073bd..1e53d9d4ca 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -131,6 +131,7 @@ extern int	recovery_min_apply_delay;
 extern char *PrimaryConnInfo;
 extern char *PrimarySlotName;
 extern bool wal_receiver_create_temp_slot;
+extern bool track_wal_io_timing;
 
 /* indirectly set via GUC system */
 extern TransactionId recoveryTargetXid;
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index b5f52d4e4a..9fe8a72105 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5531,13 +5531,13 @@
   proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
   prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
 
-{ oid => '1136', descr => 'statistics: information about WAL activity',
-  proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
-  proparallel => 'r', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int8,int8,numeric,int8,timestamptz}',
-  proargmodes => '{o,o,o,o,o}',
-  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,stats_reset}',
-  prosrc => 'pg_stat_get_wal' },
+ { oid => '1136', descr => 'statistics: information about WAL activity',
+   proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
+   proparallel => 'r', prorettype => 'record', proargtypes => '',
+   proallargtypes => '{int8,int8,numeric,int8,int8,float8,int8,float8,timestamptz}',
+   proargmodes => '{o,o,o,o,o,o,o,o,o}',
+   proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,wal_write,wal_write_time,wal_sync,wal_sync_time,stats_reset}',
+   prosrc => 'pg_stat_get_wal' },
 
 { oid => '2306', descr => 'statistics: information about SLRU caches',
   proname => 'pg_stat_get_slru', prorows => '100', proisstrict => 'f',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 724068cf87..e689d27480 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -474,6 +474,10 @@ typedef struct PgStat_MsgWal
 	PgStat_Counter m_wal_fpi;
 	uint64		m_wal_bytes;
 	PgStat_Counter m_wal_buffers_full;
+	PgStat_Counter m_wal_write;
+	PgStat_Counter m_wal_write_time;	/* time spend writing wal records in micro seconds */
+	PgStat_Counter m_wal_sync;
+	PgStat_Counter m_wal_sync_time;		/* time spend syncing wal records in micro seconds */
 } PgStat_MsgWal;
 
 /* ----------
@@ -839,6 +843,10 @@ typedef struct PgStat_WalStats
 	PgStat_Counter wal_fpi;
 	uint64		wal_bytes;
 	PgStat_Counter wal_buffers_full;
+	PgStat_Counter wal_write;
+	PgStat_Counter wal_write_time;
+	PgStat_Counter wal_sync;
+	PgStat_Counter wal_sync_time;
 	TimestampTz stat_reset_timestamp;
 } PgStat_WalStats;
 
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 6173473de9..bc3909fd17 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2158,8 +2158,12 @@ pg_stat_wal| SELECT w.wal_records,
     w.wal_fpi,
     w.wal_bytes,
     w.wal_buffers_full,
+    w.wal_write,
+    w.wal_write_time,
+    w.wal_sync,
+    w.wal_sync_time,
     w.stats_reset
-   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, stats_reset);
+   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, wal_write, wal_write_time, wal_sync, wal_sync_time, stats_reset);
 pg_stat_wal_receiver| SELECT s.pid,
     s.status,
     s.receive_start_lsn,
-- 
2.25.1

#23

David G. Johnston

david.g.johnston@gmail.com

almost 5 years ago

In reply to: Masahiko Sawada (#21)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On Mon, Jan 25, 2021 at 8:03 AM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:

On Mon, Jan 25, 2021 at 4:51 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com>
wrote:

Hi, thanks for the reviews.

I updated the attached patch.

Thank you for updating the patch!

Your original email with "total number of times" is more correct, removing
the "of times" and just writing "total number of WAL" is not good wording.

Specifically, this change is strictly worse than the original.

-       Number of times WAL data was written to disk because WAL buffers
became full
+       Total number of WAL data written to disk because WAL buffers became
full

Both have the flaw that they leave implied exactly what it means to "write
WAL to disk". It is also unclear whether a counter, bytes, or both, would
be more useful here. I've incorporated this into my documentation
suggestions below:

(wal_buffers_full)
-- Revert - the original was better, though maybe add more detail similar
to the below. I didn't research exactly how this works.

(wal_write)
The number of times WAL buffers were written out to disk via XLogWrite

-- Seems like this should have a bytes version too

(wal_write_time)
The amount of time spent writing WAL buffers to disk, excluding sync time
unless the wal_sync_method is either open_datasync or open_sync.
Units are in milliseconds with microsecond resolution. This is zero when
track_wal_io_timing is disabled.

(wal_sync)
The number of times WAL files were synced to disk while wal_sync_method was
set to one of the "sync at commit" options (i.e., fdatasync, fsync,
or fsync_writethrough).

-- it is not going to be zero just because those settings are presently
disabled as they could have been enabled at some point since the last time
these statistics were reset.

(wal_sync_time)
The amount of time spent syncing WAL files to disk, in milliseconds with
microsecond resolution. This requires setting wal_sync_method to one of
the "sync at commit" options (i.e., fdatasync, fsync,
or fsync_writethrough).

Also,

I would suggest extracting the changes to postmaster/pgstat.c and
replication/walreceiver.c to a separate patch as you've fundamentally
changed how it behaves with regards to that function and how it interacts
with the WAL receiver. That seems an entirely separate topic warranting
its own patch and discussion.

David J.

#24

David G. Johnston

david.g.johnston@gmail.com

almost 5 years ago

In reply to: Masahiro Ikeda (#22)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On Mon, Jan 25, 2021 at 4:37 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com>
wrote:

I agree with your comments. I think it should report when
reaching the end of WAL too. I add the code to report the stats
when finishing the current WAL segment file when timeout in the
main loop and when reaching the end of WAL.

The following is not an improvement:

- /* Send WAL statistics to the stats collector. */
+ /* Send WAL statistics to stats collector */

The word "the" there makes it proper English. Your copy-pasting should
have kept the existing good wording in the other locations rather than
replace the existing location with the newly added incorrect wording.

This doesn't make sense:

* current WAL segment file to avoid loading stats collector.

Maybe "overloading" or "overwhelming"?

I see you removed the pgstat_send_wal(force) change. The rest of my
comments on the v6 patch still stand I believe.

David J.

#25

Masahiro Ikeda

ikedamsh@oss.nttdata.com

almost 5 years ago

In reply to: David G. Johnston (#23)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

Hi, David.

Thanks for your comments.

On 2021-01-26 08:48, David G. Johnston wrote:

On Mon, Jan 25, 2021 at 8:03 AM Masahiko Sawada
<sawada.mshk@gmail.com> wrote:

On Mon, Jan 25, 2021 at 4:51 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

Hi, thanks for the reviews.

I updated the attached patch.

Thank you for updating the patch!

Your original email with "total number of times" is more correct,
removing the "of times" and just writing "total number of WAL" is not
good wording.

Specifically, this change is strictly worse than the original.
-       Number of times WAL data was written to disk because WAL
buffers became full
+       Total number of WAL data written to disk because WAL buffers
became full
Both have the flaw that they leave implied exactly what it means to
"write WAL to disk". It is also unclear whether a counter, bytes, or
both, would be more useful here. I've incorporated this into my
documentation suggestions below:
(wal_buffers_full)

-- Revert - the original was better, though maybe add more detail
similar to the below. I didn't research exactly how this works.

OK, I understood.
I reverted since this is a counter statistics.

(wal_write)
The number of times WAL buffers were written out to disk via XLogWrite

Thanks.

I thought it's better to omit "The" and "XLogWrite" because other views'
description
omits "The" and there is no description of "XlogWrite" in the documents.
What do you think?

-- Seems like this should have a bytes version too

Do you mean that we need to separate statistics for wal write?

(wal_write_time)
The amount of time spent writing WAL buffers to disk, excluding sync
time unless the wal_sync_method is either open_datasync or open_sync.
Units are in milliseconds with microsecond resolution. This is zero
when track_wal_io_timing is disabled.

Thanks, I'll fix it.

(wal_sync)
The number of times WAL files were synced to disk while
wal_sync_method was set to one of the "sync at commit" options (i.e.,
fdatasync, fsync, or fsync_writethrough).

Thanks, I'll fix it.

-- it is not going to be zero just because those settings are
presently disabled as they could have been enabled at some point since
the last time these statistics were reset.

Right, your description is correct.
The "track_wal_io_timing" has the same limitation, doesn't it?

(wal_sync_time)
The amount of time spent syncing WAL files to disk, in milliseconds
with microsecond resolution. This requires setting wal_sync_method to
one of the "sync at commit" options (i.e., fdatasync, fsync, or
fsync_writethrough).

Thanks, I'll fix it.
I will add the comments related to "track_wal_io_timing".

Also,

I would suggest extracting the changes to postmaster/pgstat.c and
replication/walreceiver.c to a separate patch as you've fundamentally
changed how it behaves with regards to that function and how it
interacts with the WAL receiver. That seems an entirely separate
topic warranting its own patch and discussion.

OK, I will separate two patches.

On 2021-01-26 08:52, David G. Johnston wrote:

On Mon, Jan 25, 2021 at 4:37 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

I agree with your comments. I think it should report when
reaching the end of WAL too. I add the code to report the stats
when finishing the current WAL segment file when timeout in the
main loop and when reaching the end of WAL.

The following is not an improvement:

- /* Send WAL statistics to the stats collector. */+ /* Send WAL
statistics to stats collector */

The word "the" there makes it proper English. Your copy-pasting
should have kept the existing good wording in the other locations
rather than replace the existing location with the newly added
incorrect wording.

Thanks, I'll fix it.

This doesn't make sense:

* current WAL segment file to avoid loading stats collector.

Maybe "overloading" or "overwhelming"?

I see you removed the pgstat_send_wal(force) change. The rest of my
comments on the v6 patch still stand I believe.

Yes, "overloading" is right. Thanks.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

#26

David G. Johnston

david.g.johnston@gmail.com

almost 5 years ago

In reply to: Masahiro Ikeda (#25)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On Mon, Jan 25, 2021 at 11:56 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com>
wrote:

(wal_write)
The number of times WAL buffers were written out to disk via XLogWrite

Thanks.

I thought it's better to omit "The" and "XLogWrite" because other views'
description
omits "The" and there is no description of "XlogWrite" in the documents.
What do you think?

The documentation for WAL does get into the public API level of detail and
doing so here makes what this measures crystal clear. The potential
absence of sufficient detail elsewhere should be corrected instead of
making this description more vague. Specifically, probably XLogWrite
should be added to the WAL overview as part of this update and probably
even have the descriptive section of the documentation note that the number
of times that said function is executed is exposed as a counter in the wal
statistics table - thus closing the loop.

David J.

#27

Masahiro Ikeda

ikedamsh@oss.nttdata.com

almost 5 years ago

In reply to: David G. Johnston (#26)

2 attachment(s)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021-01-27 00:14, David G. Johnston wrote:

On Mon, Jan 25, 2021 at 11:56 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

(wal_write)
The number of times WAL buffers were written out to disk via

XLogWrite

Thanks.

I thought it's better to omit "The" and "XLogWrite" because other
views'
description
omits "The" and there is no description of "XlogWrite" in the
documents.
What do you think?

The documentation for WAL does get into the public API level of detail
and doing so here makes what this measures crystal clear. The
potential absence of sufficient detail elsewhere should be corrected
instead of making this description more vague. Specifically, probably
XLogWrite should be added to the WAL overview as part of this update
and probably even have the descriptive section of the documentation
note that the number of times that said function is executed is
exposed as a counter in the wal statistics table - thus closing the
loop.

Thanks for your comments.

I added the descriptions in documents and separated the patch
into attached two patches. First is to add wal i/o activity statistics.
Second is to make the wal receiver report the wal statistics.

Please let me know if you have any comments.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachments:

v8-0001-Add-statistics-related-to-write-sync-wal-records.patchtext/x-diff; name=v8-0001-Add-statistics-related-to-write-sync-wal-records.patchDownload

From a93516dd9836345f97a6f4081597a7079dff4932 Mon Sep 17 00:00:00 2001
From: Masahiro Ikeda <ikedamsh@oss.nttdata.com>
Date: Fri, 29 Jan 2021 16:41:34 +0900
Subject: [PATCH 1/2] Add statistics related to write/sync wal records.

This patch adds following statistics to pg_stat_wal view
to track WAL I/O activity.

- the total number of times writing/syncing WAL data.
- the total amount of time spent writing/syncing WAL data.

Since to track I/O timing may leads significant overhead,
GUC parameter "track_wal_io_timing" is introduced.
Only if this is on, the I/O timing is measured.

The statistics related to sync are zero when "wal_sync_method"
is "open_datasync" or "open_sync", because it doesn't call each
sync method.

(This requires a catversion bump, as well as an update to
 PGSTAT_FILE_FORMAT_ID)

Author: Masahiro Ikeda
Reviewed-By: Japin Li, Hayato Kuroda, Masahiko Sawada, David Johnston
Discussion: https://postgr.es/m/0509ad67b585a5b86a83d445dfa75392@oss.nttdata.com
---
 doc/src/sgml/config.sgml                      | 21 +++++++
 doc/src/sgml/monitoring.sgml                  | 50 ++++++++++++++++
 doc/src/sgml/wal.sgml                         | 12 +++-
 src/backend/access/transam/xlog.c             | 59 ++++++++++++++++++-
 src/backend/catalog/system_views.sql          |  4 ++
 src/backend/postmaster/checkpointer.c         |  2 +-
 src/backend/postmaster/pgstat.c               |  4 ++
 src/backend/postmaster/walwriter.c            |  3 +
 src/backend/utils/adt/pgstatfuncs.c           | 24 +++++++-
 src/backend/utils/misc/guc.c                  |  9 +++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/access/xlog.h                     |  1 +
 src/include/catalog/pg_proc.dat               | 14 ++---
 src/include/pgstat.h                          |  8 +++
 src/test/regress/expected/rules.out           |  6 +-
 15 files changed, 202 insertions(+), 16 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f1037df5a9..3f1b3c1715 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7416,6 +7416,27 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-track-wal-io-timing" xreflabel="track_wal_io_timing">
+      <term><varname>track_wal_io_timing</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>track_wal_io_timing</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables timing of WAL I/O calls. This parameter is off by default,
+        because it will repeatedly query the operating system for
+        the current time, which may cause significant overhead on some
+        platforms.  You can use the <xref linkend="pgtesttiming"/> tool to
+        measure the overhead of timing on your system.
+        I/O timing information is
+        displayed in <link linkend="monitoring-pg-stat-wal-view">
+        <structname>pg_stat_wal</structname></link>.  Only superusers can
+        change this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-track-functions" xreflabel="track_functions">
       <term><varname>track_functions</varname> (<type>enum</type>)
       <indexterm>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index c602ee4427..2435f401db 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3487,6 +3487,56 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </para></entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of times WAL buffers were written out to disk via 
+       <function>XLogWrite</function>, which nomally called by an 
+       <function>XLogFlush</function> request(see <xref linkend="wal-configuration"/>)
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time spent writing WAL data to disk, excluding sync time unless 
+       <xref linkend="guc-wal-sync-method"/> is ether <literal>open_datasync</literal> or 
+       <literal>open_sync</literal>. Units are in milliseconds with microsecond resolution.
+       This is zero when <xref linkend="guc-track-wal-io-timing"/> is disabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of times WAL files were synced to disk via 
+       <function>issue_xlog_fsync</function>, which nomally called by an 
+       <function>XLogFlush</function> request(see <xref linkend="wal-configuration"/>),
+       while <xref linkend="guc-wal-sync-method"/> was set to one of the 
+       "sync at commit" options (i.e., <literal>fdatasync</literal>, 
+       <literal>fsync</literal>, or <literal>fsync_writethrough</literal>).
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time spent syncing WAL files to disk, in milliseconds with microsecond 
+       resolution. This requires setting <xref linkend="guc-wal-sync-method"/> to one of 
+       the "sync at commit" options (i.e., <literal>fdatasync</literal>, <literal>fsync</literal>,
+       or <literal>fsync_writethrough</literal>).
+       This is zero when <xref linkend="guc-track-wal-io-timing"/> is disabled.
+      </para></entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index 66de1ee2f8..984cb5764c 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -663,7 +663,9 @@
    the <acronym>WAL</acronym> buffers in shared memory. If there is no
    space for the new record, <function>XLogInsertRecord</function> will have
    to write (move to kernel cache) a few filled <acronym>WAL</acronym>
-   buffers. This is undesirable because <function>XLogInsertRecord</function>
+   buffers. The number of times it happend is counted as 
+   <literal>wal_buffers_full</literal> in <xref linkend="pg-stat-wal-view"/>.
+   This is undesirable because <function>XLogInsertRecord</function> 
    is used on every database low level modification (for example, row
    insertion) at a time when an exclusive lock is held on affected
    data pages, so the operation needs to be as fast as possible.  What
@@ -672,8 +674,12 @@
    time. Normally, <acronym>WAL</acronym> buffers should be written
    and flushed by an <function>XLogFlush</function> request, which is
    made, for the most part, at transaction commit time to ensure that
-   transaction records are flushed to permanent storage. On systems
-   with high log output, <function>XLogFlush</function> requests might
+   transaction records are flushed to permanent storage. 
+   <function>XLogFlush</function> calls <function>XLogWrite</function> to write 
+   and <function>issue_xlog_fsync</function> to flush them, which are counted as 
+   <literal>wal_write</literal> and <literal>wal_sync</literal> in 
+   <xref linkend="pg-stat-wal-view"/>. On systems with high log output, 
+   <function>XLogFlush</function> requests might
    not occur often enough to prevent <function>XLogInsertRecord</function>
    from having to do writes.  On such systems
    one should increase the number of <acronym>WAL</acronym> buffers by
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f03bd473e2..62ad246706 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -110,6 +110,7 @@ int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
 int			max_slot_wal_keep_size_mb = -1;
+bool		track_wal_io_timing = false;
 
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
@@ -2536,6 +2537,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			int			written;
+			instr_time	start;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
@@ -2544,9 +2546,30 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			do
 			{
 				errno = 0;
+
+				/* Measure I/O timing to write WAL data */
+				if (track_wal_io_timing)
+					INSTR_TIME_SET_CURRENT(start);
+
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 				written = pg_pwrite(openLogFile, from, nleft, startoffset);
 				pgstat_report_wait_end();
+
+				/* 
+				 * Increment the I/O timing and the number of times 
+				 * WAL data were written out to disk.
+				 */
+				if (track_wal_io_timing)
+				{
+					instr_time	duration;
+
+					INSTR_TIME_SET_CURRENT(duration);
+					INSTR_TIME_SUBTRACT(duration, start);
+					WalStats.m_wal_write_time = INSTR_TIME_GET_MICROSEC(duration);
+				}
+
+				WalStats.m_wal_write++;
+
 				if (written <= 0)
 				{
 					char		xlogfname[MAXFNAMELEN];
@@ -10542,7 +10565,22 @@ assign_xlog_sync_method(int new_sync_method, void *extra)
 void
 issue_xlog_fsync(int fd, XLogSegNo segno)
 {
-	char	   *msg = NULL;
+	char		*msg = NULL;
+	bool		issue_fsync = false;
+	instr_time	start;
+
+	/* Check whether the WAL file was synced to disk right now */
+	if (enableFsync &&
+		(sync_method == SYNC_METHOD_FSYNC ||
+		 sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
+		 sync_method == SYNC_METHOD_FDATASYNC))
+	{
+		/* Measure I/O timing to sync the WAL file */
+		if (track_wal_io_timing)
+			INSTR_TIME_SET_CURRENT(start);
+
+		issue_fsync = true;
+	}
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
 	switch (sync_method)
@@ -10587,6 +10625,25 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
 	}
 
 	pgstat_report_wait_end();
+
+	/*
+	 * Increment the I/O timing and the number of times WAL files were synced.
+	 *
+	 * Check whether the WAL file was synced to disk right now because
+	 * statistics must be incremented when syncing really occurred.
+	 */
+	if (issue_fsync)
+	{
+		if (track_wal_io_timing)
+		{
+			instr_time duration;
+
+			INSTR_TIME_SET_CURRENT(duration);
+			INSTR_TIME_SUBTRACT(duration, start);
+			WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
+		}
+		WalStats.m_wal_sync++;
+	}
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fa58afd9d7..b8ace4fc41 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1004,6 +1004,10 @@ CREATE VIEW pg_stat_wal AS
         w.wal_fpi,
         w.wal_bytes,
         w.wal_buffers_full,
+        w.wal_write,
+        w.wal_write_time,
+        w.wal_sync,
+        w.wal_sync_time,
         w.stats_reset
     FROM pg_stat_get_wal() w;
 
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 54a818bf61..7f0996ce3c 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -504,7 +504,7 @@ CheckpointerMain(void)
 		 */
 		pgstat_send_bgwriter();
 
-		/* Send WAL statistics to the stats collector. */
+		/* Send WAL statistics to the stats collector */
 		pgstat_send_wal();
 
 		/*
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f75b52719d..987bbd058d 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -6892,6 +6892,10 @@ pgstat_recv_wal(PgStat_MsgWal *msg, int len)
 	walStats.wal_fpi += msg->m_wal_fpi;
 	walStats.wal_bytes += msg->m_wal_bytes;
 	walStats.wal_buffers_full += msg->m_wal_buffers_full;
+	walStats.wal_write += msg->m_wal_write;
+	walStats.wal_write_time += msg->m_wal_write_time;
+	walStats.wal_sync += msg->m_wal_sync;
+	walStats.wal_sync_time += msg->m_wal_sync_time;
 }
 
 /* ----------
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index 4f1a8e356b..104cba4581 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -253,6 +253,9 @@ WalWriterMain(void)
 		else if (left_till_hibernate > 0)
 			left_till_hibernate--;
 
+		/* Send WAL statistics */
+		pgstat_send_wal();
+
 		/*
 		 * Sleep until we are signaled or WalWriterDelay has elapsed.  If we
 		 * haven't done anything useful for quite some time, lengthen the
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 62bff52638..7296ef04ff 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1799,7 +1799,7 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_wal(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_COLS	5
+#define PG_STAT_GET_WAL_COLS	9
 	TupleDesc	tupdesc;
 	Datum		values[PG_STAT_GET_WAL_COLS];
 	bool		nulls[PG_STAT_GET_WAL_COLS];
@@ -1820,7 +1820,15 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 					   NUMERICOID, -1, 0);
 	TupleDescInitEntry(tupdesc, (AttrNumber) 4, "wal_buffers_full",
 					   INT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 5, "stats_reset",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 5, "wal_write",
+					   INT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 6, "wal_write_time",
+					   FLOAT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 7, "wal_sync",
+					   INT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 8, "wal_sync_time",
+					   FLOAT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 9, "stats_reset",
 					   TIMESTAMPTZOID, -1, 0);
 
 	BlessTupleDesc(tupdesc);
@@ -1840,7 +1848,17 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 									Int32GetDatum(-1));
 
 	values[3] = Int64GetDatum(wal_stats->wal_buffers_full);
-	values[4] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
+	values[4] = Int64GetDatum(wal_stats->wal_write);
+
+	/* convert counter from microsec to millisec for display */
+	values[5] = Float8GetDatum((double) wal_stats->wal_write_time / 1000.0);
+
+	values[6] = Int64GetDatum(wal_stats->wal_sync);
+
+	/* convert counter from microsec to millisec for display */
+	values[7] = Float8GetDatum((double) wal_stats->wal_sync_time / 1000.0);
+
+	values[8] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
 
 	/* Returns the record as Datum */
 	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index eafdb1118e..3bdac8854d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1485,6 +1485,15 @@ static struct config_bool ConfigureNamesBool[] =
 		false,
 		NULL, NULL, NULL
 	},
+	{
+		{"track_wal_io_timing", PGC_SUSET, STATS_COLLECTOR,
+			gettext_noop("Collects timing statistics for WAL I/O activity."),
+			NULL
+		},
+		&track_wal_io_timing,
+		false,
+		NULL, NULL, NULL
+	},
 
 	{
 		{"update_process_title", PGC_SUSET, PROCESS_TITLE,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index bd57e917e1..a4c1d6650c 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -585,6 +585,7 @@
 #track_activities = on
 #track_counts = on
 #track_io_timing = off
+#track_wal_io_timing = off
 #track_functions = none			# none, pl, all
 #track_activity_query_size = 1024	# (change requires restart)
 #stats_temp_directory = 'pg_stat_tmp'
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 75ec1073bd..1e53d9d4ca 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -131,6 +131,7 @@ extern int	recovery_min_apply_delay;
 extern char *PrimaryConnInfo;
 extern char *PrimarySlotName;
 extern bool wal_receiver_create_temp_slot;
+extern bool track_wal_io_timing;
 
 /* indirectly set via GUC system */
 extern TransactionId recoveryTargetXid;
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index b5f52d4e4a..9fe8a72105 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5531,13 +5531,13 @@
   proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
   prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
 
-{ oid => '1136', descr => 'statistics: information about WAL activity',
-  proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
-  proparallel => 'r', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int8,int8,numeric,int8,timestamptz}',
-  proargmodes => '{o,o,o,o,o}',
-  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,stats_reset}',
-  prosrc => 'pg_stat_get_wal' },
+ { oid => '1136', descr => 'statistics: information about WAL activity',
+   proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
+   proparallel => 'r', prorettype => 'record', proargtypes => '',
+   proallargtypes => '{int8,int8,numeric,int8,int8,float8,int8,float8,timestamptz}',
+   proargmodes => '{o,o,o,o,o,o,o,o,o}',
+   proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,wal_write,wal_write_time,wal_sync,wal_sync_time,stats_reset}',
+   prosrc => 'pg_stat_get_wal' },
 
 { oid => '2306', descr => 'statistics: information about SLRU caches',
   proname => 'pg_stat_get_slru', prorows => '100', proisstrict => 'f',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 724068cf87..e689d27480 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -474,6 +474,10 @@ typedef struct PgStat_MsgWal
 	PgStat_Counter m_wal_fpi;
 	uint64		m_wal_bytes;
 	PgStat_Counter m_wal_buffers_full;
+	PgStat_Counter m_wal_write;
+	PgStat_Counter m_wal_write_time;	/* time spend writing wal records in micro seconds */
+	PgStat_Counter m_wal_sync;
+	PgStat_Counter m_wal_sync_time;		/* time spend syncing wal records in micro seconds */
 } PgStat_MsgWal;
 
 /* ----------
@@ -839,6 +843,10 @@ typedef struct PgStat_WalStats
 	PgStat_Counter wal_fpi;
 	uint64		wal_bytes;
 	PgStat_Counter wal_buffers_full;
+	PgStat_Counter wal_write;
+	PgStat_Counter wal_write_time;
+	PgStat_Counter wal_sync;
+	PgStat_Counter wal_sync_time;
 	TimestampTz stat_reset_timestamp;
 } PgStat_WalStats;
 
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 6173473de9..bc3909fd17 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2158,8 +2158,12 @@ pg_stat_wal| SELECT w.wal_records,
     w.wal_fpi,
     w.wal_bytes,
     w.wal_buffers_full,
+    w.wal_write,
+    w.wal_write_time,
+    w.wal_sync,
+    w.wal_sync_time,
     w.stats_reset
-   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, stats_reset);
+   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, wal_write, wal_write_time, wal_sync, wal_sync_time, stats_reset);
 pg_stat_wal_receiver| SELECT s.pid,
     s.status,
     s.receive_start_lsn,
-- 
2.25.1

v8-0002-Makes-the-wal-receiver-report-WAL-statistics.patchtext/x-diff; name=v8-0002-Makes-the-wal-receiver-report-WAL-statistics.patchDownload

From b8b2c5d911b6ed1686b0c144d7ae3cb581a3ed6b Mon Sep 17 00:00:00 2001
From: Masahiro Ikeda <ikedamsh@oss.nttdata.com>
Date: Fri, 29 Jan 2021 16:46:30 +0900
Subject: [PATCH 2/2] Makes the wal receiver report WAL statistics

This patch makes the WAL receiver report WAL statistics
and fundamentally changes how the stas collector's behaves
with regards to that function and how it interacts with
the WAL receiver.

Author: Masahiro Ikeda
Reviewed-By: Japin Li, Hayato Kuroda, Masahiko Sawada, David Johnston
Discussion: https://postgr.es/m/0509ad67b585a5b86a83d445dfa75392@oss.nttdata.com
---
 doc/src/sgml/monitoring.sgml          |  3 ++-
 src/backend/replication/walreceiver.c | 37 +++++++++++++++++++++++++++
 2 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 2435f401db..da48e6f946 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3494,7 +3494,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       <para>
        Number of times WAL buffers were written out to disk via 
        <function>XLogWrite</function>, which nomally called by an 
-       <function>XLogFlush</function> request(see <xref linkend="wal-configuration"/>)
+       <function>XLogFlush</function> request(see <xref linkend="wal-configuration"/>),
+       or WAL data written out to disk by WAL receiver.
       </para></entry>
      </row>
 
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 723f513d8b..5433f4ccca 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -485,7 +485,11 @@ WalReceiverMain(void)
 
 				/* Check if we need to exit the streaming loop. */
 				if (endofwal)
+				{
+					/* Send WAL statistics to the stats collector */
+					pgstat_send_wal();
 					break;
+				}
 
 				/*
 				 * Ideally we would reuse a WaitEventSet object repeatedly
@@ -550,8 +554,13 @@ WalReceiverMain(void)
 														wal_receiver_timeout);
 
 						if (now >= timeout)
+						{
+							/* Send WAL statistics to the stats collector before terminating */
+							pgstat_send_wal();
+
 							ereport(ERROR,
 									(errmsg("terminating walreceiver due to timeout")));
+						}
 
 						/*
 						 * We didn't receive anything new, for half of
@@ -874,6 +883,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 	while (nbytes > 0)
 	{
 		int			segbytes;
+		instr_time	start;
 
 		if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo, wal_segment_size))
 		{
@@ -910,6 +920,13 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 					XLogArchiveForceDone(xlogfname);
 				else
 					XLogArchiveNotify(xlogfname);
+
+				/*
+				 * Send WAL statistics to the stats collector when finishing the
+				 * current WAL segment file to avoid overloading it.
+				 */
+				pgstat_send_wal();
+
 			}
 			recvFile = -1;
 
@@ -931,7 +948,27 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 		/* OK to write the logs */
 		errno = 0;
 
+		/* Measure I/O timing to write WAL data */
+		if (track_wal_io_timing)
+			INSTR_TIME_SET_CURRENT(start);
+
 		byteswritten = pg_pwrite(recvFile, buf, segbytes, (off_t) startoff);
+
+		/* 
+		 * Increment the I/O timing and the number of times 
+		 * WAL data were written out to disk.
+		 */
+		if (track_wal_io_timing)
+		{
+			instr_time	duration;
+
+			INSTR_TIME_SET_CURRENT(duration);
+			INSTR_TIME_SUBTRACT(duration, start);
+			WalStats.m_wal_write_time = INSTR_TIME_GET_MICROSEC(duration);
+		}
+
+		WalStats.m_wal_write++;
+
 		if (byteswritten <= 0)
 		{
 			char		xlogfname[MAXFNAMELEN];
-- 
2.25.1

#28

Masahiro Ikeda

ikedamsh@oss.nttdata.com

almost 5 years ago

In reply to: Masahiro Ikeda (#27)

2 attachment(s)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

I pgindented the patches.

Regards
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachments:

v9-0001-Add-statistics-related-to-write-sync-wal-records.patchtext/x-diff; name=v9-0001-Add-statistics-related-to-write-sync-wal-records.patchDownload

From 47f436d7e423ece33a25adebf4265eac02e575f3 Mon Sep 17 00:00:00 2001
From: Masahiro Ikeda <ikedamsh@oss.nttdata.com>
Date: Fri, 29 Jan 2021 16:41:34 +0900
Subject: [PATCH 1/2] Add statistics related to write/sync wal records.

This patch adds following statistics to pg_stat_wal view
to track WAL I/O activity.

- the total number of times writing/syncing WAL data.
- the total amount of time spent writing/syncing WAL data.

Since to track I/O timing may leads significant overhead,
GUC parameter "track_wal_io_timing" is introduced.
Only if this is on, the I/O timing is measured.

The statistics related to sync are zero when "wal_sync_method"
is "open_datasync" or "open_sync", because it doesn't call each
sync method.

(This requires a catversion bump, as well as an update to
 PGSTAT_FILE_FORMAT_ID)

Author: Masahiro Ikeda
Reviewed-By: Japin Li, Hayato Kuroda, Masahiko Sawada, David Johnston
Discussion: https://postgr.es/m/0509ad67b585a5b86a83d445dfa75392@oss.nttdata.com
---
 doc/src/sgml/config.sgml                      | 21 +++++++
 doc/src/sgml/monitoring.sgml                  | 50 ++++++++++++++++
 doc/src/sgml/wal.sgml                         | 12 +++-
 src/backend/access/transam/xlog.c             | 57 +++++++++++++++++++
 src/backend/catalog/system_views.sql          |  4 ++
 src/backend/postmaster/checkpointer.c         |  2 +-
 src/backend/postmaster/pgstat.c               |  4 ++
 src/backend/postmaster/walwriter.c            |  3 +
 src/backend/utils/adt/pgstatfuncs.c           | 24 +++++++-
 src/backend/utils/misc/guc.c                  |  9 +++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/access/xlog.h                     |  1 +
 src/include/catalog/pg_proc.dat               |  6 +-
 src/include/pgstat.h                          | 10 ++++
 src/test/regress/expected/rules.out           |  6 +-
 15 files changed, 199 insertions(+), 11 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5ef1c7ad3c..4bdc341141 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7418,6 +7418,27 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-track-wal-io-timing" xreflabel="track_wal_io_timing">
+      <term><varname>track_wal_io_timing</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>track_wal_io_timing</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables timing of WAL I/O calls. This parameter is off by default,
+        because it will repeatedly query the operating system for
+        the current time, which may cause significant overhead on some
+        platforms.  You can use the <xref linkend="pgtesttiming"/> tool to
+        measure the overhead of timing on your system.
+        I/O timing information is
+        displayed in <link linkend="monitoring-pg-stat-wal-view">
+        <structname>pg_stat_wal</structname></link>.  Only superusers can
+        change this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-track-functions" xreflabel="track_functions">
       <term><varname>track_functions</varname> (<type>enum</type>)
       <indexterm>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index c602ee4427..2435f401db 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3487,6 +3487,56 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </para></entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of times WAL buffers were written out to disk via 
+       <function>XLogWrite</function>, which nomally called by an 
+       <function>XLogFlush</function> request(see <xref linkend="wal-configuration"/>)
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time spent writing WAL data to disk, excluding sync time unless 
+       <xref linkend="guc-wal-sync-method"/> is ether <literal>open_datasync</literal> or 
+       <literal>open_sync</literal>. Units are in milliseconds with microsecond resolution.
+       This is zero when <xref linkend="guc-track-wal-io-timing"/> is disabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of times WAL files were synced to disk via 
+       <function>issue_xlog_fsync</function>, which nomally called by an 
+       <function>XLogFlush</function> request(see <xref linkend="wal-configuration"/>),
+       while <xref linkend="guc-wal-sync-method"/> was set to one of the 
+       "sync at commit" options (i.e., <literal>fdatasync</literal>, 
+       <literal>fsync</literal>, or <literal>fsync_writethrough</literal>).
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time spent syncing WAL files to disk, in milliseconds with microsecond 
+       resolution. This requires setting <xref linkend="guc-wal-sync-method"/> to one of 
+       the "sync at commit" options (i.e., <literal>fdatasync</literal>, <literal>fsync</literal>,
+       or <literal>fsync_writethrough</literal>).
+       This is zero when <xref linkend="guc-track-wal-io-timing"/> is disabled.
+      </para></entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index 66de1ee2f8..984cb5764c 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -663,7 +663,9 @@
    the <acronym>WAL</acronym> buffers in shared memory. If there is no
    space for the new record, <function>XLogInsertRecord</function> will have
    to write (move to kernel cache) a few filled <acronym>WAL</acronym>
-   buffers. This is undesirable because <function>XLogInsertRecord</function>
+   buffers. The number of times it happend is counted as 
+   <literal>wal_buffers_full</literal> in <xref linkend="pg-stat-wal-view"/>.
+   This is undesirable because <function>XLogInsertRecord</function> 
    is used on every database low level modification (for example, row
    insertion) at a time when an exclusive lock is held on affected
    data pages, so the operation needs to be as fast as possible.  What
@@ -672,8 +674,12 @@
    time. Normally, <acronym>WAL</acronym> buffers should be written
    and flushed by an <function>XLogFlush</function> request, which is
    made, for the most part, at transaction commit time to ensure that
-   transaction records are flushed to permanent storage. On systems
-   with high log output, <function>XLogFlush</function> requests might
+   transaction records are flushed to permanent storage. 
+   <function>XLogFlush</function> calls <function>XLogWrite</function> to write 
+   and <function>issue_xlog_fsync</function> to flush them, which are counted as 
+   <literal>wal_write</literal> and <literal>wal_sync</literal> in 
+   <xref linkend="pg-stat-wal-view"/>. On systems with high log output, 
+   <function>XLogFlush</function> requests might
    not occur often enough to prevent <function>XLogInsertRecord</function>
    from having to do writes.  On such systems
    one should increase the number of <acronym>WAL</acronym> buffers by
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index eda322b910..c396ff4090 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -110,6 +110,7 @@ int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
 int			max_slot_wal_keep_size_mb = -1;
+bool		track_wal_io_timing = false;
 
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
@@ -2536,6 +2537,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			int			written;
+			instr_time	start;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
@@ -2544,9 +2546,30 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			do
 			{
 				errno = 0;
+
+				/* Measure I/O timing to write WAL data */
+				if (track_wal_io_timing)
+					INSTR_TIME_SET_CURRENT(start);
+
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 				written = pg_pwrite(openLogFile, from, nleft, startoffset);
 				pgstat_report_wait_end();
+
+				/*
+				 * Increment the I/O timing and the number of times WAL data
+				 * were written out to disk.
+				 */
+				if (track_wal_io_timing)
+				{
+					instr_time	duration;
+
+					INSTR_TIME_SET_CURRENT(duration);
+					INSTR_TIME_SUBTRACT(duration, start);
+					WalStats.m_wal_write_time = INSTR_TIME_GET_MICROSEC(duration);
+				}
+
+				WalStats.m_wal_write++;
+
 				if (written <= 0)
 				{
 					char		xlogfname[MAXFNAMELEN];
@@ -10545,6 +10568,21 @@ void
 issue_xlog_fsync(int fd, XLogSegNo segno)
 {
 	char	   *msg = NULL;
+	bool		issue_fsync = false;
+	instr_time	start;
+
+	/* Check whether the WAL file was synced to disk right now */
+	if (enableFsync &&
+		(sync_method == SYNC_METHOD_FSYNC ||
+		 sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
+		 sync_method == SYNC_METHOD_FDATASYNC))
+	{
+		/* Measure I/O timing to sync the WAL file */
+		if (track_wal_io_timing)
+			INSTR_TIME_SET_CURRENT(start);
+
+		issue_fsync = true;
+	}
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
 	switch (sync_method)
@@ -10589,6 +10627,25 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
 	}
 
 	pgstat_report_wait_end();
+
+	/*
+	 * Increment the I/O timing and the number of times WAL files were synced.
+	 *
+	 * Check whether the WAL file was synced to disk right now because
+	 * statistics must be incremented when syncing really occurred.
+	 */
+	if (issue_fsync)
+	{
+		if (track_wal_io_timing)
+		{
+			instr_time	duration;
+
+			INSTR_TIME_SET_CURRENT(duration);
+			INSTR_TIME_SUBTRACT(duration, start);
+			WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
+		}
+		WalStats.m_wal_sync++;
+	}
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fa58afd9d7..b8ace4fc41 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1004,6 +1004,10 @@ CREATE VIEW pg_stat_wal AS
         w.wal_fpi,
         w.wal_bytes,
         w.wal_buffers_full,
+        w.wal_write,
+        w.wal_write_time,
+        w.wal_sync,
+        w.wal_sync_time,
         w.stats_reset
     FROM pg_stat_get_wal() w;
 
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 54a818bf61..7f0996ce3c 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -504,7 +504,7 @@ CheckpointerMain(void)
 		 */
 		pgstat_send_bgwriter();
 
-		/* Send WAL statistics to the stats collector. */
+		/* Send WAL statistics to the stats collector */
 		pgstat_send_wal();
 
 		/*
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f75b52719d..987bbd058d 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -6892,6 +6892,10 @@ pgstat_recv_wal(PgStat_MsgWal *msg, int len)
 	walStats.wal_fpi += msg->m_wal_fpi;
 	walStats.wal_bytes += msg->m_wal_bytes;
 	walStats.wal_buffers_full += msg->m_wal_buffers_full;
+	walStats.wal_write += msg->m_wal_write;
+	walStats.wal_write_time += msg->m_wal_write_time;
+	walStats.wal_sync += msg->m_wal_sync;
+	walStats.wal_sync_time += msg->m_wal_sync_time;
 }
 
 /* ----------
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index 4f1a8e356b..104cba4581 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -253,6 +253,9 @@ WalWriterMain(void)
 		else if (left_till_hibernate > 0)
 			left_till_hibernate--;
 
+		/* Send WAL statistics */
+		pgstat_send_wal();
+
 		/*
 		 * Sleep until we are signaled or WalWriterDelay has elapsed.  If we
 		 * haven't done anything useful for quite some time, lengthen the
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 62bff52638..7296ef04ff 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1799,7 +1799,7 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_wal(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_COLS	5
+#define PG_STAT_GET_WAL_COLS	9
 	TupleDesc	tupdesc;
 	Datum		values[PG_STAT_GET_WAL_COLS];
 	bool		nulls[PG_STAT_GET_WAL_COLS];
@@ -1820,7 +1820,15 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 					   NUMERICOID, -1, 0);
 	TupleDescInitEntry(tupdesc, (AttrNumber) 4, "wal_buffers_full",
 					   INT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 5, "stats_reset",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 5, "wal_write",
+					   INT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 6, "wal_write_time",
+					   FLOAT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 7, "wal_sync",
+					   INT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 8, "wal_sync_time",
+					   FLOAT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 9, "stats_reset",
 					   TIMESTAMPTZOID, -1, 0);
 
 	BlessTupleDesc(tupdesc);
@@ -1840,7 +1848,17 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 									Int32GetDatum(-1));
 
 	values[3] = Int64GetDatum(wal_stats->wal_buffers_full);
-	values[4] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
+	values[4] = Int64GetDatum(wal_stats->wal_write);
+
+	/* convert counter from microsec to millisec for display */
+	values[5] = Float8GetDatum((double) wal_stats->wal_write_time / 1000.0);
+
+	values[6] = Int64GetDatum(wal_stats->wal_sync);
+
+	/* convert counter from microsec to millisec for display */
+	values[7] = Float8GetDatum((double) wal_stats->wal_sync_time / 1000.0);
+
+	values[8] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
 
 	/* Returns the record as Datum */
 	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 8735e36174..9a58922b8f 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1485,6 +1485,15 @@ static struct config_bool ConfigureNamesBool[] =
 		false,
 		NULL, NULL, NULL
 	},
+	{
+		{"track_wal_io_timing", PGC_SUSET, STATS_COLLECTOR,
+			gettext_noop("Collects timing statistics for WAL I/O activity."),
+			NULL
+		},
+		&track_wal_io_timing,
+		false,
+		NULL, NULL, NULL
+	},
 
 	{
 		{"update_process_title", PGC_SUSET, PROCESS_TITLE,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index bd57e917e1..a4c1d6650c 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -585,6 +585,7 @@
 #track_activities = on
 #track_counts = on
 #track_io_timing = off
+#track_wal_io_timing = off
 #track_functions = none			# none, pl, all
 #track_activity_query_size = 1024	# (change requires restart)
 #stats_temp_directory = 'pg_stat_tmp'
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 75ec1073bd..1e53d9d4ca 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -131,6 +131,7 @@ extern int	recovery_min_apply_delay;
 extern char *PrimaryConnInfo;
 extern char *PrimarySlotName;
 extern bool wal_receiver_create_temp_slot;
+extern bool track_wal_io_timing;
 
 /* indirectly set via GUC system */
 extern TransactionId recoveryTargetXid;
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index ada55e7ad5..6962ffeef2 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5543,9 +5543,9 @@
 { oid => '1136', descr => 'statistics: information about WAL activity',
   proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
   proparallel => 'r', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int8,int8,numeric,int8,timestamptz}',
-  proargmodes => '{o,o,o,o,o}',
-  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,stats_reset}',
+  proallargtypes => '{int8,int8,numeric,int8,int8,float8,int8,float8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o}',
+  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,wal_write,wal_write_time,wal_sync,wal_sync_time,stats_reset}',
   prosrc => 'pg_stat_get_wal' },
 
 { oid => '2306', descr => 'statistics: information about SLRU caches',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 724068cf87..000bb14c0b 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -474,6 +474,12 @@ typedef struct PgStat_MsgWal
 	PgStat_Counter m_wal_fpi;
 	uint64		m_wal_bytes;
 	PgStat_Counter m_wal_buffers_full;
+	PgStat_Counter m_wal_write;
+	PgStat_Counter m_wal_write_time;	/* time spend writing wal records in
+										 * micro seconds */
+	PgStat_Counter m_wal_sync;
+	PgStat_Counter m_wal_sync_time; /* time spend syncing wal records in micro
+									 * seconds */
 } PgStat_MsgWal;
 
 /* ----------
@@ -839,6 +845,10 @@ typedef struct PgStat_WalStats
 	PgStat_Counter wal_fpi;
 	uint64		wal_bytes;
 	PgStat_Counter wal_buffers_full;
+	PgStat_Counter wal_write;
+	PgStat_Counter wal_write_time;
+	PgStat_Counter wal_sync;
+	PgStat_Counter wal_sync_time;
 	TimestampTz stat_reset_timestamp;
 } PgStat_WalStats;
 
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 6173473de9..bc3909fd17 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2158,8 +2158,12 @@ pg_stat_wal| SELECT w.wal_records,
     w.wal_fpi,
     w.wal_bytes,
     w.wal_buffers_full,
+    w.wal_write,
+    w.wal_write_time,
+    w.wal_sync,
+    w.wal_sync_time,
     w.stats_reset
-   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, stats_reset);
+   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, wal_write, wal_write_time, wal_sync, wal_sync_time, stats_reset);
 pg_stat_wal_receiver| SELECT s.pid,
     s.status,
     s.receive_start_lsn,
-- 
2.25.1

v9-0002-Makes-the-wal-receiver-report-WAL-statistics.patchtext/x-diff; name=v9-0002-Makes-the-wal-receiver-report-WAL-statistics.patchDownload

From 0914ef57c75e68a391ff7330e4d9bafaffec35e8 Mon Sep 17 00:00:00 2001
From: Masahiro Ikeda <ikedamsh@oss.nttdata.com>
Date: Fri, 29 Jan 2021 16:46:30 +0900
Subject: [PATCH 2/2] Makes the wal receiver report WAL statistics

This patch makes the WAL receiver report WAL statistics
and fundamentally changes how the stas collector's behaves
with regards to that function and how it interacts with
the WAL receiver.

Author: Masahiro Ikeda
Reviewed-By: Japin Li, Hayato Kuroda, Masahiko Sawada, David Johnston
Discussion: https://postgr.es/m/0509ad67b585a5b86a83d445dfa75392@oss.nttdata.com
---
 doc/src/sgml/monitoring.sgml          |  3 +-
 src/backend/replication/walreceiver.c | 40 +++++++++++++++++++++++++++
 2 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 2435f401db..da48e6f946 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3494,7 +3494,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       <para>
        Number of times WAL buffers were written out to disk via 
        <function>XLogWrite</function>, which nomally called by an 
-       <function>XLogFlush</function> request(see <xref linkend="wal-configuration"/>)
+       <function>XLogFlush</function> request(see <xref linkend="wal-configuration"/>),
+       or WAL data written out to disk by WAL receiver.
       </para></entry>
      </row>
 
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index eaf5ec9a72..73435b616c 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -485,7 +485,11 @@ WalReceiverMain(void)
 
 				/* Check if we need to exit the streaming loop. */
 				if (endofwal)
+				{
+					/* Send WAL statistics to the stats collector */
+					pgstat_send_wal();
 					break;
+				}
 
 				/*
 				 * Ideally we would reuse a WaitEventSet object repeatedly
@@ -550,8 +554,16 @@ WalReceiverMain(void)
 														wal_receiver_timeout);
 
 						if (now >= timeout)
+						{
+							/*
+							 * Send WAL statistics to the stats collector
+							 * before terminating
+							 */
+							pgstat_send_wal();
+
 							ereport(ERROR,
 									(errmsg("terminating walreceiver due to timeout")));
+						}
 
 						/*
 						 * We didn't receive anything new, for half of
@@ -874,6 +886,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 	while (nbytes > 0)
 	{
 		int			segbytes;
+		instr_time	start;
 
 		if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo, wal_segment_size))
 		{
@@ -910,6 +923,13 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 					XLogArchiveForceDone(xlogfname);
 				else
 					XLogArchiveNotify(xlogfname);
+
+				/*
+				 * Send WAL statistics to the stats collector when finishing
+				 * the current WAL segment file to avoid overloading it.
+				 */
+				pgstat_send_wal();
+
 			}
 			recvFile = -1;
 
@@ -931,7 +951,27 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 		/* OK to write the logs */
 		errno = 0;
 
+		/* Measure I/O timing to write WAL data */
+		if (track_wal_io_timing)
+			INSTR_TIME_SET_CURRENT(start);
+
 		byteswritten = pg_pwrite(recvFile, buf, segbytes, (off_t) startoff);
+
+		/*
+		 * Increment the I/O timing and the number of times WAL data were
+		 * written out to disk.
+		 */
+		if (track_wal_io_timing)
+		{
+			instr_time	duration;
+
+			INSTR_TIME_SET_CURRENT(duration);
+			INSTR_TIME_SUBTRACT(duration, start);
+			WalStats.m_wal_write_time = INSTR_TIME_GET_MICROSEC(duration);
+		}
+
+		WalStats.m_wal_write++;
+
 		if (byteswritten <= 0)
 		{
 			char		xlogfname[MAXFNAMELEN];
-- 
2.25.1

#29

Fujii Masao

masao.fujii@oss.nttdata.com

almost 5 years ago

In reply to: Masahiro Ikeda (#28)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021/02/05 8:45, Masahiro Ikeda wrote:

I pgindented the patches.

Thanks for updating the patches!

+       <function>XLogWrite</function>, which nomally called by an
+       <function>issue_xlog_fsync</function>, which nomally called by an

Typo: "nomally" should be "normally"?

+       <function>XLogFlush</function> request(see <xref linkend="wal-configuration"/>)
+       <function>XLogFlush</function> request(see <xref linkend="wal-configuration"/>),

Isn't it better to add a space character just after "request"?

+					INSTR_TIME_SET_CURRENT(duration);
+					INSTR_TIME_SUBTRACT(duration, start);
+					WalStats.m_wal_write_time = INSTR_TIME_GET_MICROSEC(duration);

If several cycles happen in the do-while loop, m_wal_write_time should be
updated with the sum of "duration" in those cycles instead of "duration"
in the last cycle? If yes, "+=" should be used instead of "=" when updating
m_wal_write_time?

+			INSTR_TIME_SET_CURRENT(duration);
+			INSTR_TIME_SUBTRACT(duration, start);
+			WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);

Also "=" should be "+=" in the above?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#30

Fujii Masao

masao.fujii@oss.nttdata.com

almost 5 years ago

In reply to: Fujii Masao (#29)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021/02/08 13:01, Fujii Masao wrote:

On 2021/02/05 8:45, Masahiro Ikeda wrote:

I pgindented the patches.

Thanks for updating the patches!

+ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ <function>XLogWrite</function>, which nomally called by an
+ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ <function>issue_xlog_fsync</function>, which nomally called by an

Typo: "nomally" should be "normally"?

+ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ <function>XLogFlush</function> request(see <xref linkend="wal-configuration"/>)
+ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ <function>XLogFlush</function> request(see <xref linkend="wal-configuration"/>),

Isn't it better to add a space character just after "request"?

+ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ INSTR_TIME_SET_CURRENT(duration);
+ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ INSTR_TIME_SUBTRACT(duration, start);
+ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ WalStats.m_wal_write_time = INSTR_TIME_GET_MICROSEC(duration);

+ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ INSTR_TIME_SET_CURRENT(duration);
+ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ INSTR_TIME_SUBTRACT(duration, start);
+ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);

Also "=" should be "+=" in the above?

+ /* Send WAL statistics */
+ pgstat_send_wal();

This may cause overhead in WAL-writing by walwriter because it's called
every cycles even when walwriter needs to write more WAL next cycle
(don't need to sleep on WaitLatch)? If this is right, pgstat_send_wal()
should be called only when WaitLatch() returns with WL_TIMEOUT?

-       <function>XLogFlush</function> request(see <xref linkend="wal-configuration"/>)
+       <function>XLogFlush</function> request(see <xref linkend="wal-configuration"/>),
+       or WAL data written out to disk by WAL receiver.

So regarding walreceiver, only wal_write, wal_write_time, wal_sync, and
wal_sync_time are updated even while the other values are not. Isn't this
confusing to users? If so, what about reporting those walreceiver stats in
pg_stat_wal_receiver?

  				if (endofwal)
+				{
+					/* Send WAL statistics to the stats collector */
+					pgstat_send_wal();
  					break;

You added pgstat_send_wal() so that it's called in some cases where
walreceiver exits. But ISTM that there are other walreceiver-exit cases.
For example, in the case where SIGTERM is received. Instead,
pgstat_send_wal() should be called in WalRcvDie() for those all cases?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#31

David G. Johnston

david.g.johnston@gmail.com

almost 5 years ago

In reply to: Masahiro Ikeda (#28)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com>
wrote:

I pgindented the patches.

... <function>XLogWrite</function>, which is invoked during an
<function>XLogFlush</function> request (see ...). This is also incremented
by the WAL receiver during replication.

("which normally called" should be "which is normally called" or "which
normally is called" if you want to keep true to the original)

You missed the adding the space before an opening parenthesis here and
elsewhere (probably copy-paste)

is ether -> is either

"This parameter is off by default as it will repeatedly query the operating
system..."
", because" -> "as"

wal_write_time and the sync items also need the note: "This is also
incremented by the WAL receiver during replication."

"The number of times it happened..." -> " (the tally of this event is
reported in wal_buffers_full in....) This is undesirable because ..."

I notice that the patch for WAL receiver doesn't require explicitly
computing the sync statistics but does require computing the write
statistics. This is because of the presence of issue_xlog_fsync but
absence of an equivalent pg_xlog_pwrite. Additionally, I observe that the
XLogWrite code path calls pgstat_report_wait_*() while the WAL receiver
path does not. It seems technically straight-forward to refactor here to
avoid the almost-duplicated logic in the two places, though I suspect there
may be a trade-off for not adding another function call to the stack given
the importance of WAL processing (though that seems marginalized compared
to the cost of actually writing the WAL). Or, as Fujii noted, go the other
way and don't have any shared code between the two but instead implement
the WAL receiver one to use pg_stat_wal_receiver instead. In either case,
this half-and-half implementation seems undesirable.

David J.

#32

Masahiro Ikeda

ikedamsh@oss.nttdata.com

almost 5 years ago

In reply to: Fujii Masao (#29)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021-02-08 13:01, Fujii Masao wrote:

On 2021/02/05 8:45, Masahiro Ikeda wrote:

I pgindented the patches.

Thanks for updating the patches!

Thanks for checking the patches.

+       <function>XLogWrite</function>, which nomally called by an
+       <function>issue_xlog_fsync</function>, which nomally called by 
an

Typo: "nomally" should be "normally"?

Yes, I'll fix it.

+       <function>XLogFlush</function> request(see <xref
linkend="wal-configuration"/>)
+       <function>XLogFlush</function> request(see <xref
linkend="wal-configuration"/>),

Isn't it better to add a space character just after "request"?

Thanks, I'll fix it.

+					INSTR_TIME_SET_CURRENT(duration);
+					INSTR_TIME_SUBTRACT(duration, start);
+					WalStats.m_wal_write_time = INSTR_TIME_GET_MICROSEC(duration);

If several cycles happen in the do-while loop, m_wal_write_time should 
be
updated with the sum of "duration" in those cycles instead of 
"duration"
in the last cycle? If yes, "+=" should be used instead of "=" when 
updating
m_wal_write_time?
+			INSTR_TIME_SET_CURRENT(duration);
+			INSTR_TIME_SUBTRACT(duration, start);
+			WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);

Also "=" should be "+=" in the above?

Yes, they are my mistake when changing the unit from milliseconds to
microseconds.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

#33

Masahiro Ikeda

ikedamsh@oss.nttdata.com

almost 5 years ago

In reply to: Fujii Masao (#30)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021-02-08 14:26, Fujii Masao wrote:

On 2021/02/08 13:01, Fujii Masao wrote:
On 2021/02/05 8:45, Masahiro Ikeda wrote:

I pgindented the patches.

Thanks for updating the patches!
+       <function>XLogWrite</function>, which nomally called by an
+       <function>issue_xlog_fsync</function>, which nomally called by 
an
Typo: "nomally" should be "normally"?
+       <function>XLogFlush</function> request(see <xref 
linkend="wal-configuration"/>)
+       <function>XLogFlush</function> request(see <xref 
linkend="wal-configuration"/>),
Isn't it better to add a space character just after "request"?
+                    INSTR_TIME_SET_CURRENT(duration);
+                    INSTR_TIME_SUBTRACT(duration, start);
+                    WalStats.m_wal_write_time = 
INSTR_TIME_GET_MICROSEC(duration);
If several cycles happen in the do-while loop, m_wal_write_time should
be
updated with the sum of "duration" in those cycles instead of
"duration"
in the last cycle? If yes, "+=" should be used instead of "=" when
updating
m_wal_write_time?
+            INSTR_TIME_SET_CURRENT(duration);
+            INSTR_TIME_SUBTRACT(duration, start);
+            WalStats.m_wal_sync_time = 
INSTR_TIME_GET_MICROSEC(duration);
Also "=" should be "+=" in the above?
+ /* Send WAL statistics */
+ pgstat_send_wal();

This may cause overhead in WAL-writing by walwriter because it's called
every cycles even when walwriter needs to write more WAL next cycle
(don't need to sleep on WaitLatch)? If this is right, pgstat_send_wal()
should be called only when WaitLatch() returns with WL_TIMEOUT?

Thanks, I didn't notice that.
I'll fix it.

-       <function>XLogFlush</function> request(see <xref
linkend="wal-configuration"/>)
+       <function>XLogFlush</function> request(see <xref
linkend="wal-configuration"/>),
+       or WAL data written out to disk by WAL receiver.
So regarding walreceiver, only wal_write, wal_write_time, wal_sync, and
wal_sync_time are updated even while the other values are not. Isn't
this
confusing to users? If so, what about reporting those walreceiver stats
in
pg_stat_wal_receiver?

OK, I'll add new infrastructure code to interect with wal receiver
and stats collector and show the stats in pg_stat_wal_receiver.

if (endofwal)
+				{
+					/* Send WAL statistics to the stats collector */
+					pgstat_send_wal();
break;
You added pgstat_send_wal() so that it's called in some cases where
walreceiver exits. But ISTM that there are other walreceiver-exit
cases.
For example, in the case where SIGTERM is received. Instead,
pgstat_send_wal() should be called in WalRcvDie() for those all cases?

Thanks, I forgot the case.
I'll fix it.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

#34

Masahiro Ikeda

ikedamsh@oss.nttdata.com

almost 5 years ago

In reply to: David G. Johnston (#31)

2 attachment(s)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021-02-10 00:51, David G. Johnston wrote:

On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

I pgindented the patches.

... <function>XLogWrite</function>, which is invoked during an
<function>XLogFlush</function> request (see ...). This is also
incremented by the WAL receiver during replication.

("which normally called" should be "which is normally called" or
"which normally is called" if you want to keep true to the original)
You missed the adding the space before an opening parenthesis here and
elsewhere (probably copy-paste)

is ether -> is either
"This parameter is off by default as it will repeatedly query the
operating system..."
", because" -> "as"

Thanks, I fixed them.

wal_write_time and the sync items also need the note: "This is also
incremented by the WAL receiver during replication."

I skipped changing it since I separated the stats for the WAL receiver
in pg_stat_wal_receiver.

"The number of times it happened..." -> " (the tally of this event is
reported in wal_buffers_full in....) This is undesirable because ..."

Thanks, I fixed it.

I notice that the patch for WAL receiver doesn't require explicitly
computing the sync statistics but does require computing the write
statistics. This is because of the presence of issue_xlog_fsync but
absence of an equivalent pg_xlog_pwrite. Additionally, I observe that
the XLogWrite code path calls pgstat_report_wait_*() while the WAL
receiver path does not. It seems technically straight-forward to
refactor here to avoid the almost-duplicated logic in the two places,
though I suspect there may be a trade-off for not adding another
function call to the stack given the importance of WAL processing
(though that seems marginalized compared to the cost of actually
writing the WAL). Or, as Fujii noted, go the other way and don't have
any shared code between the two but instead implement the WAL receiver
one to use pg_stat_wal_receiver instead. In either case, this
half-and-half implementation seems undesirable.

OK, as Fujii-san mentioned, I separated the WAL receiver stats.
(v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)

I added the infrastructure code to communicate the WAL receiver stats
messages between the WAL receiver and the stats collector, and
the stats for WAL receiver is counted in pg_stat_wal_receiver.
What do you think?

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachments:

v10-0001-Add-statistics-related-to-write-sync-wal-records.patchtext/x-diff; name=v10-0001-Add-statistics-related-to-write-sync-wal-records.patchDownload

From 4ecbc467f88a2f923b3bc2eb6fe2c4f7725c02be Mon Sep 17 00:00:00 2001
From: Masahiro Ikeda <ikedamsh@oss.nttdata.com>
Date: Fri, 12 Feb 2021 11:19:59 +0900
Subject: [PATCH 1/2] Add statistics related to write/sync wal records.

This patch adds following statistics to pg_stat_wal view
to track WAL I/O activity by backends and background processes
except WAL receiver.

- the total number of times writing/syncing WAL data.
- the total amount of time spent writing/syncing WAL data.

Since to track I/O timing may leads significant overhead,
GUC parameter "track_wal_io_timing" is introduced.
Only if this is on, the I/O timing is measured.

The statistics related to sync are zero when "wal_sync_method"
is "open_datasync" or "open_sync", because it doesn't call each
sync method.

(This requires a catversion bump, as well as an update to
 PGSTAT_FILE_FORMAT_ID)

Author: Masahiro Ikeda
Reviewed-By: Japin Li, Hayato Kuroda, Masahiko Sawada, David
Johnston, Fujii Masao
Discussion:
https://postgr.es/m/0509ad67b585a5b86a83d445dfa75392@oss.nttdata.co
---
 doc/src/sgml/config.sgml                      | 23 +++++++-
 doc/src/sgml/monitoring.sgml                  | 50 ++++++++++++++++
 doc/src/sgml/wal.sgml                         | 12 +++-
 src/backend/access/transam/xlog.c             | 57 +++++++++++++++++++
 src/backend/catalog/system_views.sql          |  4 ++
 src/backend/postmaster/pgstat.c               |  4 ++
 src/backend/postmaster/walwriter.c            | 16 ++++--
 src/backend/utils/adt/pgstatfuncs.c           | 24 +++++++-
 src/backend/utils/misc/guc.c                  |  9 +++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/access/xlog.h                     |  1 +
 src/include/catalog/pg_proc.dat               |  6 +-
 src/include/pgstat.h                          | 10 ++++
 src/test/regress/expected/rules.out           |  6 +-
 14 files changed, 208 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5ef1c7ad3c..0bb03d0717 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7404,7 +7404,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       <listitem>
        <para>
         Enables timing of database I/O calls.  This parameter is off by
-        default, because it will repeatedly query the operating system for
+        default, as it will repeatedly query the operating system for
         the current time, which may cause significant overhead on some
         platforms.  You can use the <xref linkend="pgtesttiming"/> tool to
         measure the overhead of timing on your system.
@@ -7418,6 +7418,27 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-track-wal-io-timing" xreflabel="track_wal_io_timing">
+      <term><varname>track_wal_io_timing</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>track_wal_io_timing</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables timing of WAL I/O calls. This parameter is off by default,
+        as it will repeatedly query the operating system for
+        the current time, which may cause significant overhead on some
+        platforms.  You can use the <xref linkend="pgtesttiming"/> tool to
+        measure the overhead of timing on your system.
+        I/O timing information is
+        displayed in <link linkend="monitoring-pg-stat-wal-view">
+        <structname>pg_stat_wal</structname></link>.  Only superusers can
+        change this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-track-functions" xreflabel="track_functions">
       <term><varname>track_functions</varname> (<type>enum</type>)
       <indexterm>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index c602ee4427..8617449977 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3487,6 +3487,56 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </para></entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of times WAL buffers were written out to disk via
+       <function>XLogWrite</function>, which is invoked during an
+       <function>XLogFlush</function> request (see <xref linkend="wal-configuration"/>).
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time spent writing WAL data to disk, excluding sync time unless 
+       <xref linkend="guc-wal-sync-method"/> is either <literal>open_datasync</literal> or 
+       <literal>open_sync</literal>. Units are in milliseconds with microsecond resolution.
+       This is zero when <xref linkend="guc-track-wal-io-timing"/> is disabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of times WAL files were synced to disk via 
+       <function>issue_xlog_fsync</function>, which normally called by an 
+       <function>XLogFlush</function> request (see <xref linkend="wal-configuration"/>),
+       while <xref linkend="guc-wal-sync-method"/> was set to one of the 
+       "sync at commit" options (i.e., <literal>fdatasync</literal>, 
+       <literal>fsync</literal>, or <literal>fsync_writethrough</literal>).
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time spent syncing WAL files to disk, in milliseconds with microsecond 
+       resolution. This requires setting <xref linkend="guc-wal-sync-method"/> to one of 
+       the "sync at commit" options (i.e., <literal>fdatasync</literal>, <literal>fsync</literal>,
+       or <literal>fsync_writethrough</literal>).
+       This is zero when <xref linkend="guc-track-wal-io-timing"/> is disabled.
+      </para></entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index 66de1ee2f8..bca0180d73 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -663,7 +663,9 @@
    the <acronym>WAL</acronym> buffers in shared memory. If there is no
    space for the new record, <function>XLogInsertRecord</function> will have
    to write (move to kernel cache) a few filled <acronym>WAL</acronym>
-   buffers. This is undesirable because <function>XLogInsertRecord</function>
+   buffers (the tally of this event is reported in 
+   <literal>wal_buffers_full</literal> in <xref linkend="pg-stat-wal-view"/>).
+   This is undesirable because <function>XLogInsertRecord</function> 
    is used on every database low level modification (for example, row
    insertion) at a time when an exclusive lock is held on affected
    data pages, so the operation needs to be as fast as possible.  What
@@ -672,8 +674,12 @@
    time. Normally, <acronym>WAL</acronym> buffers should be written
    and flushed by an <function>XLogFlush</function> request, which is
    made, for the most part, at transaction commit time to ensure that
-   transaction records are flushed to permanent storage. On systems
-   with high log output, <function>XLogFlush</function> requests might
+   transaction records are flushed to permanent storage. 
+   <function>XLogFlush</function> calls <function>XLogWrite</function> to write 
+   and <function>issue_xlog_fsync</function> to flush them, which are counted as 
+   <literal>wal_write</literal> and <literal>wal_sync</literal> in 
+   <xref linkend="pg-stat-wal-view"/>. On systems with high log output, 
+   <function>XLogFlush</function> requests might
    not occur often enough to prevent <function>XLogInsertRecord</function>
    from having to do writes.  On such systems
    one should increase the number of <acronym>WAL</acronym> buffers by
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 9e03ca842d..904018ed46 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -110,6 +110,7 @@ int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
 int			max_slot_wal_keep_size_mb = -1;
+bool		track_wal_io_timing = false;
 
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
@@ -2536,6 +2537,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			int			written;
+			instr_time	start;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
@@ -2544,9 +2546,30 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			do
 			{
 				errno = 0;
+
+				/* Measure I/O timing to write WAL data */
+				if (track_wal_io_timing)
+					INSTR_TIME_SET_CURRENT(start);
+
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 				written = pg_pwrite(openLogFile, from, nleft, startoffset);
 				pgstat_report_wait_end();
+
+				/*
+				 * Increment the I/O timing and the number of times WAL data
+				 * were written out to disk.
+				 */
+				if (track_wal_io_timing)
+				{
+					instr_time	duration;
+
+					INSTR_TIME_SET_CURRENT(duration);
+					INSTR_TIME_SUBTRACT(duration, start);
+					WalStats.m_wal_write_time += INSTR_TIME_GET_MICROSEC(duration);
+				}
+
+				WalStats.m_wal_write++;
+
 				if (written <= 0)
 				{
 					char		xlogfname[MAXFNAMELEN];
@@ -10544,6 +10567,21 @@ void
 issue_xlog_fsync(int fd, XLogSegNo segno)
 {
 	char	   *msg = NULL;
+	bool		issue_fsync = false;
+	instr_time	start;
+
+	/* Check whether the WAL file was synced to disk right now */
+	if (enableFsync &&
+		(sync_method == SYNC_METHOD_FSYNC ||
+		 sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
+		 sync_method == SYNC_METHOD_FDATASYNC))
+	{
+		/* Measure I/O timing to sync the WAL file */
+		if (track_wal_io_timing)
+			INSTR_TIME_SET_CURRENT(start);
+
+		issue_fsync = true;
+	}
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
 	switch (sync_method)
@@ -10588,6 +10626,25 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
 	}
 
 	pgstat_report_wait_end();
+
+	/*
+	 * Increment the I/O timing and the number of times WAL files were synced.
+	 *
+	 * Check whether the WAL file was synced to disk right now because
+	 * statistics must be incremented when syncing really occurred.
+	 */
+	if (issue_fsync)
+	{
+		if (track_wal_io_timing)
+		{
+			instr_time	duration;
+
+			INSTR_TIME_SET_CURRENT(duration);
+			INSTR_TIME_SUBTRACT(duration, start);
+			WalStats.m_wal_sync_time += INSTR_TIME_GET_MICROSEC(duration);
+		}
+		WalStats.m_wal_sync++;
+	}
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fa58afd9d7..b8ace4fc41 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1004,6 +1004,10 @@ CREATE VIEW pg_stat_wal AS
         w.wal_fpi,
         w.wal_bytes,
         w.wal_buffers_full,
+        w.wal_write,
+        w.wal_write_time,
+        w.wal_sync,
+        w.wal_sync_time,
         w.stats_reset
     FROM pg_stat_get_wal() w;
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f75b52719d..987bbd058d 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -6892,6 +6892,10 @@ pgstat_recv_wal(PgStat_MsgWal *msg, int len)
 	walStats.wal_fpi += msg->m_wal_fpi;
 	walStats.wal_bytes += msg->m_wal_bytes;
 	walStats.wal_buffers_full += msg->m_wal_buffers_full;
+	walStats.wal_write += msg->m_wal_write;
+	walStats.wal_write_time += msg->m_wal_write_time;
+	walStats.wal_sync += msg->m_wal_sync;
+	walStats.wal_sync_time += msg->m_wal_sync_time;
 }
 
 /* ----------
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index 4f1a8e356b..08fa7032c0 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -223,6 +223,7 @@ WalWriterMain(void)
 	for (;;)
 	{
 		long		cur_timeout;
+		int			rc;
 
 		/*
 		 * Advertise whether we might hibernate in this cycle.  We do this
@@ -263,9 +264,16 @@ WalWriterMain(void)
 		else
 			cur_timeout = WalWriterDelay * HIBERNATE_FACTOR;
 
-		(void) WaitLatch(MyLatch,
-						 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
-						 cur_timeout,
-						 WAIT_EVENT_WAL_WRITER_MAIN);
+		rc = WaitLatch(MyLatch,
+					   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+					   cur_timeout,
+					   WAIT_EVENT_WAL_WRITER_MAIN);
+
+		/*
+		 * Send WAL statistics only if WalWriterDelay has elapsed to minimize
+		 * the overhead in WAL-writing.
+		 */
+		if (rc & WL_TIMEOUT)
+			pgstat_send_wal();
 	}
 }
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 62bff52638..7296ef04ff 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1799,7 +1799,7 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_wal(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_COLS	5
+#define PG_STAT_GET_WAL_COLS	9
 	TupleDesc	tupdesc;
 	Datum		values[PG_STAT_GET_WAL_COLS];
 	bool		nulls[PG_STAT_GET_WAL_COLS];
@@ -1820,7 +1820,15 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 					   NUMERICOID, -1, 0);
 	TupleDescInitEntry(tupdesc, (AttrNumber) 4, "wal_buffers_full",
 					   INT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 5, "stats_reset",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 5, "wal_write",
+					   INT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 6, "wal_write_time",
+					   FLOAT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 7, "wal_sync",
+					   INT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 8, "wal_sync_time",
+					   FLOAT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 9, "stats_reset",
 					   TIMESTAMPTZOID, -1, 0);
 
 	BlessTupleDesc(tupdesc);
@@ -1840,7 +1848,17 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 									Int32GetDatum(-1));
 
 	values[3] = Int64GetDatum(wal_stats->wal_buffers_full);
-	values[4] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
+	values[4] = Int64GetDatum(wal_stats->wal_write);
+
+	/* convert counter from microsec to millisec for display */
+	values[5] = Float8GetDatum((double) wal_stats->wal_write_time / 1000.0);
+
+	values[6] = Int64GetDatum(wal_stats->wal_sync);
+
+	/* convert counter from microsec to millisec for display */
+	values[7] = Float8GetDatum((double) wal_stats->wal_sync_time / 1000.0);
+
+	values[8] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
 
 	/* Returns the record as Datum */
 	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 8735e36174..9a58922b8f 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1485,6 +1485,15 @@ static struct config_bool ConfigureNamesBool[] =
 		false,
 		NULL, NULL, NULL
 	},
+	{
+		{"track_wal_io_timing", PGC_SUSET, STATS_COLLECTOR,
+			gettext_noop("Collects timing statistics for WAL I/O activity."),
+			NULL
+		},
+		&track_wal_io_timing,
+		false,
+		NULL, NULL, NULL
+	},
 
 	{
 		{"update_process_title", PGC_SUSET, PROCESS_TITLE,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index bd57e917e1..a4c1d6650c 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -585,6 +585,7 @@
 #track_activities = on
 #track_counts = on
 #track_io_timing = off
+#track_wal_io_timing = off
 #track_functions = none			# none, pl, all
 #track_activity_query_size = 1024	# (change requires restart)
 #stats_temp_directory = 'pg_stat_tmp'
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 75ec1073bd..1e53d9d4ca 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -131,6 +131,7 @@ extern int	recovery_min_apply_delay;
 extern char *PrimaryConnInfo;
 extern char *PrimarySlotName;
 extern bool wal_receiver_create_temp_slot;
+extern bool track_wal_io_timing;
 
 /* indirectly set via GUC system */
 extern TransactionId recoveryTargetXid;
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index ada55e7ad5..6962ffeef2 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5543,9 +5543,9 @@
 { oid => '1136', descr => 'statistics: information about WAL activity',
   proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
   proparallel => 'r', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int8,int8,numeric,int8,timestamptz}',
-  proargmodes => '{o,o,o,o,o}',
-  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,stats_reset}',
+  proallargtypes => '{int8,int8,numeric,int8,int8,float8,int8,float8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o}',
+  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,wal_write,wal_write_time,wal_sync,wal_sync_time,stats_reset}',
   prosrc => 'pg_stat_get_wal' },
 
 { oid => '2306', descr => 'statistics: information about SLRU caches',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 724068cf87..000bb14c0b 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -474,6 +474,12 @@ typedef struct PgStat_MsgWal
 	PgStat_Counter m_wal_fpi;
 	uint64		m_wal_bytes;
 	PgStat_Counter m_wal_buffers_full;
+	PgStat_Counter m_wal_write;
+	PgStat_Counter m_wal_write_time;	/* time spend writing wal records in
+										 * micro seconds */
+	PgStat_Counter m_wal_sync;
+	PgStat_Counter m_wal_sync_time; /* time spend syncing wal records in micro
+									 * seconds */
 } PgStat_MsgWal;
 
 /* ----------
@@ -839,6 +845,10 @@ typedef struct PgStat_WalStats
 	PgStat_Counter wal_fpi;
 	uint64		wal_bytes;
 	PgStat_Counter wal_buffers_full;
+	PgStat_Counter wal_write;
+	PgStat_Counter wal_write_time;
+	PgStat_Counter wal_sync;
+	PgStat_Counter wal_sync_time;
 	TimestampTz stat_reset_timestamp;
 } PgStat_WalStats;
 
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index b632d9f2ea..2ad074f6a0 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2158,8 +2158,12 @@ pg_stat_wal| SELECT w.wal_records,
     w.wal_fpi,
     w.wal_bytes,
     w.wal_buffers_full,
+    w.wal_write,
+    w.wal_write_time,
+    w.wal_sync,
+    w.wal_sync_time,
     w.stats_reset
-   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, stats_reset);
+   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, wal_write, wal_write_time, wal_sync, wal_sync_time, stats_reset);
 pg_stat_wal_receiver| SELECT s.pid,
     s.status,
     s.receive_start_lsn,
-- 
2.25.1

v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patchtext/x-diff; name=v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patchDownload

From f2d721d5bedcb916e12331fff316712af3176e0f Mon Sep 17 00:00:00 2001
From: Masahiro Ikeda <ikedamsh@oss.nttdata.com>
Date: Mon, 15 Feb 2021 10:26:01 +0900
Subject: [PATCH 2/2] Makes the wal receiver report WAL statistics

This patch makes the WAL receiver report WAL statistics
and fundamentally changes how the stats collector's behaves
with regards to that function and how it interacts with
the WAL receiver.

(This requires a catversion bump, as well as an update to
 PGSTAT_FILE_FORMAT_ID)

Author: Masahiro Ikeda
Reviewed-By: Japin Li, Hayato Kuroda, Masahiko Sawada, David Johnston,
Fujii Masao
Discussion:
https://postgr.es/m/0509ad67b585a5b86a83d445dfa75392@oss.nttdata.com
---
 doc/src/sgml/monitoring.sgml          |  86 +++++++++++++++++--
 src/backend/access/transam/xlog.c     |  12 ++-
 src/backend/catalog/system_views.sql  |   7 +-
 src/backend/postmaster/pgstat.c       | 114 +++++++++++++++++++++++++-
 src/backend/replication/walreceiver.c |  46 +++++++++++
 src/include/catalog/pg_proc.dat       |   6 +-
 src/include/pgstat.h                  |  40 ++++++++-
 src/test/regress/expected/rules.out   |   9 +-
 8 files changed, 304 insertions(+), 16 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 8617449977..731db6d29b 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2862,6 +2862,62 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
        with security-sensitive fields obfuscated.
       </para></entry>
      </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of times WAL data were written out to disk by WAL receiver.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time spent writing WAL data to disk by WAL receiver,
+       excluding sync time unless <xref linkend="guc-wal-sync-method"/> is either
+       <literal>open_datasync</literal> or <literal>open_sync</literal>. 
+       Units are in milliseconds with microsecond resolution.
+       This is zero when <xref linkend="guc-track-wal-io-timing"/> is disabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of times WAL files were synced to disk by WAL receiver.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time spent syncing WAL files to disk by WAL receiver, 
+       in milliseconds with microsecond resolution. This requires setting 
+       <xref linkend="guc-wal-sync-method"/> to one of the "sync at commit" 
+       options (i.e., <literal>fdatasync</literal>, <literal>fsync</literal>,
+       or <literal>fsync_writethrough</literal>).
+       This is zero when <xref linkend="guc-track-wal-io-timing"/> is disabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+      </para>
+      <para>
+       Time at which these statistics counters (i.e. <literal>wal_write</literal>,
+       <literal>wal_write_time</literal>, <literal>wal_sync</literal>, and
+       <literal>wal_sync_time</literal> ) were last reset.
+      </para></entry>
+     </row>
     </tbody>
    </tgroup>
   </table>
@@ -3492,9 +3548,13 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
        <structfield>wal_write</structfield> <type>bigint</type>
       </para>
       <para>
-       Number of times WAL buffers were written out to disk via
+       Number of times WAL buffers were written out to disk by backends and 
+       background processes except WAL receiver via
        <function>XLogWrite</function>, which is invoked during an
        <function>XLogFlush</function> request (see <xref linkend="wal-configuration"/>).
+       The same statistics for WAL receiver is counted in
+       <link linkend="monitoring-pg-stat-wal-receiver-view">
+       <structname>pg_stat_wal_receiver</structname></link>.
       </para></entry>
      </row>
 
@@ -3503,10 +3563,14 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
        <structfield>wal_write_time</structfield> <type>double precision</type>
       </para>
       <para>
-       Total amount of time spent writing WAL data to disk, excluding sync time unless 
+       Total amount of time spent writing WAL data to disk by backends and 
+       background processes except WAL receiver, excluding sync time unless 
        <xref linkend="guc-wal-sync-method"/> is either <literal>open_datasync</literal> or 
        <literal>open_sync</literal>. Units are in milliseconds with microsecond resolution.
        This is zero when <xref linkend="guc-track-wal-io-timing"/> is disabled.
+       The same statistics for WAL receiver is counted in
+       <link linkend="monitoring-pg-stat-wal-receiver-view">
+       <structname>pg_stat_wal_receiver</structname></link>.
       </para></entry>
      </row>
 
@@ -3515,12 +3579,16 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
        <structfield>wal_sync</structfield> <type>bigint</type>
       </para>
       <para>
-       Number of times WAL files were synced to disk via 
+       Number of times WAL files were synced to disk by backends and 
+       background processes except WAL receiver, via 
        <function>issue_xlog_fsync</function>, which normally called by an 
        <function>XLogFlush</function> request (see <xref linkend="wal-configuration"/>),
        while <xref linkend="guc-wal-sync-method"/> was set to one of the 
        "sync at commit" options (i.e., <literal>fdatasync</literal>, 
        <literal>fsync</literal>, or <literal>fsync_writethrough</literal>).
+       The same statistics for WAL receiver is counted in
+       <link linkend="monitoring-pg-stat-wal-receiver-view">
+       <structname>pg_stat_wal_receiver</structname></link>.
       </para></entry>
      </row>
 
@@ -3529,11 +3597,15 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
        <structfield>wal_sync_time</structfield> <type>double precision</type>
       </para>
       <para>
-       Total amount of time spent syncing WAL files to disk, in milliseconds with microsecond 
+       Total amount of time spent syncing WAL files to disk by backends and 
+       background processes except WAL receiver, in milliseconds with microsecond 
        resolution. This requires setting <xref linkend="guc-wal-sync-method"/> to one of 
        the "sync at commit" options (i.e., <literal>fdatasync</literal>, <literal>fsync</literal>,
        or <literal>fsync_writethrough</literal>).
        This is zero when <xref linkend="guc-track-wal-io-timing"/> is disabled.
+       The same statistics for WAL receiver is counted in
+       <link linkend="monitoring-pg-stat-wal-receiver-view">
+       <structname>pg_stat_wal_receiver</structname></link>.
       </para></entry>
      </row>
 
@@ -5017,7 +5089,11 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
         all the counters shown in
         the <structname>pg_stat_bgwriter</structname>
         view, <literal>archiver</literal> to reset all the counters shown in
-        the <structname>pg_stat_archiver</structname> view or <literal>wal</literal>
+        the <structname>pg_stat_archiver</structname> view, <literal>walreceiver</literal>
+        to reset all the counters (i.e. <literal>wal_write</literal>,
+        <literal>wal_write_time</literal>, <literal>wal_sync</literal>, and
+        <literal>wal_sync_time</literal> ) shown in the 
+        <structname>pg_stat_wal_receiver</structname> view, or <literal>wal</literal>
         to reset all the counters shown in the <structname>pg_stat_wal</structname> view.
        </para>
        <para>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 904018ed46..ba467bf41c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -10641,9 +10641,17 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
 
 			INSTR_TIME_SET_CURRENT(duration);
 			INSTR_TIME_SUBTRACT(duration, start);
-			WalStats.m_wal_sync_time += INSTR_TIME_GET_MICROSEC(duration);
+
+			if (AmWalReceiverProcess())
+				WalReceiverStats.m_wal_sync_time += INSTR_TIME_GET_MICROSEC(duration);
+			else
+				WalStats.m_wal_sync_time += INSTR_TIME_GET_MICROSEC(duration);
 		}
-		WalStats.m_wal_sync++;
+
+		if (AmWalReceiverProcess())
+			WalReceiverStats.m_wal_sync++;
+		else
+			WalStats.m_wal_sync++;
 	}
 }
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index b8ace4fc41..cf4e1f6355 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -837,7 +837,12 @@ CREATE VIEW pg_stat_wal_receiver AS
             s.slot_name,
             s.sender_host,
             s.sender_port,
-            s.conninfo
+            s.conninfo,
+            s.wal_write,
+            s.wal_write_time,
+            s.wal_sync,
+            s.wal_sync_time,
+            s.stats_reset
     FROM pg_stat_get_wal_receiver() s
     WHERE s.pid IS NOT NULL;
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 987bbd058d..1a98d1ac53 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -137,11 +137,12 @@ char	   *pgstat_stat_filename = NULL;
 char	   *pgstat_stat_tmpname = NULL;
 
 /*
- * BgWriter and WAL global statistics counters.
+ * BgWriter, WAL receiver and WAL global statistics counters.
  * Stored directly in a stats message structure so they can be sent
  * without needing to copy things around.  We assume these init to zeroes.
  */
 PgStat_MsgBgWriter BgWriterStats;
+PgStat_MsgWalReceiver WalReceiverStats;
 PgStat_MsgWal WalStats;
 
 /*
@@ -295,6 +296,7 @@ static int	localNumBackends = 0;
  */
 static PgStat_ArchiverStats archiverStats;
 static PgStat_GlobalStats globalStats;
+static PgStat_WalReceiverStats walReceiverStats;
 static PgStat_WalStats walStats;
 static PgStat_SLRUStats slruStats[SLRU_NUM_ELEMENTS];
 static PgStat_ReplSlotStats *replSlotStats;
@@ -375,6 +377,7 @@ static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
 static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
 static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
 static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
+static void pgstat_recv_walreceiver(PgStat_MsgWalReceiver * msg, int len);
 static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
 static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
 static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
@@ -1450,13 +1453,15 @@ pgstat_reset_shared_counters(const char *target)
 		msg.m_resettarget = RESET_ARCHIVER;
 	else if (strcmp(target, "bgwriter") == 0)
 		msg.m_resettarget = RESET_BGWRITER;
+	else if (strcmp(target, "walreceiver") == 0)
+		msg.m_resettarget = RESET_WALRECEIVER;
 	else if (strcmp(target, "wal") == 0)
 		msg.m_resettarget = RESET_WAL;
 	else
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 				 errmsg("unrecognized reset target: \"%s\"", target),
-				 errhint("Target must be \"archiver\", \"bgwriter\" or \"wal\".")));
+				 errhint("Target must be \"archiver\", \"bgwriter\", \"walreceiver\" or \"wal\".")));
 
 	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
 	pgstat_send(&msg, sizeof(msg));
@@ -2852,6 +2857,22 @@ pgstat_fetch_global(void)
 	return &globalStats;
 }
 
+/*
+ * ---------
+ * pgstat_fetch_stat_walreceiver() -
+ *
+ *	Support function for the SQL-callable pgstat* functions. Returns
+ *	a pointer to the WAL receiver statistics struct.
+ * ---------
+ */
+PgStat_WalReceiverStats *
+pgstat_fetch_stat_walreceiver(void)
+{
+	backend_read_statsfile();
+
+	return &walReceiverStats;
+}
+
 /*
  * ---------
  * pgstat_fetch_stat_wal() -
@@ -4666,6 +4687,39 @@ pgstat_send_bgwriter(void)
 	MemSet(&BgWriterStats, 0, sizeof(BgWriterStats));
 }
 
+/* ----------
+ * pgstat_send_walreceiver() -
+ *
+ *		Send wal receiver statistics to the collector
+ * ----------
+ */
+void
+pgstat_send_walreceiver(void)
+{
+	/* We assume this initializes to zeroes */
+	static const PgStat_MsgWalReceiver all_zeroes;
+
+	/*
+	 * This function can be called even if nothing at all has happened. In
+	 * this case, avoid sending a completely empty message to the stats
+	 * collector.
+	 */
+	if (memcmp(&WalReceiverStats, &all_zeroes, sizeof(PgStat_MsgWalReceiver)) == 0)
+		return;
+
+	/*
+	 * Prepare and send the message
+	 */
+	pgstat_setheader(&WalReceiverStats.m_hdr, PGSTAT_MTYPE_WALRECEIVER);
+	pgstat_send(&WalReceiverStats, sizeof(WalReceiverStats));
+
+	/*
+	 * Clear out the statistics buffer, so it can be re-used.
+	 */
+	MemSet(&WalReceiverStats, 0, sizeof(WalReceiverStats));
+}
+
+
 /* ----------
  * pgstat_send_wal() -
  *
@@ -4961,6 +5015,10 @@ PgstatCollectorMain(int argc, char *argv[])
 					pgstat_recv_bgwriter(&msg.msg_bgwriter, len);
 					break;
 
+				case PGSTAT_MTYPE_WALRECEIVER:
+					pgstat_recv_walreceiver(&msg.msg_walreceiver, len);
+					break;
+
 				case PGSTAT_MTYPE_WAL:
 					pgstat_recv_wal(&msg.msg_wal, len);
 					break;
@@ -5249,6 +5307,12 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
 	rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
 	(void) rc;					/* we'll check for error with ferror */
 
+	/*
+	 * Write WAL receiver stats struct
+	 */
+	rc = fwrite(&walReceiverStats, sizeof(walReceiverStats), 1, fpout);
+	(void) rc;					/* we'll check for error with ferror */
+
 	/*
 	 * Write WAL stats struct
 	 */
@@ -5532,6 +5596,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 	 */
 	memset(&globalStats, 0, sizeof(globalStats));
 	memset(&archiverStats, 0, sizeof(archiverStats));
+	memset(&walReceiverStats, 0, sizeof(walReceiverStats));
 	memset(&walStats, 0, sizeof(walStats));
 	memset(&slruStats, 0, sizeof(slruStats));
 
@@ -5541,6 +5606,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 	 */
 	globalStats.stat_reset_timestamp = GetCurrentTimestamp();
 	archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
+	walReceiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
 	walStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
 
 	/*
@@ -5617,6 +5683,17 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 		goto done;
 	}
 
+	/*
+	 * Read WAL receiver stats struct
+	 */
+	if (fread(&walReceiverStats, 1, sizeof(walReceiverStats), fpin) != sizeof(walReceiverStats))
+	{
+		ereport(pgStatRunningInCollector ? LOG : WARNING,
+				(errmsg("corrupted statistics file \"%s\"", statfile)));
+		memset(&walReceiverStats, 0, sizeof(walReceiverStats));
+		goto done;
+	}
+
 	/*
 	 * Read WAL stats struct
 	 */
@@ -5954,6 +6031,7 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
 	PgStat_StatDBEntry dbentry;
 	PgStat_GlobalStats myGlobalStats;
 	PgStat_ArchiverStats myArchiverStats;
+	PgStat_WalReceiverStats myWalReceiverStats;
 	PgStat_WalStats myWalStats;
 	PgStat_SLRUStats mySLRUStats[SLRU_NUM_ELEMENTS];
 	PgStat_ReplSlotStats myReplSlotStats;
@@ -6011,6 +6089,17 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
 		return false;
 	}
 
+	/*
+	 * Read WAL receiver stats struct
+	 */
+	if (fread(&myWalReceiverStats, 1, sizeof(myWalReceiverStats), fpin) != sizeof(myWalReceiverStats))
+	{
+		ereport(pgStatRunningInCollector ? LOG : WARNING,
+				(errmsg("corrupted statistics file \"%s\"", statfile)));
+		FreeFile(fpin);
+		return false;
+	}
+
 	/*
 	 * Read WAL stats struct
 	 */
@@ -6619,6 +6708,12 @@ pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
 		memset(&archiverStats, 0, sizeof(archiverStats));
 		archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
 	}
+	else if (msg->m_resettarget == RESET_WALRECEIVER)
+	{
+		/* Reset the WAL receiver statistics for the cluster. */
+		memset(&walReceiverStats, 0, sizeof(walReceiverStats));
+		walReceiverStats.stat_reset_timestamp = GetCurrentTimestamp();
+	}
 	else if (msg->m_resettarget == RESET_WAL)
 	{
 		/* Reset the WAL statistics for the cluster. */
@@ -6879,6 +6974,21 @@ pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
 	globalStats.buf_alloc += msg->m_buf_alloc;
 }
 
+/* ----------
+ * pgstat_recv_walreceiver() -
+ *
+ *	Process a WALRECEIVER message.
+ * ----------
+ */
+static void
+pgstat_recv_walreceiver(PgStat_MsgWalReceiver * msg, int len)
+{
+	walReceiverStats.wal_write += msg->m_wal_write;
+	walReceiverStats.wal_write_time += msg->m_wal_write_time;
+	walReceiverStats.wal_sync += msg->m_wal_sync;
+	walReceiverStats.wal_sync_time += msg->m_wal_sync_time;
+}
+
 /* ----------
  * pgstat_recv_wal() -
  *
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index eaf5ec9a72..f04e1d99e7 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -773,6 +773,9 @@ WalRcvDie(int code, Datum arg)
 	/* Ensure that all WAL records received are flushed to disk */
 	XLogWalRcvFlush(true);
 
+	/* Send WAL receiver statistics to the stats collector before terminating */
+	pgstat_send_walreceiver();
+
 	/* Mark ourselves inactive in shared memory */
 	SpinLockAcquire(&walrcv->mutex);
 	Assert(walrcv->walRcvState == WALRCV_STREAMING ||
@@ -874,6 +877,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 	while (nbytes > 0)
 	{
 		int			segbytes;
+		instr_time	start;
 
 		if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo, wal_segment_size))
 		{
@@ -910,6 +914,13 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 					XLogArchiveForceDone(xlogfname);
 				else
 					XLogArchiveNotify(xlogfname);
+
+				/*
+				 * Send WAL receiver statistics to the stats collector when
+				 * finishing the current WAL segment file to avoid overloading
+				 * it.
+				 */
+				pgstat_send_walreceiver();
 			}
 			recvFile = -1;
 
@@ -931,7 +942,27 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 		/* OK to write the logs */
 		errno = 0;
 
+		/* Measure I/O timing to write WAL data */
+		if (track_wal_io_timing)
+			INSTR_TIME_SET_CURRENT(start);
+
 		byteswritten = pg_pwrite(recvFile, buf, segbytes, (off_t) startoff);
+
+		/*
+		 * Increment the I/O timing and the number of times WAL data were
+		 * written out to disk.
+		 */
+		if (track_wal_io_timing)
+		{
+			instr_time	duration;
+
+			INSTR_TIME_SET_CURRENT(duration);
+			INSTR_TIME_SUBTRACT(duration, start);
+			WalReceiverStats.m_wal_write_time += INSTR_TIME_GET_MICROSEC(duration);
+		}
+
+		WalReceiverStats.m_wal_write++;
+
 		if (byteswritten <= 0)
 		{
 			char		xlogfname[MAXFNAMELEN];
@@ -1317,6 +1348,7 @@ pg_stat_get_wal_receiver(PG_FUNCTION_ARGS)
 	int			sender_port = 0;
 	char		slotname[NAMEDATALEN];
 	char		conninfo[MAXCONNINFO];
+	PgStat_WalReceiverStats *walreceiver_stats;
 
 	/* Take a lock to ensure value consistency */
 	SpinLockAcquire(&WalRcv->mutex);
@@ -1338,6 +1370,9 @@ pg_stat_get_wal_receiver(PG_FUNCTION_ARGS)
 	strlcpy(conninfo, (char *) WalRcv->conninfo, sizeof(conninfo));
 	SpinLockRelease(&WalRcv->mutex);
 
+	/* Get statistics about WAL receiver */
+	walreceiver_stats = pgstat_fetch_stat_walreceiver();
+
 	/*
 	 * No WAL receiver (or not ready yet), just return a tuple with NULL
 	 * values
@@ -1414,6 +1449,17 @@ pg_stat_get_wal_receiver(PG_FUNCTION_ARGS)
 			nulls[14] = true;
 		else
 			values[14] = CStringGetTextDatum(conninfo);
+
+		/* returns WAL I/O activity */
+		values[15] = Int64GetDatum(walreceiver_stats->wal_write);
+
+		/* convert counter from microsec to millisec for display */
+		values[16] = Float8GetDatum((double) walreceiver_stats->wal_write_time / 1000.0);
+		values[17] = Int64GetDatum(walreceiver_stats->wal_sync);
+
+		/* convert counter from microsec to millisec for display */
+		values[18] = Float8GetDatum((double) walreceiver_stats->wal_sync_time / 1000.0);
+		values[19] = TimestampTzGetDatum(walreceiver_stats->stat_reset_timestamp);
 	}
 
 	/* Returns the record as Datum */
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 6962ffeef2..f04ee4a434 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5272,9 +5272,9 @@
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
   proparallel => 'r', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,int4,pg_lsn,pg_lsn,int4,timestamptz,timestamptz,pg_lsn,timestamptz,text,text,int4,text}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,status,receive_start_lsn,receive_start_tli,written_lsn,flushed_lsn,received_tli,last_msg_send_time,last_msg_receipt_time,latest_end_lsn,latest_end_time,slot_name,sender_host,sender_port,conninfo}',
+  proallargtypes => '{int4,text,pg_lsn,int4,pg_lsn,pg_lsn,int4,timestamptz,timestamptz,pg_lsn,timestamptz,text,text,int4,text,int8,float8,int8,float8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,status,receive_start_lsn,receive_start_tli,written_lsn,flushed_lsn,received_tli,last_msg_send_time,last_msg_receipt_time,latest_end_lsn,latest_end_time,slot_name,sender_host,sender_port,conninfo,wal_write,wal_write_time,wal_sync,wal_sync_time,stats_reset}',
   prosrc => 'pg_stat_get_wal_receiver' },
 { oid => '8595', descr => 'statistics: information about replication slots',
   proname => 'pg_stat_get_replication_slots', prorows => '10',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 000bb14c0b..fb6bf3282f 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -72,6 +72,7 @@ typedef enum StatMsgType
 	PGSTAT_MTYPE_ANALYZE,
 	PGSTAT_MTYPE_ARCHIVER,
 	PGSTAT_MTYPE_BGWRITER,
+	PGSTAT_MTYPE_WALRECEIVER,
 	PGSTAT_MTYPE_WAL,
 	PGSTAT_MTYPE_SLRU,
 	PGSTAT_MTYPE_FUNCSTAT,
@@ -137,6 +138,7 @@ typedef enum PgStat_Shared_Reset_Target
 {
 	RESET_ARCHIVER,
 	RESET_BGWRITER,
+	RESET_WALRECEIVER,
 	RESET_WAL
 } PgStat_Shared_Reset_Target;
 
@@ -463,6 +465,21 @@ typedef struct PgStat_MsgBgWriter
 	PgStat_Counter m_checkpoint_sync_time;
 } PgStat_MsgBgWriter;
 
+/* ----------
+ * PgStat_MsgWalReceiver			Sent by wal receiver to update statistics.
+ * ----------
+ */
+typedef struct PgStat_MsgWalReceiver
+{
+	PgStat_MsgHdr m_hdr;
+	PgStat_Counter m_wal_write;
+	PgStat_Counter m_wal_write_time;	/* time spend writing wal records in
+										 * micro seconds */
+	PgStat_Counter m_wal_sync;
+	PgStat_Counter m_wal_sync_time; /* time spend syncing wal records in micro
+									 * seconds */
+}			PgStat_MsgWalReceiver;
+
 /* ----------
  * PgStat_MsgWal			Sent by backends and background processes to update WAL statistics.
  * ----------
@@ -677,6 +694,7 @@ typedef union PgStat_Msg
 	PgStat_MsgAnalyze msg_analyze;
 	PgStat_MsgArchiver msg_archiver;
 	PgStat_MsgBgWriter msg_bgwriter;
+	PgStat_MsgWalReceiver msg_walreceiver;
 	PgStat_MsgWal msg_wal;
 	PgStat_MsgSLRU msg_slru;
 	PgStat_MsgFuncstat msg_funcstat;
@@ -698,7 +716,7 @@ typedef union PgStat_Msg
  * ------------------------------------------------------------
  */
 
-#define PGSTAT_FILE_FORMAT_ID	0x01A5BCA0
+#define PGSTAT_FILE_FORMAT_ID	0x01A5BCA1
 
 /* ----------
  * PgStat_StatDBEntry			The collector's data per database
@@ -836,6 +854,18 @@ typedef struct PgStat_GlobalStats
 	TimestampTz stat_reset_timestamp;
 } PgStat_GlobalStats;
 
+/*
+ * WAL receiver statistics kept in the stats collector
+ */
+typedef struct PgStat_WalReceiverStats
+{
+	PgStat_Counter wal_write;
+	PgStat_Counter wal_write_time;
+	PgStat_Counter wal_sync;
+	PgStat_Counter wal_sync_time;
+	TimestampTz stat_reset_timestamp;
+}			PgStat_WalReceiverStats;
+
 /*
  * WAL statistics kept in the stats collector
  */
@@ -1387,8 +1417,14 @@ extern char *pgstat_stat_filename;
  */
 extern PgStat_MsgBgWriter BgWriterStats;
 
+/*
+ * WAL receiver statistics counter is updated by wal receiver
+ */
+extern PgStat_MsgWalReceiver WalReceiverStats;
+
 /*
  * WAL statistics counter is updated by backends and background processes
+ * excepting wal receiver because it's counted via WalReceiverStats.
  */
 extern PgStat_MsgWal WalStats;
 
@@ -1600,6 +1636,7 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
 
 extern void pgstat_send_archiver(const char *xlog, bool failed);
 extern void pgstat_send_bgwriter(void);
+extern void pgstat_send_walreceiver(void);
 extern void pgstat_send_wal(void);
 
 /* ----------
@@ -1615,6 +1652,7 @@ extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int	pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
+extern PgStat_WalReceiverStats * pgstat_fetch_stat_walreceiver(void);
 extern PgStat_WalStats *pgstat_fetch_stat_wal(void);
 extern PgStat_SLRUStats *pgstat_fetch_slru(void);
 extern PgStat_ReplSlotStats *pgstat_fetch_replslot(int *nslots_p);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 2ad074f6a0..1bf8e06347 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2178,8 +2178,13 @@ pg_stat_wal_receiver| SELECT s.pid,
     s.slot_name,
     s.sender_host,
     s.sender_port,
-    s.conninfo
-   FROM pg_stat_get_wal_receiver() s(pid, status, receive_start_lsn, receive_start_tli, written_lsn, flushed_lsn, received_tli, last_msg_send_time, last_msg_receipt_time, latest_end_lsn, latest_end_time, slot_name, sender_host, sender_port, conninfo)
+    s.conninfo,
+    s.wal_write,
+    s.wal_write_time,
+    s.wal_sync,
+    s.wal_sync_time,
+    s.stats_reset
+   FROM pg_stat_get_wal_receiver() s(pid, status, receive_start_lsn, receive_start_tli, written_lsn, flushed_lsn, received_tli, last_msg_send_time, last_msg_receipt_time, latest_end_lsn, latest_end_time, slot_name, sender_host, sender_port, conninfo, wal_write, wal_write_time, wal_sync, wal_sync_time, stats_reset)
   WHERE (s.pid IS NOT NULL);
 pg_stat_xact_all_tables| SELECT c.oid AS relid,
     n.nspname AS schemaname,
-- 
2.25.1

#35

Fujii Masao

masao.fujii@oss.nttdata.com

almost 5 years ago

In reply to: Masahiro Ikeda (#34)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021/02/15 11:59, Masahiro Ikeda wrote:

On 2021-02-10 00:51, David G. Johnston wrote:

On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

I pgindented the patches.

... <function>XLogWrite</function>, which is invoked during an
<function>XLogFlush</function> request (see ...).ï¿½ This is also
incremented by the WAL receiver during replication.

("which normally called" should be "which is normally called" or
"which normally is called" if you want to keep true to the original)
You missed the adding the space before an opening parenthesis here and
elsewhere (probably copy-paste)

is ether -> is either
"This parameter is off by default as it will repeatedly query the
operating system..."
", because" -> "as"

Thanks, I fixed them.

wal_write_time and the sync items also need the note: "This is also
incremented by the WAL receiver during replication."

I skipped changing it since I separated the stats for the WAL receiver
in pg_stat_wal_receiver.

"The number of times it happened..." -> " (the tally of this event is
reported in wal_buffers_full in....) This is undesirable because ..."

Thanks, I fixed it.

I notice that the patch for WAL receiver doesn't require explicitly
computing the sync statistics but does require computing the write
statistics.ï¿½ This is because of the presence of issue_xlog_fsync but
absence of an equivalent pg_xlog_pwrite.ï¿½ Additionally, I observe that
the XLogWrite code path calls pgstat_report_wait_*() while the WAL
receiver path does not.ï¿½ It seems technically straight-forward to
refactor here to avoid the almost-duplicated logic in the two places,
though I suspect there may be a trade-off for not adding another
function call to the stack given the importance of WAL processing
(though that seems marginalized compared to the cost of actually
writing the WAL).ï¿½ Or, as Fujii noted, go the other way and don't have
any shared code between the two but instead implement the WAL receiver
one to use pg_stat_wal_receiver instead.ï¿½ In either case, this
half-and-half implementation seems undesirable.

OK, as Fujii-san mentioned, I separated the WAL receiver stats.
(v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)

Thanks for updating the patches!

I added the infrastructure code to communicate the WAL receiver stats messages between the WAL receiver and the stats collector, and
the stats for WAL receiver is counted in pg_stat_wal_receiver.
What do you think?

On second thought, this idea seems not good. Because those stats are
collected between multiple walreceivers, but other values in
pg_stat_wal_receiver is only related to the walreceiver process running
at that moment. IOW, it seems strange that some values show dynamic
stats and the others show collected stats, even though they are in
the same view pg_stat_wal_receiver. Thought?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#36

Masahiro Ikeda

ikedamsh@oss.nttdata.com

almost 5 years ago

In reply to: Fujii Masao (#35)

2 attachment(s)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021-02-24 16:14, Fujii Masao wrote:

On 2021/02/15 11:59, Masahiro Ikeda wrote:

On 2021-02-10 00:51, David G. Johnston wrote:

On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

I pgindented the patches.

... <function>XLogWrite</function>, which is invoked during an
<function>XLogFlush</function> request (see ...). This is also
incremented by the WAL receiver during replication.

("which normally called" should be "which is normally called" or
"which normally is called" if you want to keep true to the original)
You missed the adding the space before an opening parenthesis here
and
elsewhere (probably copy-paste)

is ether -> is either
"This parameter is off by default as it will repeatedly query the
operating system..."
", because" -> "as"

Thanks, I fixed them.

wal_write_time and the sync items also need the note: "This is also
incremented by the WAL receiver during replication."

I skipped changing it since I separated the stats for the WAL receiver
in pg_stat_wal_receiver.

"The number of times it happened..." -> " (the tally of this event is
reported in wal_buffers_full in....) This is undesirable because ..."

Thanks, I fixed it.

I notice that the patch for WAL receiver doesn't require explicitly
computing the sync statistics but does require computing the write
statistics. This is because of the presence of issue_xlog_fsync but
absence of an equivalent pg_xlog_pwrite. Additionally, I observe
that
the XLogWrite code path calls pgstat_report_wait_*() while the WAL
receiver path does not. It seems technically straight-forward to
refactor here to avoid the almost-duplicated logic in the two places,
though I suspect there may be a trade-off for not adding another
function call to the stack given the importance of WAL processing
(though that seems marginalized compared to the cost of actually
writing the WAL). Or, as Fujii noted, go the other way and don't
have
any shared code between the two but instead implement the WAL
receiver
one to use pg_stat_wal_receiver instead. In either case, this
half-and-half implementation seems undesirable.

OK, as Fujii-san mentioned, I separated the WAL receiver stats.
(v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)

Thanks for updating the patches!

I added the infrastructure code to communicate the WAL receiver stats
messages between the WAL receiver and the stats collector, and
the stats for WAL receiver is counted in pg_stat_wal_receiver.
What do you think?

On second thought, this idea seems not good. Because those stats are
collected between multiple walreceivers, but other values in
pg_stat_wal_receiver is only related to the walreceiver process running
at that moment. IOW, it seems strange that some values show dynamic
stats and the others show collected stats, even though they are in
the same view pg_stat_wal_receiver. Thought?

OK, I fixed it.
The stats collected in the WAL receiver is exposed in pg_stat_wal view
in v11 patch.

I notice that the patch for WAL receiver doesn't require explicitly
computing the sync statistics but does require computing the write
statistics. This is because of the presence of issue_xlog_fsync but
absence of an equivalent pg_xlog_pwrite. Additionally, I observe that
the XLogWrite code path calls pgstat_report_wait_*() while the WAL
receiver path does not. It seems technically straight-forward to
refactor here to avoid the almost-duplicated logic in the two places,
though I suspect there may be a trade-off for not adding another
function call to the stack given the importance of WAL processing
(though that seems marginalized compared to the cost of actually
writing the WAL). Or, as Fujii noted, go the other way and don't have
any shared code between the two but instead implement the WAL receiver
one to use pg_stat_wal_receiver instead. In either case, this
half-and-half implementation seems undesirable.

I refactored the logic to write xlog file to unify collecting the write
stats.
As David said, although pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE)
is not called in the WAL receiver's path,
I agreed that the cost to write the WAL is much bigger.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachments:

v11-0001-Add-statistics-related-to-write-sync-wal-records.patchtext/x-diff; name=v11-0001-Add-statistics-related-to-write-sync-wal-records.patchDownload

From c97fff03cd2bd51b28f6c6fde56c48792683f44e Mon Sep 17 00:00:00 2001
From: Masahiro Ikeda <ikedamsh@oss.nttdata.com>
Date: Fri, 12 Feb 2021 11:19:59 +0900
Subject: [PATCH 1/2] Add statistics related to write/sync wal records.

This patch adds following statistics to pg_stat_wal view
to track WAL I/O activity by backends and background processes
except WAL receiver.

- the total number of times writing/syncing WAL data.
- the total amount of time spent writing/syncing WAL data.

Since to track I/O timing may leads significant overhead,
GUC parameter "track_wal_io_timing" is introduced.
Only if this is on, the I/O timing is measured.

The statistics related to sync are zero when "wal_sync_method"
is "open_datasync" or "open_sync", because it doesn't call each
sync method.

(This requires a catversion bump, as well as an update to
 PGSTAT_FILE_FORMAT_ID)

Author: Masahiro Ikeda
Reviewed-By: Japin Li, Hayato Kuroda, Masahiko Sawada, David
Johnston, Fujii Masao
Discussion:
https://postgr.es/m/0509ad67b585a5b86a83d445dfa75392@oss.nttdata.co
---
 doc/src/sgml/config.sgml                      | 23 +++++++-
 doc/src/sgml/monitoring.sgml                  | 56 ++++++++++++++++++
 doc/src/sgml/wal.sgml                         | 12 +++-
 src/backend/access/transam/xlog.c             | 57 +++++++++++++++++++
 src/backend/catalog/system_views.sql          |  4 ++
 src/backend/postmaster/pgstat.c               |  4 ++
 src/backend/postmaster/walwriter.c            | 16 ++++--
 src/backend/utils/adt/pgstatfuncs.c           | 24 +++++++-
 src/backend/utils/misc/guc.c                  |  9 +++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/access/xlog.h                     |  1 +
 src/include/catalog/pg_proc.dat               |  6 +-
 src/include/pgstat.h                          | 10 ++++
 src/test/regress/expected/rules.out           |  6 +-
 14 files changed, 214 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b5718fc136..c232f537b2 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7439,7 +7439,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       <listitem>
        <para>
         Enables timing of database I/O calls.  This parameter is off by
-        default, because it will repeatedly query the operating system for
+        default, as it will repeatedly query the operating system for
         the current time, which may cause significant overhead on some
         platforms.  You can use the <xref linkend="pgtesttiming"/> tool to
         measure the overhead of timing on your system.
@@ -7453,6 +7453,27 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-track-wal-io-timing" xreflabel="track_wal_io_timing">
+      <term><varname>track_wal_io_timing</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>track_wal_io_timing</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables timing of WAL I/O calls. This parameter is off by default,
+        as it will repeatedly query the operating system for
+        the current time, which may cause significant overhead on some
+        platforms.  You can use the <xref linkend="pgtesttiming"/> tool to
+        measure the overhead of timing on your system.
+        I/O timing information is
+        displayed in <link linkend="monitoring-pg-stat-wal-view">
+        <structname>pg_stat_wal</structname></link>.  Only superusers can
+        change this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-track-functions" xreflabel="track_functions">
       <term><varname>track_functions</varname> (<type>enum</type>)
       <indexterm>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3513e127b7..a16be45a71 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3487,6 +3487,62 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </para></entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of times WAL buffers were written out to disk via
+       <function>XLogWrite</function>, which is invoked during an
+       <function>XLogFlush</function> request (see <xref linkend="wal-configuration"/>)
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time spent writing WAL buffers were written out to disk via
+       <function>XLogWrite</function>, which is invoked during an
+       <function>XLogFlush</function> request (see <xref linkend="wal-configuration"/>),
+       excluding sync time unless 
+       <xref linkend="guc-wal-sync-method"/> is either <literal>open_datasync</literal> or 
+       <literal>open_sync</literal>. Units are in milliseconds with microsecond resolution.
+       This is zero when <xref linkend="guc-track-wal-io-timing"/> is disabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of times WAL files were synced to disk via 
+       <function>issue_xlog_fsync</function>, which is invoked during an 
+       <function>XLogFlush</function> request (see <xref linkend="wal-configuration"/>)
+       while <xref linkend="guc-wal-sync-method"/> was set to one of the 
+       "sync at commit" options (i.e., <literal>fdatasync</literal>, 
+       <literal>fsync</literal>, or <literal>fsync_writethrough</literal>).
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time spent syncing WAL files to disk via
+       <function>issue_xlog_fsync</function>, which is invoked during an 
+       <function>XLogFlush</function> request (see <xref linkend="wal-configuration"/>)
+       while <xref linkend="guc-wal-sync-method"/> was set to one of the 
+       "sync at commit" options (i.e., <literal>fdatasync</literal>, 
+       <literal>fsync</literal>, or <literal>fsync_writethrough</literal>).
+       Units are in milliseconds with microsecond resolution.
+       This is zero when <xref linkend="guc-track-wal-io-timing"/> is disabled.
+      </para></entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index f75527f764..06e4b37012 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -663,7 +663,9 @@
    the <acronym>WAL</acronym> buffers in shared memory. If there is no
    space for the new record, <function>XLogInsertRecord</function> will have
    to write (move to kernel cache) a few filled <acronym>WAL</acronym>
-   buffers. This is undesirable because <function>XLogInsertRecord</function>
+   buffers (the tally of this event is reported in 
+   <literal>wal_buffers_full</literal> in <xref linkend="pg-stat-wal-view"/>).
+   This is undesirable because <function>XLogInsertRecord</function> 
    is used on every database low level modification (for example, row
    insertion) at a time when an exclusive lock is held on affected
    data pages, so the operation needs to be as fast as possible.  What
@@ -672,8 +674,12 @@
    time. Normally, <acronym>WAL</acronym> buffers should be written
    and flushed by an <function>XLogFlush</function> request, which is
    made, for the most part, at transaction commit time to ensure that
-   transaction records are flushed to permanent storage. On systems
-   with high log output, <function>XLogFlush</function> requests might
+   transaction records are flushed to permanent storage. 
+   <function>XLogFlush</function> calls <function>XLogWrite</function> to write 
+   and <function>issue_xlog_fsync</function> to flush them, which are counted as 
+   <literal>wal_write</literal> and <literal>wal_sync</literal> in 
+   <xref linkend="pg-stat-wal-view"/>. On systems with high log output, 
+   <function>XLogFlush</function> requests might
    not occur often enough to prevent <function>XLogInsertRecord</function>
    from having to do writes.  On such systems
    one should increase the number of <acronym>WAL</acronym> buffers by
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fe56324439..98f558b4c7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -110,6 +110,7 @@ int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
 int			max_slot_wal_keep_size_mb = -1;
+bool		track_wal_io_timing = false;
 
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
@@ -2533,6 +2534,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			int			written;
+			instr_time	start;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
@@ -2541,9 +2543,30 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			do
 			{
 				errno = 0;
+
+				/* Measure I/O timing to write WAL data */
+				if (track_wal_io_timing)
+					INSTR_TIME_SET_CURRENT(start);
+
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 				written = pg_pwrite(openLogFile, from, nleft, startoffset);
 				pgstat_report_wait_end();
+
+				/*
+				 * Increment the I/O timing and the number of times WAL data
+				 * were written out to disk.
+				 */
+				if (track_wal_io_timing)
+				{
+					instr_time	duration;
+
+					INSTR_TIME_SET_CURRENT(duration);
+					INSTR_TIME_SUBTRACT(duration, start);
+					WalStats.m_wal_write_time += INSTR_TIME_GET_MICROSEC(duration);
+				}
+
+				WalStats.m_wal_write++;
+
 				if (written <= 0)
 				{
 					char		xlogfname[MAXFNAMELEN];
@@ -10526,6 +10549,21 @@ void
 issue_xlog_fsync(int fd, XLogSegNo segno)
 {
 	char	   *msg = NULL;
+	bool		issue_fsync = false;
+	instr_time	start;
+
+	/* Check whether the WAL file was synced to disk right now */
+	if (enableFsync &&
+		(sync_method == SYNC_METHOD_FSYNC ||
+		 sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
+		 sync_method == SYNC_METHOD_FDATASYNC))
+	{
+		/* Measure I/O timing to sync the WAL file */
+		if (track_wal_io_timing)
+			INSTR_TIME_SET_CURRENT(start);
+
+		issue_fsync = true;
+	}
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
 	switch (sync_method)
@@ -10570,6 +10608,25 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
 	}
 
 	pgstat_report_wait_end();
+
+	/*
+	 * Increment the I/O timing and the number of times WAL files were synced.
+	 *
+	 * Check whether the WAL file was synced to disk right now because
+	 * statistics must be incremented when syncing really occurred.
+	 */
+	if (issue_fsync)
+	{
+		if (track_wal_io_timing)
+		{
+			instr_time	duration;
+
+			INSTR_TIME_SET_CURRENT(duration);
+			INSTR_TIME_SUBTRACT(duration, start);
+			WalStats.m_wal_sync_time += INSTR_TIME_GET_MICROSEC(duration);
+		}
+		WalStats.m_wal_sync++;
+	}
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fc94a73a54..ba5158ba57 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1005,6 +1005,10 @@ CREATE VIEW pg_stat_wal AS
         w.wal_fpi,
         w.wal_bytes,
         w.wal_buffers_full,
+        w.wal_write,
+        w.wal_write_time,
+        w.wal_sync,
+        w.wal_sync_time,
         w.stats_reset
     FROM pg_stat_get_wal() w;
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f75b52719d..987bbd058d 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -6892,6 +6892,10 @@ pgstat_recv_wal(PgStat_MsgWal *msg, int len)
 	walStats.wal_fpi += msg->m_wal_fpi;
 	walStats.wal_bytes += msg->m_wal_bytes;
 	walStats.wal_buffers_full += msg->m_wal_buffers_full;
+	walStats.wal_write += msg->m_wal_write;
+	walStats.wal_write_time += msg->m_wal_write_time;
+	walStats.wal_sync += msg->m_wal_sync;
+	walStats.wal_sync_time += msg->m_wal_sync_time;
 }
 
 /* ----------
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index 4f1a8e356b..08fa7032c0 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -223,6 +223,7 @@ WalWriterMain(void)
 	for (;;)
 	{
 		long		cur_timeout;
+		int			rc;
 
 		/*
 		 * Advertise whether we might hibernate in this cycle.  We do this
@@ -263,9 +264,16 @@ WalWriterMain(void)
 		else
 			cur_timeout = WalWriterDelay * HIBERNATE_FACTOR;
 
-		(void) WaitLatch(MyLatch,
-						 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
-						 cur_timeout,
-						 WAIT_EVENT_WAL_WRITER_MAIN);
+		rc = WaitLatch(MyLatch,
+					   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+					   cur_timeout,
+					   WAIT_EVENT_WAL_WRITER_MAIN);
+
+		/*
+		 * Send WAL statistics only if WalWriterDelay has elapsed to minimize
+		 * the overhead in WAL-writing.
+		 */
+		if (rc & WL_TIMEOUT)
+			pgstat_send_wal();
 	}
 }
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 62bff52638..7296ef04ff 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1799,7 +1799,7 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_wal(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_COLS	5
+#define PG_STAT_GET_WAL_COLS	9
 	TupleDesc	tupdesc;
 	Datum		values[PG_STAT_GET_WAL_COLS];
 	bool		nulls[PG_STAT_GET_WAL_COLS];
@@ -1820,7 +1820,15 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 					   NUMERICOID, -1, 0);
 	TupleDescInitEntry(tupdesc, (AttrNumber) 4, "wal_buffers_full",
 					   INT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 5, "stats_reset",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 5, "wal_write",
+					   INT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 6, "wal_write_time",
+					   FLOAT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 7, "wal_sync",
+					   INT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 8, "wal_sync_time",
+					   FLOAT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 9, "stats_reset",
 					   TIMESTAMPTZOID, -1, 0);
 
 	BlessTupleDesc(tupdesc);
@@ -1840,7 +1848,17 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 									Int32GetDatum(-1));
 
 	values[3] = Int64GetDatum(wal_stats->wal_buffers_full);
-	values[4] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
+	values[4] = Int64GetDatum(wal_stats->wal_write);
+
+	/* convert counter from microsec to millisec for display */
+	values[5] = Float8GetDatum((double) wal_stats->wal_write_time / 1000.0);
+
+	values[6] = Int64GetDatum(wal_stats->wal_sync);
+
+	/* convert counter from microsec to millisec for display */
+	values[7] = Float8GetDatum((double) wal_stats->wal_sync_time / 1000.0);
+
+	values[8] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
 
 	/* Returns the record as Datum */
 	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index c5b7f01e1a..958e84a962 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1485,6 +1485,15 @@ static struct config_bool ConfigureNamesBool[] =
 		false,
 		NULL, NULL, NULL
 	},
+	{
+		{"track_wal_io_timing", PGC_SUSET, STATS_COLLECTOR,
+			gettext_noop("Collects timing statistics for WAL I/O activity."),
+			NULL
+		},
+		&track_wal_io_timing,
+		false,
+		NULL, NULL, NULL
+	},
 
 	{
 		{"update_process_title", PGC_SUSET, PROCESS_TITLE,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index ee06528bb0..c6483fa1ff 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -586,6 +586,7 @@
 #track_activities = on
 #track_counts = on
 #track_io_timing = off
+#track_wal_io_timing = off
 #track_functions = none			# none, pl, all
 #track_activity_query_size = 1024	# (change requires restart)
 #stats_temp_directory = 'pg_stat_tmp'
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 75ec1073bd..1e53d9d4ca 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -131,6 +131,7 @@ extern int	recovery_min_apply_delay;
 extern char *PrimaryConnInfo;
 extern char *PrimarySlotName;
 extern bool wal_receiver_create_temp_slot;
+extern bool track_wal_io_timing;
 
 /* indirectly set via GUC system */
 extern TransactionId recoveryTargetXid;
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 447f9ae44d..91ac0479c2 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5543,9 +5543,9 @@
 { oid => '1136', descr => 'statistics: information about WAL activity',
   proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
   proparallel => 'r', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int8,int8,numeric,int8,timestamptz}',
-  proargmodes => '{o,o,o,o,o}',
-  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,stats_reset}',
+  proallargtypes => '{int8,int8,numeric,int8,int8,float8,int8,float8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o}',
+  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,wal_write,wal_write_time,wal_sync,wal_sync_time,stats_reset}',
   prosrc => 'pg_stat_get_wal' },
 
 { oid => '2306', descr => 'statistics: information about SLRU caches',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 724068cf87..000bb14c0b 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -474,6 +474,12 @@ typedef struct PgStat_MsgWal
 	PgStat_Counter m_wal_fpi;
 	uint64		m_wal_bytes;
 	PgStat_Counter m_wal_buffers_full;
+	PgStat_Counter m_wal_write;
+	PgStat_Counter m_wal_write_time;	/* time spend writing wal records in
+										 * micro seconds */
+	PgStat_Counter m_wal_sync;
+	PgStat_Counter m_wal_sync_time; /* time spend syncing wal records in micro
+									 * seconds */
 } PgStat_MsgWal;
 
 /* ----------
@@ -839,6 +845,10 @@ typedef struct PgStat_WalStats
 	PgStat_Counter wal_fpi;
 	uint64		wal_bytes;
 	PgStat_Counter wal_buffers_full;
+	PgStat_Counter wal_write;
+	PgStat_Counter wal_write_time;
+	PgStat_Counter wal_sync;
+	PgStat_Counter wal_sync_time;
 	TimestampTz stat_reset_timestamp;
 } PgStat_WalStats;
 
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index b1c9b7bdfe..fce6ab8338 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2160,8 +2160,12 @@ pg_stat_wal| SELECT w.wal_records,
     w.wal_fpi,
     w.wal_bytes,
     w.wal_buffers_full,
+    w.wal_write,
+    w.wal_write_time,
+    w.wal_sync,
+    w.wal_sync_time,
     w.stats_reset
-   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, stats_reset);
+   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, wal_write, wal_write_time, wal_sync, wal_sync_time, stats_reset);
 pg_stat_wal_receiver| SELECT s.pid,
     s.status,
     s.receive_start_lsn,
-- 
2.25.1

v11-0002-Makes-the-wal-receiver-report-WAL-statistics.patchtext/x-diff; name=v11-0002-Makes-the-wal-receiver-report-WAL-statistics.patchDownload

From e4e7df78c059b9c888544918789654ebdde3e2eb Mon Sep 17 00:00:00 2001
From: Masahiro Ikeda <ikedamsh@oss.nttdata.com>
Date: Tue, 2 Mar 2021 16:36:17 +0900
Subject: [PATCH 2/2] Makes the wal receiver report WAL statistics

This patch makes the WAL receiver report WAL statistics.

- fundamentally changes how the stats collector interacts
  with the WAL receiver.

- unifying the logic to collect xlog write stats for the
  WAL receiver and the others to avoid duplicate logic.

Author: Masahiro Ikeda
Reviewed-By: Japin Li, Hayato Kuroda, Masahiko Sawada, David Johnston,
Fujii Masao
Discussion:
https://postgr.es/m/0509ad67b585a5b86a83d445dfa75392@oss.nttdata.com
---
 doc/src/sgml/monitoring.sgml          |  4 ++
 src/backend/access/transam/xlog.c     | 64 +++++++++++++++++----------
 src/backend/replication/walreceiver.c | 12 ++++-
 src/include/access/xlog.h             |  1 +
 4 files changed, 57 insertions(+), 24 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index a16be45a71..66525e184f 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3495,6 +3495,7 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
        Number of times WAL buffers were written out to disk via
        <function>XLogWrite</function>, which is invoked during an
        <function>XLogFlush</function> request (see <xref linkend="wal-configuration"/>)
+       , or WAL data written out to disk by the WAL receiver.
       </para></entry>
      </row>
 
@@ -3506,6 +3507,7 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
        Total amount of time spent writing WAL buffers were written out to disk via
        <function>XLogWrite</function>, which is invoked during an
        <function>XLogFlush</function> request (see <xref linkend="wal-configuration"/>),
+       or WAL data written out to disk by the WAL receiver, 
        excluding sync time unless 
        <xref linkend="guc-wal-sync-method"/> is either <literal>open_datasync</literal> or 
        <literal>open_sync</literal>. Units are in milliseconds with microsecond resolution.
@@ -3521,6 +3523,7 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
        Number of times WAL files were synced to disk via 
        <function>issue_xlog_fsync</function>, which is invoked during an 
        <function>XLogFlush</function> request (see <xref linkend="wal-configuration"/>)
+       by backends and background processes including the WAL receiver
        while <xref linkend="guc-wal-sync-method"/> was set to one of the 
        "sync at commit" options (i.e., <literal>fdatasync</literal>, 
        <literal>fsync</literal>, or <literal>fsync_writethrough</literal>).
@@ -3535,6 +3538,7 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
        Total amount of time spent syncing WAL files to disk via
        <function>issue_xlog_fsync</function>, which is invoked during an 
        <function>XLogFlush</function> request (see <xref linkend="wal-configuration"/>)
+       by backends and background processes including the WAL receiver
        while <xref linkend="guc-wal-sync-method"/> was set to one of the 
        "sync at commit" options (i.e., <literal>fdatasync</literal>, 
        <literal>fsync</literal>, or <literal>fsync_writethrough</literal>).
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 98f558b4c7..c220b6b776 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2534,7 +2534,6 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			int			written;
-			instr_time	start;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
@@ -2544,28 +2543,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			{
 				errno = 0;
 
-				/* Measure I/O timing to write WAL data */
-				if (track_wal_io_timing)
-					INSTR_TIME_SET_CURRENT(start);
-
-				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
-				written = pg_pwrite(openLogFile, from, nleft, startoffset);
-				pgstat_report_wait_end();
-
-				/*
-				 * Increment the I/O timing and the number of times WAL data
-				 * were written out to disk.
-				 */
-				if (track_wal_io_timing)
-				{
-					instr_time	duration;
-
-					INSTR_TIME_SET_CURRENT(duration);
-					INSTR_TIME_SUBTRACT(duration, start);
-					WalStats.m_wal_write_time += INSTR_TIME_GET_MICROSEC(duration);
-				}
-
-				WalStats.m_wal_write++;
+				written = XLogWriteFile(openLogFile, from, nleft, startoffset);
 
 				if (written <= 0)
 				{
@@ -2705,6 +2683,46 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 	}
 }
 
+/*
+ * Issue pg_pwrite to write an XLOG file.
+ *
+ * 'fd' is a file descriptor for the XLOG file to write
+ * 'buf' is a buffer starting address to write.
+ * 'nbyte' is a number of max bytes to write up.
+ * 'offset' is a offset of XLOG file to be set.
+ */
+int
+XLogWriteFile(int fd, const void *buf, size_t nbyte, off_t offset)
+{
+	int written;
+	instr_time	start;
+
+	/* Measure I/O timing to write WAL data */
+	if (track_wal_io_timing)
+		INSTR_TIME_SET_CURRENT(start);
+
+	pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
+	written = pg_pwrite(fd, buf, nbyte, offset);
+	pgstat_report_wait_end();
+
+	/*
+	 * Increment the I/O timing and the number of times WAL data were
+	 * written out to disk.
+	 */
+	if (track_wal_io_timing)
+	{
+		instr_time	duration;
+
+		INSTR_TIME_SET_CURRENT(duration);
+		INSTR_TIME_SUBTRACT(duration, start);
+		WalStats.m_wal_write_time += INSTR_TIME_GET_MICROSEC(duration);
+	}
+
+	WalStats.m_wal_write++;
+
+	return written;
+}
+
 /*
  * Record the LSN for an asynchronous transaction commit/abort
  * and nudge the WALWriter if there is work for it to do.
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 7810ee916c..f9834b8302 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -770,6 +770,9 @@ WalRcvDie(int code, Datum arg)
 	/* Ensure that all WAL records received are flushed to disk */
 	XLogWalRcvFlush(true);
 
+	/* Send WAL statistics to the stats collector before terminating */
+	pgstat_send_wal();
+
 	/* Mark ourselves inactive in shared memory */
 	SpinLockAcquire(&walrcv->mutex);
 	Assert(walrcv->walRcvState == WALRCV_STREAMING ||
@@ -907,6 +910,12 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 					XLogArchiveForceDone(xlogfname);
 				else
 					XLogArchiveNotify(xlogfname);
+
+				/*
+				 * Send WAL statistics to the stats collector when finishing
+				 * the current WAL segment file to avoid overloading it.
+				 */
+				pgstat_send_wal();
 			}
 			recvFile = -1;
 
@@ -928,7 +937,8 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 		/* OK to write the logs */
 		errno = 0;
 
-		byteswritten = pg_pwrite(recvFile, buf, segbytes, (off_t) startoff);
+		byteswritten = XLogWriteFile(recvFile, buf, segbytes, (off_t) startoff);
+
 		if (byteswritten <= 0)
 		{
 			char		xlogfname[MAXFNAMELEN];
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 1e53d9d4ca..b345de8a28 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -290,6 +290,7 @@ extern bool XLogBackgroundFlush(void);
 extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
 extern int	XLogFileInit(XLogSegNo segno, bool *use_existent, bool use_lock);
 extern int	XLogFileOpen(XLogSegNo segno);
+extern int	XLogWriteFile(int fd, const void *buf, size_t nbyte, off_t offset);
 
 extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
 extern XLogSegNo XLogGetLastRemovedSegno(void);
-- 
2.25.1

#37

Fujii Masao

masao.fujii@oss.nttdata.com

almost 5 years ago

In reply to: Masahiro Ikeda (#36)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021/03/03 14:33, Masahiro Ikeda wrote:

On 2021-02-24 16:14, Fujii Masao wrote:

On 2021/02/15 11:59, Masahiro Ikeda wrote:

On 2021-02-10 00:51, David G. Johnston wrote:

On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

I pgindented the patches.

... <function>XLogWrite</function>, which is invoked during an
<function>XLogFlush</function> request (see ...). This is also
incremented by the WAL receiver during replication.

("which normally called" should be "which is normally called" or
"which normally is called" if you want to keep true to the original)
You missed the adding the space before an opening parenthesis here and
elsewhere (probably copy-paste)

is ether -> is either
"This parameter is off by default as it will repeatedly query the
operating system..."
", because" -> "as"

Thanks, I fixed them.

wal_write_time and the sync items also need the note: "This is also
incremented by the WAL receiver during replication."

I skipped changing it since I separated the stats for the WAL receiver
in pg_stat_wal_receiver.

"The number of times it happened..." -> " (the tally of this event is
reported in wal_buffers_full in....) This is undesirable because ..."

Thanks, I fixed it.

I notice that the patch for WAL receiver doesn't require explicitly
computing the sync statistics but does require computing the write
statistics. This is because of the presence of issue_xlog_fsync but
absence of an equivalent pg_xlog_pwrite. Additionally, I observe that
the XLogWrite code path calls pgstat_report_wait_*() while the WAL
receiver path does not. It seems technically straight-forward to
refactor here to avoid the almost-duplicated logic in the two places,
though I suspect there may be a trade-off for not adding another
function call to the stack given the importance of WAL processing
(though that seems marginalized compared to the cost of actually
writing the WAL). Or, as Fujii noted, go the other way and don't have
any shared code between the two but instead implement the WAL receiver
one to use pg_stat_wal_receiver instead. In either case, this
half-and-half implementation seems undesirable.

OK, as Fujii-san mentioned, I separated the WAL receiver stats.
(v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)

Thanks for updating the patches!

I added the infrastructure code to communicate the WAL receiver stats messages between the WAL receiver and the stats collector, and
the stats for WAL receiver is counted in pg_stat_wal_receiver.
What do you think?

On second thought, this idea seems not good. Because those stats are
collected between multiple walreceivers, but other values in
pg_stat_wal_receiver is only related to the walreceiver process running
at that moment. IOW, it seems strange that some values show dynamic
stats and the others show collected stats, even though they are in
the same view pg_stat_wal_receiver. Thought?

OK, I fixed it.
The stats collected in the WAL receiver is exposed in pg_stat_wal view in v11 patch.

Thanks for updating the patches! I'm now reading 001 patch.

+	/* Check whether the WAL file was synced to disk right now */
+	if (enableFsync &&
+		(sync_method == SYNC_METHOD_FSYNC ||
+		 sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
+		 sync_method == SYNC_METHOD_FDATASYNC))
+	{

Isn't it better to make issue_xlog_fsync() return immediately
if enableFsync is off, sync_method is open_sync or open_data_sync,
to simplify the code more?

+		/*
+		 * Send WAL statistics only if WalWriterDelay has elapsed to minimize
+		 * the overhead in WAL-writing.
+		 */
+		if (rc & WL_TIMEOUT)
+			pgstat_send_wal();

On second thought, this change means that it always takes wal_writer_delay
before walwriter's WAL stats is sent after XLogBackgroundFlush() is called.
For example, if wal_writer_delay is set to several seconds, some values in
pg_stat_wal would be not up-to-date meaninglessly for those seconds.
So I'm thinking to withdraw my previous comment and it's ok to send
the stats every after XLogBackgroundFlush() is called. Thought?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#38

Masahiro Ikeda

ikedamsh@oss.nttdata.com

almost 5 years ago

In reply to: Fujii Masao (#37)

2 attachment(s)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021-03-03 16:30, Fujii Masao wrote:

On 2021/03/03 14:33, Masahiro Ikeda wrote:

On 2021-02-24 16:14, Fujii Masao wrote:

On 2021/02/15 11:59, Masahiro Ikeda wrote:

On 2021-02-10 00:51, David G. Johnston wrote:

On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

I pgindented the patches.

... <function>XLogWrite</function>, which is invoked during an
<function>XLogFlush</function> request (see ...). This is also
incremented by the WAL receiver during replication.

("which normally called" should be "which is normally called" or
"which normally is called" if you want to keep true to the
original)
You missed the adding the space before an opening parenthesis here
and
elsewhere (probably copy-paste)

is ether -> is either
"This parameter is off by default as it will repeatedly query the
operating system..."
", because" -> "as"

Thanks, I fixed them.

wal_write_time and the sync items also need the note: "This is also
incremented by the WAL receiver during replication."

I skipped changing it since I separated the stats for the WAL
receiver
in pg_stat_wal_receiver.

"The number of times it happened..." -> " (the tally of this event
is
reported in wal_buffers_full in....) This is undesirable because
..."

Thanks, I fixed it.

I notice that the patch for WAL receiver doesn't require explicitly
computing the sync statistics but does require computing the write
statistics. This is because of the presence of issue_xlog_fsync
but
absence of an equivalent pg_xlog_pwrite. Additionally, I observe
that
the XLogWrite code path calls pgstat_report_wait_*() while the WAL
receiver path does not. It seems technically straight-forward to
refactor here to avoid the almost-duplicated logic in the two
places,
though I suspect there may be a trade-off for not adding another
function call to the stack given the importance of WAL processing
(though that seems marginalized compared to the cost of actually
writing the WAL). Or, as Fujii noted, go the other way and don't
have
any shared code between the two but instead implement the WAL
receiver
one to use pg_stat_wal_receiver instead. In either case, this
half-and-half implementation seems undesirable.

OK, as Fujii-san mentioned, I separated the WAL receiver stats.
(v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)

Thanks for updating the patches!

I added the infrastructure code to communicate the WAL receiver
stats messages between the WAL receiver and the stats collector, and
the stats for WAL receiver is counted in pg_stat_wal_receiver.
What do you think?

On second thought, this idea seems not good. Because those stats are
collected between multiple walreceivers, but other values in
pg_stat_wal_receiver is only related to the walreceiver process
running
at that moment. IOW, it seems strange that some values show dynamic
stats and the others show collected stats, even though they are in
the same view pg_stat_wal_receiver. Thought?

OK, I fixed it.
The stats collected in the WAL receiver is exposed in pg_stat_wal view
in v11 patch.

Thanks for updating the patches! I'm now reading 001 patch.
+	/* Check whether the WAL file was synced to disk right now */
+	if (enableFsync &&
+		(sync_method == SYNC_METHOD_FSYNC ||
+		 sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
+		 sync_method == SYNC_METHOD_FDATASYNC))
+	{
Isn't it better to make issue_xlog_fsync() return immediately
if enableFsync is off, sync_method is open_sync or open_data_sync,
to simplify the code more?

Thanks for the comments.
I added the above code in v12 patch.

+		/*
+		 * Send WAL statistics only if WalWriterDelay has elapsed to 
minimize
+		 * the overhead in WAL-writing.
+		 */
+		if (rc & WL_TIMEOUT)
+			pgstat_send_wal();
On second thought, this change means that it always takes
wal_writer_delay
before walwriter's WAL stats is sent after XLogBackgroundFlush() is
called.
For example, if wal_writer_delay is set to several seconds, some values
in
pg_stat_wal would be not up-to-date meaninglessly for those seconds.
So I'm thinking to withdraw my previous comment and it's ok to send
the stats every after XLogBackgroundFlush() is called. Thought?

Thanks, I didn't notice that.

Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's
default value is 200msec and it may be set shorter time.

Why don't to make another way to check the timestamp?

+               /*
+                * Don't send a message unless it's been at least 
PGSTAT_STAT_INTERVAL
+                * msec since we last sent one
+                */
+               now = GetCurrentTimestamp();
+               if (TimestampDifferenceExceeds(last_report, now, 
PGSTAT_STAT_INTERVAL))
+               {
+                       pgstat_send_wal();
+                       last_report = now;
+               }
+

Although I worried that it's better to add the check code in
pgstat_send_wal(),
I didn't do so because to avoid to double check PGSTAT_STAT_INTERVAL.
pgstat_send_wal() is invoked pg_report_stat() and it already checks the
PGSTAT_STAT_INTERVAL.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachments:

v12-0001-Add-statistics-related-to-write-sync-wal-records.patchtext/x-diff; name=v12-0001-Add-statistics-related-to-write-sync-wal-records.patchDownload

From d870cfa78e501097cc56780ebb3140db6b9261e5 Mon Sep 17 00:00:00 2001
From: Masahiro Ikeda <ikedamsh@oss.nttdata.com>
Date: Fri, 12 Feb 2021 11:19:59 +0900
Subject: [PATCH 1/2] Add statistics related to write/sync wal records.

This patch adds following statistics to pg_stat_wal view
to track WAL I/O activity by backends and background processes
except WAL receiver.

- the total number of times writing/syncing WAL data.
- the total amount of time spent writing/syncing WAL data.

Since to track I/O timing may leads significant overhead,
GUC parameter "track_wal_io_timing" is introduced.
Only if this is on, the I/O timing is measured.

The statistics related to sync are zero when "wal_sync_method"
is "open_datasync" or "open_sync", because it doesn't call each
sync method.

(This requires a catversion bump, as well as an update to
 PGSTAT_FILE_FORMAT_ID)

Author: Masahiro Ikeda
Reviewed-By: Japin Li, Hayato Kuroda, Masahiko Sawada, David
Johnston, Fujii Masao
Discussion:
https://postgr.es/m/0509ad67b585a5b86a83d445dfa75392@oss.nttdata.co
---
 doc/src/sgml/config.sgml                      | 23 +++++++-
 doc/src/sgml/monitoring.sgml                  | 56 ++++++++++++++++++
 doc/src/sgml/wal.sgml                         | 12 +++-
 src/backend/access/transam/xlog.c             | 59 +++++++++++++++++--
 src/backend/catalog/system_views.sql          |  4 ++
 src/backend/postmaster/pgstat.c               |  4 ++
 src/backend/postmaster/walwriter.c            | 21 +++++++
 src/backend/utils/adt/pgstatfuncs.c           | 24 +++++++-
 src/backend/utils/misc/guc.c                  |  9 +++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/access/xlog.h                     |  1 +
 src/include/catalog/pg_proc.dat               |  6 +-
 src/include/pgstat.h                          | 10 ++++
 src/test/regress/expected/rules.out           |  6 +-
 14 files changed, 221 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b5718fc136..c232f537b2 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7439,7 +7439,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       <listitem>
        <para>
         Enables timing of database I/O calls.  This parameter is off by
-        default, because it will repeatedly query the operating system for
+        default, as it will repeatedly query the operating system for
         the current time, which may cause significant overhead on some
         platforms.  You can use the <xref linkend="pgtesttiming"/> tool to
         measure the overhead of timing on your system.
@@ -7453,6 +7453,27 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-track-wal-io-timing" xreflabel="track_wal_io_timing">
+      <term><varname>track_wal_io_timing</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>track_wal_io_timing</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables timing of WAL I/O calls. This parameter is off by default,
+        as it will repeatedly query the operating system for
+        the current time, which may cause significant overhead on some
+        platforms.  You can use the <xref linkend="pgtesttiming"/> tool to
+        measure the overhead of timing on your system.
+        I/O timing information is
+        displayed in <link linkend="monitoring-pg-stat-wal-view">
+        <structname>pg_stat_wal</structname></link>.  Only superusers can
+        change this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-track-functions" xreflabel="track_functions">
       <term><varname>track_functions</varname> (<type>enum</type>)
       <indexterm>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3513e127b7..a16be45a71 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3487,6 +3487,62 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </para></entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of times WAL buffers were written out to disk via
+       <function>XLogWrite</function>, which is invoked during an
+       <function>XLogFlush</function> request (see <xref linkend="wal-configuration"/>)
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time spent writing WAL buffers were written out to disk via
+       <function>XLogWrite</function>, which is invoked during an
+       <function>XLogFlush</function> request (see <xref linkend="wal-configuration"/>),
+       excluding sync time unless 
+       <xref linkend="guc-wal-sync-method"/> is either <literal>open_datasync</literal> or 
+       <literal>open_sync</literal>. Units are in milliseconds with microsecond resolution.
+       This is zero when <xref linkend="guc-track-wal-io-timing"/> is disabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of times WAL files were synced to disk via 
+       <function>issue_xlog_fsync</function>, which is invoked during an 
+       <function>XLogFlush</function> request (see <xref linkend="wal-configuration"/>)
+       while <xref linkend="guc-wal-sync-method"/> was set to one of the 
+       "sync at commit" options (i.e., <literal>fdatasync</literal>, 
+       <literal>fsync</literal>, or <literal>fsync_writethrough</literal>).
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time spent syncing WAL files to disk via
+       <function>issue_xlog_fsync</function>, which is invoked during an 
+       <function>XLogFlush</function> request (see <xref linkend="wal-configuration"/>)
+       while <xref linkend="guc-wal-sync-method"/> was set to one of the 
+       "sync at commit" options (i.e., <literal>fdatasync</literal>, 
+       <literal>fsync</literal>, or <literal>fsync_writethrough</literal>).
+       Units are in milliseconds with microsecond resolution.
+       This is zero when <xref linkend="guc-track-wal-io-timing"/> is disabled.
+      </para></entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index f75527f764..06e4b37012 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -663,7 +663,9 @@
    the <acronym>WAL</acronym> buffers in shared memory. If there is no
    space for the new record, <function>XLogInsertRecord</function> will have
    to write (move to kernel cache) a few filled <acronym>WAL</acronym>
-   buffers. This is undesirable because <function>XLogInsertRecord</function>
+   buffers (the tally of this event is reported in 
+   <literal>wal_buffers_full</literal> in <xref linkend="pg-stat-wal-view"/>).
+   This is undesirable because <function>XLogInsertRecord</function> 
    is used on every database low level modification (for example, row
    insertion) at a time when an exclusive lock is held on affected
    data pages, so the operation needs to be as fast as possible.  What
@@ -672,8 +674,12 @@
    time. Normally, <acronym>WAL</acronym> buffers should be written
    and flushed by an <function>XLogFlush</function> request, which is
    made, for the most part, at transaction commit time to ensure that
-   transaction records are flushed to permanent storage. On systems
-   with high log output, <function>XLogFlush</function> requests might
+   transaction records are flushed to permanent storage. 
+   <function>XLogFlush</function> calls <function>XLogWrite</function> to write 
+   and <function>issue_xlog_fsync</function> to flush them, which are counted as 
+   <literal>wal_write</literal> and <literal>wal_sync</literal> in 
+   <xref linkend="pg-stat-wal-view"/>. On systems with high log output, 
+   <function>XLogFlush</function> requests might
    not occur often enough to prevent <function>XLogInsertRecord</function>
    from having to do writes.  On such systems
    one should increase the number of <acronym>WAL</acronym> buffers by
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fe56324439..fafc1b7fac 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -110,6 +110,7 @@ int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
 int			max_slot_wal_keep_size_mb = -1;
+bool		track_wal_io_timing = false;
 
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
@@ -2533,6 +2534,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			int			written;
+			instr_time	start;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
@@ -2541,9 +2543,30 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			do
 			{
 				errno = 0;
+
+				/* Measure I/O timing to write WAL data */
+				if (track_wal_io_timing)
+					INSTR_TIME_SET_CURRENT(start);
+
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 				written = pg_pwrite(openLogFile, from, nleft, startoffset);
 				pgstat_report_wait_end();
+
+				/*
+				 * Increment the I/O timing and the number of times WAL data
+				 * were written out to disk.
+				 */
+				if (track_wal_io_timing)
+				{
+					instr_time	duration;
+
+					INSTR_TIME_SET_CURRENT(duration);
+					INSTR_TIME_SUBTRACT(duration, start);
+					WalStats.m_wal_write_time += INSTR_TIME_GET_MICROSEC(duration);
+				}
+
+				WalStats.m_wal_write++;
+
 				if (written <= 0)
 				{
 					char		xlogfname[MAXFNAMELEN];
@@ -10526,6 +10549,24 @@ void
 issue_xlog_fsync(int fd, XLogSegNo segno)
 {
 	char	   *msg = NULL;
+	instr_time	start;
+
+	/*
+	 * Check whether the WAL file was synced to disk right now.
+	 *
+	 * If fsync is disabled, never issue fsync method.
+	 *
+	 * If sync_mothod is SYNC_METHOD_OPEN or SYNC_METHOD_OPEN_DSYNC, the WAL
+	 * file is already synced.
+	 */
+	if (!enableFsync ||
+		sync_method == SYNC_METHOD_OPEN ||
+		sync_method == SYNC_METHOD_OPEN_DSYNC)
+		return;
+
+	/* Measure I/O timing to sync the WAL file */
+	if (track_wal_io_timing)
+		INSTR_TIME_SET_CURRENT(start);
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
 	switch (sync_method)
@@ -10546,10 +10587,6 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
 				msg = _("could not fdatasync file \"%s\": %m");
 			break;
 #endif
-		case SYNC_METHOD_OPEN:
-		case SYNC_METHOD_OPEN_DSYNC:
-			/* write synced it already */
-			break;
 		default:
 			elog(PANIC, "unrecognized wal_sync_method: %d", sync_method);
 			break;
@@ -10570,6 +10607,20 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
 	}
 
 	pgstat_report_wait_end();
+
+	/*
+	 * Increment the I/O timing and the number of times WAL files were synced.
+	 */
+	if (track_wal_io_timing)
+	{
+		instr_time	duration;
+
+		INSTR_TIME_SET_CURRENT(duration);
+		INSTR_TIME_SUBTRACT(duration, start);
+		WalStats.m_wal_sync_time += INSTR_TIME_GET_MICROSEC(duration);
+	}
+
+	WalStats.m_wal_sync++;
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fc94a73a54..ba5158ba57 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1005,6 +1005,10 @@ CREATE VIEW pg_stat_wal AS
         w.wal_fpi,
         w.wal_bytes,
         w.wal_buffers_full,
+        w.wal_write,
+        w.wal_write_time,
+        w.wal_sync,
+        w.wal_sync_time,
         w.stats_reset
     FROM pg_stat_get_wal() w;
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f75b52719d..987bbd058d 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -6892,6 +6892,10 @@ pgstat_recv_wal(PgStat_MsgWal *msg, int len)
 	walStats.wal_fpi += msg->m_wal_fpi;
 	walStats.wal_bytes += msg->m_wal_bytes;
 	walStats.wal_buffers_full += msg->m_wal_buffers_full;
+	walStats.wal_write += msg->m_wal_write;
+	walStats.wal_write_time += msg->m_wal_write_time;
+	walStats.wal_sync += msg->m_wal_sync;
+	walStats.wal_sync_time += msg->m_wal_sync_time;
 }
 
 /* ----------
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index 4f1a8e356b..8491e6f6d6 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -78,6 +78,11 @@ int			WalWriterFlushAfter = 128;
 #define LOOPS_UNTIL_HIBERNATE		50
 #define HIBERNATE_FACTOR			25
 
+/*
+ * Minimum time between stats file updates; in milliseconds.
+ */
+#define PGSTAT_STAT_INTERVAL	500
+
 /*
  * Main entry point for walwriter process
  *
@@ -222,7 +227,12 @@ WalWriterMain(void)
 	 */
 	for (;;)
 	{
+		/* we assume this inits to all zeroes: */
+		static TimestampTz last_report = 0;
+
 		long		cur_timeout;
+		int			rc;
+		TimestampTz now;
 
 		/*
 		 * Advertise whether we might hibernate in this cycle.  We do this
@@ -253,6 +263,17 @@ WalWriterMain(void)
 		else if (left_till_hibernate > 0)
 			left_till_hibernate--;
 
+		/*
+		 * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
+		 * msec since we last sent one
+		 */
+		now = GetCurrentTimestamp();
+		if (TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
+		{
+			pgstat_send_wal();
+			last_report = now;
+		}
+
 		/*
 		 * Sleep until we are signaled or WalWriterDelay has elapsed.  If we
 		 * haven't done anything useful for quite some time, lengthen the
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 62bff52638..7296ef04ff 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1799,7 +1799,7 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_wal(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_COLS	5
+#define PG_STAT_GET_WAL_COLS	9
 	TupleDesc	tupdesc;
 	Datum		values[PG_STAT_GET_WAL_COLS];
 	bool		nulls[PG_STAT_GET_WAL_COLS];
@@ -1820,7 +1820,15 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 					   NUMERICOID, -1, 0);
 	TupleDescInitEntry(tupdesc, (AttrNumber) 4, "wal_buffers_full",
 					   INT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 5, "stats_reset",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 5, "wal_write",
+					   INT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 6, "wal_write_time",
+					   FLOAT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 7, "wal_sync",
+					   INT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 8, "wal_sync_time",
+					   FLOAT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 9, "stats_reset",
 					   TIMESTAMPTZOID, -1, 0);
 
 	BlessTupleDesc(tupdesc);
@@ -1840,7 +1848,17 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 									Int32GetDatum(-1));
 
 	values[3] = Int64GetDatum(wal_stats->wal_buffers_full);
-	values[4] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
+	values[4] = Int64GetDatum(wal_stats->wal_write);
+
+	/* convert counter from microsec to millisec for display */
+	values[5] = Float8GetDatum((double) wal_stats->wal_write_time / 1000.0);
+
+	values[6] = Int64GetDatum(wal_stats->wal_sync);
+
+	/* convert counter from microsec to millisec for display */
+	values[7] = Float8GetDatum((double) wal_stats->wal_sync_time / 1000.0);
+
+	values[8] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
 
 	/* Returns the record as Datum */
 	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index c5b7f01e1a..958e84a962 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1485,6 +1485,15 @@ static struct config_bool ConfigureNamesBool[] =
 		false,
 		NULL, NULL, NULL
 	},
+	{
+		{"track_wal_io_timing", PGC_SUSET, STATS_COLLECTOR,
+			gettext_noop("Collects timing statistics for WAL I/O activity."),
+			NULL
+		},
+		&track_wal_io_timing,
+		false,
+		NULL, NULL, NULL
+	},
 
 	{
 		{"update_process_title", PGC_SUSET, PROCESS_TITLE,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index ee06528bb0..c6483fa1ff 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -586,6 +586,7 @@
 #track_activities = on
 #track_counts = on
 #track_io_timing = off
+#track_wal_io_timing = off
 #track_functions = none			# none, pl, all
 #track_activity_query_size = 1024	# (change requires restart)
 #stats_temp_directory = 'pg_stat_tmp'
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 75ec1073bd..1e53d9d4ca 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -131,6 +131,7 @@ extern int	recovery_min_apply_delay;
 extern char *PrimaryConnInfo;
 extern char *PrimarySlotName;
 extern bool wal_receiver_create_temp_slot;
+extern bool track_wal_io_timing;
 
 /* indirectly set via GUC system */
 extern TransactionId recoveryTargetXid;
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 447f9ae44d..91ac0479c2 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5543,9 +5543,9 @@
 { oid => '1136', descr => 'statistics: information about WAL activity',
   proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
   proparallel => 'r', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int8,int8,numeric,int8,timestamptz}',
-  proargmodes => '{o,o,o,o,o}',
-  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,stats_reset}',
+  proallargtypes => '{int8,int8,numeric,int8,int8,float8,int8,float8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o}',
+  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,wal_write,wal_write_time,wal_sync,wal_sync_time,stats_reset}',
   prosrc => 'pg_stat_get_wal' },
 
 { oid => '2306', descr => 'statistics: information about SLRU caches',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 724068cf87..000bb14c0b 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -474,6 +474,12 @@ typedef struct PgStat_MsgWal
 	PgStat_Counter m_wal_fpi;
 	uint64		m_wal_bytes;
 	PgStat_Counter m_wal_buffers_full;
+	PgStat_Counter m_wal_write;
+	PgStat_Counter m_wal_write_time;	/* time spend writing wal records in
+										 * micro seconds */
+	PgStat_Counter m_wal_sync;
+	PgStat_Counter m_wal_sync_time; /* time spend syncing wal records in micro
+									 * seconds */
 } PgStat_MsgWal;
 
 /* ----------
@@ -839,6 +845,10 @@ typedef struct PgStat_WalStats
 	PgStat_Counter wal_fpi;
 	uint64		wal_bytes;
 	PgStat_Counter wal_buffers_full;
+	PgStat_Counter wal_write;
+	PgStat_Counter wal_write_time;
+	PgStat_Counter wal_sync;
+	PgStat_Counter wal_sync_time;
 	TimestampTz stat_reset_timestamp;
 } PgStat_WalStats;
 
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index b1c9b7bdfe..fce6ab8338 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2160,8 +2160,12 @@ pg_stat_wal| SELECT w.wal_records,
     w.wal_fpi,
     w.wal_bytes,
     w.wal_buffers_full,
+    w.wal_write,
+    w.wal_write_time,
+    w.wal_sync,
+    w.wal_sync_time,
     w.stats_reset
-   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, stats_reset);
+   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, wal_write, wal_write_time, wal_sync, wal_sync_time, stats_reset);
 pg_stat_wal_receiver| SELECT s.pid,
     s.status,
     s.receive_start_lsn,
-- 
2.25.1

v11_v12_0001.difftext/x-diff; name=v11_v12_0001.diffDownload

--- v11-0001-Add-statistics-related-to-write-sync-wal-records.patch	2021-03-03 14:27:02.119713795 +0900
+++ v12-0001-Add-statistics-related-to-write-sync-wal-records.patch	2021-03-03 20:22:47.258763274 +0900
@@ -1,4 +1,4 @@
-From c97fff03cd2bd51b28f6c6fde56c48792683f44e Mon Sep 17 00:00:00 2001
+From d870cfa78e501097cc56780ebb3140db6b9261e5 Mon Sep 17 00:00:00 2001
 From: Masahiro Ikeda <ikedamsh@oss.nttdata.com>
 Date: Fri, 12 Feb 2021 11:19:59 +0900
 Subject: [PATCH 1/2] Add statistics related to write/sync wal records.
@@ -30,10 +30,10 @@
  doc/src/sgml/config.sgml                      | 23 +++++++-
  doc/src/sgml/monitoring.sgml                  | 56 ++++++++++++++++++
  doc/src/sgml/wal.sgml                         | 12 +++-
- src/backend/access/transam/xlog.c             | 57 +++++++++++++++++++
+ src/backend/access/transam/xlog.c             | 59 +++++++++++++++++--
  src/backend/catalog/system_views.sql          |  4 ++
  src/backend/postmaster/pgstat.c               |  4 ++
- src/backend/postmaster/walwriter.c            | 16 ++++--
+ src/backend/postmaster/walwriter.c            | 21 +++++++
  src/backend/utils/adt/pgstatfuncs.c           | 24 +++++++-
  src/backend/utils/misc/guc.c                  |  9 +++
  src/backend/utils/misc/postgresql.conf.sample |  1 +
@@ -41,7 +41,7 @@
  src/include/catalog/pg_proc.dat               |  6 +-
  src/include/pgstat.h                          | 10 ++++
  src/test/regress/expected/rules.out           |  6 +-
- 14 files changed, 214 insertions(+), 15 deletions(-)
+ 14 files changed, 221 insertions(+), 15 deletions(-)
 
 diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
 index b5718fc136..c232f537b2 100644
@@ -182,7 +182,7 @@
     from having to do writes.  On such systems
     one should increase the number of <acronym>WAL</acronym> buffers by
 diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
-index fe56324439..98f558b4c7 100644
+index fe56324439..fafc1b7fac 100644
 --- a/src/backend/access/transam/xlog.c
 +++ b/src/backend/access/transam/xlog.c
 @@ -110,6 +110,7 @@ int			CommitDelay = 0;	/* precommit delay in microseconds */
@@ -232,51 +232,60 @@
  				if (written <= 0)
  				{
  					char		xlogfname[MAXFNAMELEN];
-@@ -10526,6 +10549,21 @@ void
+@@ -10526,6 +10549,24 @@ void
  issue_xlog_fsync(int fd, XLogSegNo segno)
  {
  	char	   *msg = NULL;
-+	bool		issue_fsync = false;
 +	instr_time	start;
 +
-+	/* Check whether the WAL file was synced to disk right now */
-+	if (enableFsync &&
-+		(sync_method == SYNC_METHOD_FSYNC ||
-+		 sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
-+		 sync_method == SYNC_METHOD_FDATASYNC))
-+	{
-+		/* Measure I/O timing to sync the WAL file */
-+		if (track_wal_io_timing)
-+			INSTR_TIME_SET_CURRENT(start);
-+
-+		issue_fsync = true;
-+	}
++	/*
++	 * Check whether the WAL file was synced to disk right now.
++	 *
++	 * If fsync is disabled, never issue fsync method.
++	 *
++	 * If sync_mothod is SYNC_METHOD_OPEN or SYNC_METHOD_OPEN_DSYNC, the WAL
++	 * file is already synced.
++	 */
++	if (!enableFsync ||
++		sync_method == SYNC_METHOD_OPEN ||
++		sync_method == SYNC_METHOD_OPEN_DSYNC)
++		return;
++
++	/* Measure I/O timing to sync the WAL file */
++	if (track_wal_io_timing)
++		INSTR_TIME_SET_CURRENT(start);
  
  	pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
  	switch (sync_method)
-@@ -10570,6 +10608,25 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
+@@ -10546,10 +10587,6 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
+ 				msg = _("could not fdatasync file \"%s\": %m");
+ 			break;
+ #endif
+-		case SYNC_METHOD_OPEN:
+-		case SYNC_METHOD_OPEN_DSYNC:
+-			/* write synced it already */
+-			break;
+ 		default:
+ 			elog(PANIC, "unrecognized wal_sync_method: %d", sync_method);
+ 			break;
+@@ -10570,6 +10607,20 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
  	}
  
  	pgstat_report_wait_end();
 +
 +	/*
 +	 * Increment the I/O timing and the number of times WAL files were synced.
-+	 *
-+	 * Check whether the WAL file was synced to disk right now because
-+	 * statistics must be incremented when syncing really occurred.
 +	 */
-+	if (issue_fsync)
++	if (track_wal_io_timing)
 +	{
-+		if (track_wal_io_timing)
-+		{
-+			instr_time	duration;
++		instr_time	duration;
 +
-+			INSTR_TIME_SET_CURRENT(duration);
-+			INSTR_TIME_SUBTRACT(duration, start);
-+			WalStats.m_wal_sync_time += INSTR_TIME_GET_MICROSEC(duration);
-+		}
-+		WalStats.m_wal_sync++;
++		INSTR_TIME_SET_CURRENT(duration);
++		INSTR_TIME_SUBTRACT(duration, start);
++		WalStats.m_wal_sync_time += INSTR_TIME_GET_MICROSEC(duration);
 +	}
++
++	WalStats.m_wal_sync++;
  }
  
  /*
@@ -311,38 +320,52 @@
  
  /* ----------
 diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
-index 4f1a8e356b..08fa7032c0 100644
+index 4f1a8e356b..8491e6f6d6 100644
 --- a/src/backend/postmaster/walwriter.c
 +++ b/src/backend/postmaster/walwriter.c
-@@ -223,6 +223,7 @@ WalWriterMain(void)
+@@ -78,6 +78,11 @@ int			WalWriterFlushAfter = 128;
+ #define LOOPS_UNTIL_HIBERNATE		50
+ #define HIBERNATE_FACTOR			25
+ 
++/*
++ * Minimum time between stats file updates; in milliseconds.
++ */
++#define PGSTAT_STAT_INTERVAL	500
++
+ /*
+  * Main entry point for walwriter process
+  *
+@@ -222,7 +227,12 @@ WalWriterMain(void)
+ 	 */
  	for (;;)
  	{
++		/* we assume this inits to all zeroes: */
++		static TimestampTz last_report = 0;
++
  		long		cur_timeout;
 +		int			rc;
++		TimestampTz now;
  
  		/*
  		 * Advertise whether we might hibernate in this cycle.  We do this
-@@ -263,9 +264,16 @@ WalWriterMain(void)
- 		else
- 			cur_timeout = WalWriterDelay * HIBERNATE_FACTOR;
- 
--		(void) WaitLatch(MyLatch,
--						 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
--						 cur_timeout,
--						 WAIT_EVENT_WAL_WRITER_MAIN);
-+		rc = WaitLatch(MyLatch,
-+					   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
-+					   cur_timeout,
-+					   WAIT_EVENT_WAL_WRITER_MAIN);
-+
+@@ -253,6 +263,17 @@ WalWriterMain(void)
+ 		else if (left_till_hibernate > 0)
+ 			left_till_hibernate--;
+ 
 +		/*
-+		 * Send WAL statistics only if WalWriterDelay has elapsed to minimize
-+		 * the overhead in WAL-writing.
++		 * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
++		 * msec since we last sent one
 +		 */
-+		if (rc & WL_TIMEOUT)
++		now = GetCurrentTimestamp();
++		if (TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
++		{
 +			pgstat_send_wal();
- 	}
- }
++			last_report = now;
++		}
++
+ 		/*
+ 		 * Sleep until we are signaled or WalWriterDelay has elapsed.  If we
+ 		 * haven't done anything useful for quite some time, lengthen the
 diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
 index 62bff52638..7296ef04ff 100644
 --- a/src/backend/utils/adt/pgstatfuncs.c

#39

Masahiro Ikeda

ikedamsh@oss.nttdata.com

almost 5 years ago

In reply to: Masahiro Ikeda (#38)

2 attachment(s)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021-03-03 20:27, Masahiro Ikeda wrote:

On 2021-03-03 16:30, Fujii Masao wrote:
On 2021/03/03 14:33, Masahiro Ikeda wrote:

On 2021-02-24 16:14, Fujii Masao wrote:

On 2021/02/15 11:59, Masahiro Ikeda wrote:

On 2021-02-10 00:51, David G. Johnston wrote:

On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

I pgindented the patches.

... <function>XLogWrite</function>, which is invoked during an
<function>XLogFlush</function> request (see ...). This is also
incremented by the WAL receiver during replication.

("which normally called" should be "which is normally called" or
"which normally is called" if you want to keep true to the
original)
You missed the adding the space before an opening parenthesis here
and
elsewhere (probably copy-paste)

is ether -> is either
"This parameter is off by default as it will repeatedly query the
operating system..."
", because" -> "as"

Thanks, I fixed them.

wal_write_time and the sync items also need the note: "This is
also
incremented by the WAL receiver during replication."

I skipped changing it since I separated the stats for the WAL
receiver
in pg_stat_wal_receiver.

"The number of times it happened..." -> " (the tally of this event
is
reported in wal_buffers_full in....) This is undesirable because
..."

Thanks, I fixed it.

I notice that the patch for WAL receiver doesn't require
explicitly
computing the sync statistics but does require computing the write
statistics. This is because of the presence of issue_xlog_fsync
but
absence of an equivalent pg_xlog_pwrite. Additionally, I observe
that
the XLogWrite code path calls pgstat_report_wait_*() while the WAL
receiver path does not. It seems technically straight-forward to
refactor here to avoid the almost-duplicated logic in the two
places,
though I suspect there may be a trade-off for not adding another
function call to the stack given the importance of WAL processing
(though that seems marginalized compared to the cost of actually
writing the WAL). Or, as Fujii noted, go the other way and don't
have
any shared code between the two but instead implement the WAL
receiver
one to use pg_stat_wal_receiver instead. In either case, this
half-and-half implementation seems undesirable.

OK, as Fujii-san mentioned, I separated the WAL receiver stats.
(v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)

Thanks for updating the patches!

I added the infrastructure code to communicate the WAL receiver
stats messages between the WAL receiver and the stats collector,
and
the stats for WAL receiver is counted in pg_stat_wal_receiver.
What do you think?

On second thought, this idea seems not good. Because those stats are
collected between multiple walreceivers, but other values in
pg_stat_wal_receiver is only related to the walreceiver process
running
at that moment. IOW, it seems strange that some values show dynamic
stats and the others show collected stats, even though they are in
the same view pg_stat_wal_receiver. Thought?

OK, I fixed it.
The stats collected in the WAL receiver is exposed in pg_stat_wal
view in v11 patch.

Thanks for updating the patches! I'm now reading 001 patch.
+	/* Check whether the WAL file was synced to disk right now */
+	if (enableFsync &&
+		(sync_method == SYNC_METHOD_FSYNC ||
+		 sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
+		 sync_method == SYNC_METHOD_FDATASYNC))
+	{
Isn't it better to make issue_xlog_fsync() return immediately
if enableFsync is off, sync_method is open_sync or open_data_sync,
to simplify the code more?
Thanks for the comments.
I added the above code in v12 patch.
+		/*
+		 * Send WAL statistics only if WalWriterDelay has elapsed to 
minimize
+		 * the overhead in WAL-writing.
+		 */
+		if (rc & WL_TIMEOUT)
+			pgstat_send_wal();
On second thought, this change means that it always takes
wal_writer_delay
before walwriter's WAL stats is sent after XLogBackgroundFlush() is
called.
For example, if wal_writer_delay is set to several seconds, some
values in
pg_stat_wal would be not up-to-date meaninglessly for those seconds.
So I'm thinking to withdraw my previous comment and it's ok to send
the stats every after XLogBackgroundFlush() is called. Thought?
Thanks, I didn't notice that.

Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's
default value is 200msec and it may be set shorter time.

Why don't to make another way to check the timestamp?
+               /*
+                * Don't send a message unless it's been at least
PGSTAT_STAT_INTERVAL
+                * msec since we last sent one
+                */
+               now = GetCurrentTimestamp();
+               if (TimestampDifferenceExceeds(last_report, now,
PGSTAT_STAT_INTERVAL))
+               {
+                       pgstat_send_wal();
+                       last_report = now;
+               }
+
Although I worried that it's better to add the check code in
pgstat_send_wal(),
I didn't do so because to avoid to double check PGSTAT_STAT_INTERVAL.
pgstat_send_wal() is invoked pg_report_stat() and it already checks the
PGSTAT_STAT_INTERVAL.

I forgot to remove an unused variable.
The attached v13 patch is fixed.

Regards
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachments:

v13-0001-Add-statistics-related-to-write-sync-wal-records.patchtext/x-diff; name=v13-0001-Add-statistics-related-to-write-sync-wal-records.patchDownload

From f2353a6226ca900d1689829824d0070bbf02f42b Mon Sep 17 00:00:00 2001
From: Masahiro Ikeda <ikedamsh@oss.nttdata.com>
Date: Fri, 12 Feb 2021 11:19:59 +0900
Subject: [PATCH 1/2] Add statistics related to write/sync wal records.

This patch adds following statistics to pg_stat_wal view
to track WAL I/O activity by backends and background processes
except WAL receiver.

- the total number of times writing/syncing WAL data.
- the total amount of time spent writing/syncing WAL data.

Since to track I/O timing may leads significant overhead,
GUC parameter "track_wal_io_timing" is introduced.
Only if this is on, the I/O timing is measured.

The statistics related to sync are zero when "wal_sync_method"
is "open_datasync" or "open_sync", because it doesn't call each
sync method.

(This requires a catversion bump, as well as an update to
 PGSTAT_FILE_FORMAT_ID)

Author: Masahiro Ikeda
Reviewed-By: Japin Li, Hayato Kuroda, Masahiko Sawada, David
Johnston, Fujii Masao
Discussion:
https://postgr.es/m/0509ad67b585a5b86a83d445dfa75392@oss.nttdata.co
---
 doc/src/sgml/config.sgml                      | 23 +++++++-
 doc/src/sgml/monitoring.sgml                  | 56 ++++++++++++++++++
 doc/src/sgml/wal.sgml                         | 12 +++-
 src/backend/access/transam/xlog.c             | 59 +++++++++++++++++--
 src/backend/catalog/system_views.sql          |  4 ++
 src/backend/postmaster/pgstat.c               |  4 ++
 src/backend/postmaster/walwriter.c            | 20 +++++++
 src/backend/utils/adt/pgstatfuncs.c           | 24 +++++++-
 src/backend/utils/misc/guc.c                  |  9 +++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/access/xlog.h                     |  1 +
 src/include/catalog/pg_proc.dat               |  6 +-
 src/include/pgstat.h                          | 10 ++++
 src/test/regress/expected/rules.out           |  6 +-
 14 files changed, 220 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b5718fc136..c232f537b2 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7439,7 +7439,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       <listitem>
        <para>
         Enables timing of database I/O calls.  This parameter is off by
-        default, because it will repeatedly query the operating system for
+        default, as it will repeatedly query the operating system for
         the current time, which may cause significant overhead on some
         platforms.  You can use the <xref linkend="pgtesttiming"/> tool to
         measure the overhead of timing on your system.
@@ -7453,6 +7453,27 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-track-wal-io-timing" xreflabel="track_wal_io_timing">
+      <term><varname>track_wal_io_timing</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>track_wal_io_timing</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables timing of WAL I/O calls. This parameter is off by default,
+        as it will repeatedly query the operating system for
+        the current time, which may cause significant overhead on some
+        platforms.  You can use the <xref linkend="pgtesttiming"/> tool to
+        measure the overhead of timing on your system.
+        I/O timing information is
+        displayed in <link linkend="monitoring-pg-stat-wal-view">
+        <structname>pg_stat_wal</structname></link>.  Only superusers can
+        change this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-track-functions" xreflabel="track_functions">
       <term><varname>track_functions</varname> (<type>enum</type>)
       <indexterm>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3513e127b7..a16be45a71 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3487,6 +3487,62 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </para></entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of times WAL buffers were written out to disk via
+       <function>XLogWrite</function>, which is invoked during an
+       <function>XLogFlush</function> request (see <xref linkend="wal-configuration"/>)
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time spent writing WAL buffers were written out to disk via
+       <function>XLogWrite</function>, which is invoked during an
+       <function>XLogFlush</function> request (see <xref linkend="wal-configuration"/>),
+       excluding sync time unless 
+       <xref linkend="guc-wal-sync-method"/> is either <literal>open_datasync</literal> or 
+       <literal>open_sync</literal>. Units are in milliseconds with microsecond resolution.
+       This is zero when <xref linkend="guc-track-wal-io-timing"/> is disabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of times WAL files were synced to disk via 
+       <function>issue_xlog_fsync</function>, which is invoked during an 
+       <function>XLogFlush</function> request (see <xref linkend="wal-configuration"/>)
+       while <xref linkend="guc-wal-sync-method"/> was set to one of the 
+       "sync at commit" options (i.e., <literal>fdatasync</literal>, 
+       <literal>fsync</literal>, or <literal>fsync_writethrough</literal>).
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time spent syncing WAL files to disk via
+       <function>issue_xlog_fsync</function>, which is invoked during an 
+       <function>XLogFlush</function> request (see <xref linkend="wal-configuration"/>)
+       while <xref linkend="guc-wal-sync-method"/> was set to one of the 
+       "sync at commit" options (i.e., <literal>fdatasync</literal>, 
+       <literal>fsync</literal>, or <literal>fsync_writethrough</literal>).
+       Units are in milliseconds with microsecond resolution.
+       This is zero when <xref linkend="guc-track-wal-io-timing"/> is disabled.
+      </para></entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index f75527f764..06e4b37012 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -663,7 +663,9 @@
    the <acronym>WAL</acronym> buffers in shared memory. If there is no
    space for the new record, <function>XLogInsertRecord</function> will have
    to write (move to kernel cache) a few filled <acronym>WAL</acronym>
-   buffers. This is undesirable because <function>XLogInsertRecord</function>
+   buffers (the tally of this event is reported in 
+   <literal>wal_buffers_full</literal> in <xref linkend="pg-stat-wal-view"/>).
+   This is undesirable because <function>XLogInsertRecord</function> 
    is used on every database low level modification (for example, row
    insertion) at a time when an exclusive lock is held on affected
    data pages, so the operation needs to be as fast as possible.  What
@@ -672,8 +674,12 @@
    time. Normally, <acronym>WAL</acronym> buffers should be written
    and flushed by an <function>XLogFlush</function> request, which is
    made, for the most part, at transaction commit time to ensure that
-   transaction records are flushed to permanent storage. On systems
-   with high log output, <function>XLogFlush</function> requests might
+   transaction records are flushed to permanent storage. 
+   <function>XLogFlush</function> calls <function>XLogWrite</function> to write 
+   and <function>issue_xlog_fsync</function> to flush them, which are counted as 
+   <literal>wal_write</literal> and <literal>wal_sync</literal> in 
+   <xref linkend="pg-stat-wal-view"/>. On systems with high log output, 
+   <function>XLogFlush</function> requests might
    not occur often enough to prevent <function>XLogInsertRecord</function>
    from having to do writes.  On such systems
    one should increase the number of <acronym>WAL</acronym> buffers by
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fe56324439..fafc1b7fac 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -110,6 +110,7 @@ int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
 int			max_slot_wal_keep_size_mb = -1;
+bool		track_wal_io_timing = false;
 
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
@@ -2533,6 +2534,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			int			written;
+			instr_time	start;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
@@ -2541,9 +2543,30 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			do
 			{
 				errno = 0;
+
+				/* Measure I/O timing to write WAL data */
+				if (track_wal_io_timing)
+					INSTR_TIME_SET_CURRENT(start);
+
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 				written = pg_pwrite(openLogFile, from, nleft, startoffset);
 				pgstat_report_wait_end();
+
+				/*
+				 * Increment the I/O timing and the number of times WAL data
+				 * were written out to disk.
+				 */
+				if (track_wal_io_timing)
+				{
+					instr_time	duration;
+
+					INSTR_TIME_SET_CURRENT(duration);
+					INSTR_TIME_SUBTRACT(duration, start);
+					WalStats.m_wal_write_time += INSTR_TIME_GET_MICROSEC(duration);
+				}
+
+				WalStats.m_wal_write++;
+
 				if (written <= 0)
 				{
 					char		xlogfname[MAXFNAMELEN];
@@ -10526,6 +10549,24 @@ void
 issue_xlog_fsync(int fd, XLogSegNo segno)
 {
 	char	   *msg = NULL;
+	instr_time	start;
+
+	/*
+	 * Check whether the WAL file was synced to disk right now.
+	 *
+	 * If fsync is disabled, never issue fsync method.
+	 *
+	 * If sync_mothod is SYNC_METHOD_OPEN or SYNC_METHOD_OPEN_DSYNC, the WAL
+	 * file is already synced.
+	 */
+	if (!enableFsync ||
+		sync_method == SYNC_METHOD_OPEN ||
+		sync_method == SYNC_METHOD_OPEN_DSYNC)
+		return;
+
+	/* Measure I/O timing to sync the WAL file */
+	if (track_wal_io_timing)
+		INSTR_TIME_SET_CURRENT(start);
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
 	switch (sync_method)
@@ -10546,10 +10587,6 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
 				msg = _("could not fdatasync file \"%s\": %m");
 			break;
 #endif
-		case SYNC_METHOD_OPEN:
-		case SYNC_METHOD_OPEN_DSYNC:
-			/* write synced it already */
-			break;
 		default:
 			elog(PANIC, "unrecognized wal_sync_method: %d", sync_method);
 			break;
@@ -10570,6 +10607,20 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
 	}
 
 	pgstat_report_wait_end();
+
+	/*
+	 * Increment the I/O timing and the number of times WAL files were synced.
+	 */
+	if (track_wal_io_timing)
+	{
+		instr_time	duration;
+
+		INSTR_TIME_SET_CURRENT(duration);
+		INSTR_TIME_SUBTRACT(duration, start);
+		WalStats.m_wal_sync_time += INSTR_TIME_GET_MICROSEC(duration);
+	}
+
+	WalStats.m_wal_sync++;
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fc94a73a54..ba5158ba57 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1005,6 +1005,10 @@ CREATE VIEW pg_stat_wal AS
         w.wal_fpi,
         w.wal_bytes,
         w.wal_buffers_full,
+        w.wal_write,
+        w.wal_write_time,
+        w.wal_sync,
+        w.wal_sync_time,
         w.stats_reset
     FROM pg_stat_get_wal() w;
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f75b52719d..987bbd058d 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -6892,6 +6892,10 @@ pgstat_recv_wal(PgStat_MsgWal *msg, int len)
 	walStats.wal_fpi += msg->m_wal_fpi;
 	walStats.wal_bytes += msg->m_wal_bytes;
 	walStats.wal_buffers_full += msg->m_wal_buffers_full;
+	walStats.wal_write += msg->m_wal_write;
+	walStats.wal_write_time += msg->m_wal_write_time;
+	walStats.wal_sync += msg->m_wal_sync;
+	walStats.wal_sync_time += msg->m_wal_sync_time;
 }
 
 /* ----------
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index 4f1a8e356b..95cad4a9ac 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -78,6 +78,11 @@ int			WalWriterFlushAfter = 128;
 #define LOOPS_UNTIL_HIBERNATE		50
 #define HIBERNATE_FACTOR			25
 
+/*
+ * Minimum time between stats file updates; in milliseconds.
+ */
+#define PGSTAT_STAT_INTERVAL	500
+
 /*
  * Main entry point for walwriter process
  *
@@ -222,7 +227,11 @@ WalWriterMain(void)
 	 */
 	for (;;)
 	{
+		/* we assume this inits to all zeroes: */
+		static TimestampTz last_report = 0;
+
 		long		cur_timeout;
+		TimestampTz now;
 
 		/*
 		 * Advertise whether we might hibernate in this cycle.  We do this
@@ -253,6 +262,17 @@ WalWriterMain(void)
 		else if (left_till_hibernate > 0)
 			left_till_hibernate--;
 
+		/*
+		 * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
+		 * msec since we last sent one
+		 */
+		now = GetCurrentTimestamp();
+		if (TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
+		{
+			pgstat_send_wal();
+			last_report = now;
+		}
+
 		/*
 		 * Sleep until we are signaled or WalWriterDelay has elapsed.  If we
 		 * haven't done anything useful for quite some time, lengthen the
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 62bff52638..7296ef04ff 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1799,7 +1799,7 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_wal(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_COLS	5
+#define PG_STAT_GET_WAL_COLS	9
 	TupleDesc	tupdesc;
 	Datum		values[PG_STAT_GET_WAL_COLS];
 	bool		nulls[PG_STAT_GET_WAL_COLS];
@@ -1820,7 +1820,15 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 					   NUMERICOID, -1, 0);
 	TupleDescInitEntry(tupdesc, (AttrNumber) 4, "wal_buffers_full",
 					   INT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 5, "stats_reset",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 5, "wal_write",
+					   INT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 6, "wal_write_time",
+					   FLOAT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 7, "wal_sync",
+					   INT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 8, "wal_sync_time",
+					   FLOAT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 9, "stats_reset",
 					   TIMESTAMPTZOID, -1, 0);
 
 	BlessTupleDesc(tupdesc);
@@ -1840,7 +1848,17 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 									Int32GetDatum(-1));
 
 	values[3] = Int64GetDatum(wal_stats->wal_buffers_full);
-	values[4] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
+	values[4] = Int64GetDatum(wal_stats->wal_write);
+
+	/* convert counter from microsec to millisec for display */
+	values[5] = Float8GetDatum((double) wal_stats->wal_write_time / 1000.0);
+
+	values[6] = Int64GetDatum(wal_stats->wal_sync);
+
+	/* convert counter from microsec to millisec for display */
+	values[7] = Float8GetDatum((double) wal_stats->wal_sync_time / 1000.0);
+
+	values[8] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
 
 	/* Returns the record as Datum */
 	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index c5b7f01e1a..958e84a962 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1485,6 +1485,15 @@ static struct config_bool ConfigureNamesBool[] =
 		false,
 		NULL, NULL, NULL
 	},
+	{
+		{"track_wal_io_timing", PGC_SUSET, STATS_COLLECTOR,
+			gettext_noop("Collects timing statistics for WAL I/O activity."),
+			NULL
+		},
+		&track_wal_io_timing,
+		false,
+		NULL, NULL, NULL
+	},
 
 	{
 		{"update_process_title", PGC_SUSET, PROCESS_TITLE,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index ee06528bb0..c6483fa1ff 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -586,6 +586,7 @@
 #track_activities = on
 #track_counts = on
 #track_io_timing = off
+#track_wal_io_timing = off
 #track_functions = none			# none, pl, all
 #track_activity_query_size = 1024	# (change requires restart)
 #stats_temp_directory = 'pg_stat_tmp'
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 75ec1073bd..1e53d9d4ca 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -131,6 +131,7 @@ extern int	recovery_min_apply_delay;
 extern char *PrimaryConnInfo;
 extern char *PrimarySlotName;
 extern bool wal_receiver_create_temp_slot;
+extern bool track_wal_io_timing;
 
 /* indirectly set via GUC system */
 extern TransactionId recoveryTargetXid;
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 447f9ae44d..91ac0479c2 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5543,9 +5543,9 @@
 { oid => '1136', descr => 'statistics: information about WAL activity',
   proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
   proparallel => 'r', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int8,int8,numeric,int8,timestamptz}',
-  proargmodes => '{o,o,o,o,o}',
-  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,stats_reset}',
+  proallargtypes => '{int8,int8,numeric,int8,int8,float8,int8,float8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o}',
+  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,wal_write,wal_write_time,wal_sync,wal_sync_time,stats_reset}',
   prosrc => 'pg_stat_get_wal' },
 
 { oid => '2306', descr => 'statistics: information about SLRU caches',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 724068cf87..000bb14c0b 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -474,6 +474,12 @@ typedef struct PgStat_MsgWal
 	PgStat_Counter m_wal_fpi;
 	uint64		m_wal_bytes;
 	PgStat_Counter m_wal_buffers_full;
+	PgStat_Counter m_wal_write;
+	PgStat_Counter m_wal_write_time;	/* time spend writing wal records in
+										 * micro seconds */
+	PgStat_Counter m_wal_sync;
+	PgStat_Counter m_wal_sync_time; /* time spend syncing wal records in micro
+									 * seconds */
 } PgStat_MsgWal;
 
 /* ----------
@@ -839,6 +845,10 @@ typedef struct PgStat_WalStats
 	PgStat_Counter wal_fpi;
 	uint64		wal_bytes;
 	PgStat_Counter wal_buffers_full;
+	PgStat_Counter wal_write;
+	PgStat_Counter wal_write_time;
+	PgStat_Counter wal_sync;
+	PgStat_Counter wal_sync_time;
 	TimestampTz stat_reset_timestamp;
 } PgStat_WalStats;
 
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index b1c9b7bdfe..fce6ab8338 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2160,8 +2160,12 @@ pg_stat_wal| SELECT w.wal_records,
     w.wal_fpi,
     w.wal_bytes,
     w.wal_buffers_full,
+    w.wal_write,
+    w.wal_write_time,
+    w.wal_sync,
+    w.wal_sync_time,
     w.stats_reset
-   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, stats_reset);
+   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, wal_write, wal_write_time, wal_sync, wal_sync_time, stats_reset);
 pg_stat_wal_receiver| SELECT s.pid,
     s.status,
     s.receive_start_lsn,
-- 
2.25.1

v12_v13.difftext/x-diff; name=v12_v13.diffDownload

--- v12-0001-Add-statistics-related-to-write-sync-wal-records.patch	2021-03-03 20:22:47.258763274 +0900
+++ v13-0001-Add-statistics-related-to-write-sync-wal-records.patch	2021-03-04 16:12:06.702717779 +0900
@@ -1,4 +1,4 @@
-From d870cfa78e501097cc56780ebb3140db6b9261e5 Mon Sep 17 00:00:00 2001
+From f2353a6226ca900d1689829824d0070bbf02f42b Mon Sep 17 00:00:00 2001
 From: Masahiro Ikeda <ikedamsh@oss.nttdata.com>
 Date: Fri, 12 Feb 2021 11:19:59 +0900
 Subject: [PATCH 1/2] Add statistics related to write/sync wal records.
@@ -33,7 +33,7 @@
  src/backend/access/transam/xlog.c             | 59 +++++++++++++++++--
  src/backend/catalog/system_views.sql          |  4 ++
  src/backend/postmaster/pgstat.c               |  4 ++
- src/backend/postmaster/walwriter.c            | 21 +++++++
+ src/backend/postmaster/walwriter.c            | 20 +++++++
  src/backend/utils/adt/pgstatfuncs.c           | 24 +++++++-
  src/backend/utils/misc/guc.c                  |  9 +++
  src/backend/utils/misc/postgresql.conf.sample |  1 +
@@ -41,7 +41,7 @@
  src/include/catalog/pg_proc.dat               |  6 +-
  src/include/pgstat.h                          | 10 ++++
  src/test/regress/expected/rules.out           |  6 +-
- 14 files changed, 221 insertions(+), 15 deletions(-)
+ 14 files changed, 220 insertions(+), 15 deletions(-)
 
 diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
 index b5718fc136..c232f537b2 100644
@@ -320,7 +320,7 @@
  
  /* ----------
 diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
-index 4f1a8e356b..8491e6f6d6 100644
+index 4f1a8e356b..95cad4a9ac 100644
 --- a/src/backend/postmaster/walwriter.c
 +++ b/src/backend/postmaster/walwriter.c
 @@ -78,6 +78,11 @@ int			WalWriterFlushAfter = 128;
@@ -335,7 +335,7 @@
  /*
   * Main entry point for walwriter process
   *
-@@ -222,7 +227,12 @@ WalWriterMain(void)
+@@ -222,7 +227,11 @@ WalWriterMain(void)
  	 */
  	for (;;)
  	{
@@ -343,12 +343,11 @@
 +		static TimestampTz last_report = 0;
 +
  		long		cur_timeout;
-+		int			rc;
 +		TimestampTz now;
  
  		/*
  		 * Advertise whether we might hibernate in this cycle.  We do this
-@@ -253,6 +263,17 @@ WalWriterMain(void)
+@@ -253,6 +262,17 @@ WalWriterMain(void)
  		else if (left_till_hibernate > 0)
  			left_till_hibernate--;

#40

Ibrar Ahmed

ibrar.ahmad@gmail.com

almost 5 years ago

In reply to: Masahiro Ikeda (#39)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On Thu, Mar 4, 2021 at 12:14 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com>
wrote:

On 2021-03-03 20:27, Masahiro Ikeda wrote:
On 2021-03-03 16:30, Fujii Masao wrote:
On 2021/03/03 14:33, Masahiro Ikeda wrote:

On 2021-02-24 16:14, Fujii Masao wrote:

On 2021/02/15 11:59, Masahiro Ikeda wrote:

On 2021-02-10 00:51, David G. Johnston wrote:

On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

I pgindented the patches.

... <function>XLogWrite</function>, which is invoked during an
<function>XLogFlush</function> request (see ...). This is also
incremented by the WAL receiver during replication.

("which normally called" should be "which is normally called" or
"which normally is called" if you want to keep true to the
original)
You missed the adding the space before an opening parenthesis here
and
elsewhere (probably copy-paste)

is ether -> is either
"This parameter is off by default as it will repeatedly query the
operating system..."
", because" -> "as"

Thanks, I fixed them.

wal_write_time and the sync items also need the note: "This is
also
incremented by the WAL receiver during replication."

I skipped changing it since I separated the stats for the WAL
receiver
in pg_stat_wal_receiver.

"The number of times it happened..." -> " (the tally of this event
is
reported in wal_buffers_full in....) This is undesirable because
..."

Thanks, I fixed it.

I notice that the patch for WAL receiver doesn't require
explicitly
computing the sync statistics but does require computing the write
statistics. This is because of the presence of issue_xlog_fsync
but
absence of an equivalent pg_xlog_pwrite. Additionally, I observe
that
the XLogWrite code path calls pgstat_report_wait_*() while the WAL
receiver path does not. It seems technically straight-forward to
refactor here to avoid the almost-duplicated logic in the two
places,
though I suspect there may be a trade-off for not adding another
function call to the stack given the importance of WAL processing
(though that seems marginalized compared to the cost of actually
writing the WAL). Or, as Fujii noted, go the other way and don't
have
any shared code between the two but instead implement the WAL
receiver
one to use pg_stat_wal_receiver instead. In either case, this
half-and-half implementation seems undesirable.

OK, as Fujii-san mentioned, I separated the WAL receiver stats.
(v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)

Thanks for updating the patches!

I added the infrastructure code to communicate the WAL receiver
stats messages between the WAL receiver and the stats collector,
and
the stats for WAL receiver is counted in pg_stat_wal_receiver.
What do you think?

On second thought, this idea seems not good. Because those stats are
collected between multiple walreceivers, but other values in
pg_stat_wal_receiver is only related to the walreceiver process
running
at that moment. IOW, it seems strange that some values show dynamic
stats and the others show collected stats, even though they are in
the same view pg_stat_wal_receiver. Thought?

OK, I fixed it.
The stats collected in the WAL receiver is exposed in pg_stat_wal
view in v11 patch.

Thanks for updating the patches! I'm now reading 001 patch.
+    /* Check whether the WAL file was synced to disk right now */
+    if (enableFsync &&
+            (sync_method == SYNC_METHOD_FSYNC ||
+             sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
+             sync_method == SYNC_METHOD_FDATASYNC))
+    {
Isn't it better to make issue_xlog_fsync() return immediately
if enableFsync is off, sync_method is open_sync or open_data_sync,
to simplify the code more?
Thanks for the comments.
I added the above code in v12 patch.
+            /*
+             * Send WAL statistics only if WalWriterDelay has elapsed
to
minimize
+             * the overhead in WAL-writing.
+             */
+            if (rc & WL_TIMEOUT)
+                    pgstat_send_wal();
On second thought, this change means that it always takes
wal_writer_delay
before walwriter's WAL stats is sent after XLogBackgroundFlush() is
called.
For example, if wal_writer_delay is set to several seconds, some
values in
pg_stat_wal would be not up-to-date meaninglessly for those seconds.
So I'm thinking to withdraw my previous comment and it's ok to send
the stats every after XLogBackgroundFlush() is called. Thought?
Thanks, I didn't notice that.

Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's
default value is 200msec and it may be set shorter time.

Why don't to make another way to check the timestamp?
+               /*
+                * Don't send a message unless it's been at least
PGSTAT_STAT_INTERVAL
+                * msec since we last sent one
+                */
+               now = GetCurrentTimestamp();
+               if (TimestampDifferenceExceeds(last_report, now,
PGSTAT_STAT_INTERVAL))
+               {
+                       pgstat_send_wal();
+                       last_report = now;
+               }
+
Although I worried that it's better to add the check code in
pgstat_send_wal(),
I didn't do so because to avoid to double check PGSTAT_STAT_INTERVAL.
pgstat_send_wal() is invoked pg_report_stat() and it already checks the
PGSTAT_STAT_INTERVAL.
I forgot to remove an unused variable.
The attached v13 patch is fixed.

Regards
--
Masahiro Ikeda
NTT DATA CORPORATION

This patch set no longer applies
http://cfbot.cputube.org/patch_32_2859.log

Can we get a rebase?

I am marking the patch "Waiting on Author"

--
Ibrar Ahmed

#41

Fujii Masao

masao.fujii@oss.nttdata.com

almost 5 years ago

In reply to: Masahiro Ikeda (#39)

1 attachment(s)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021/03/04 16:14, Masahiro Ikeda wrote:

On 2021-03-03 20:27, Masahiro Ikeda wrote:
On 2021-03-03 16:30, Fujii Masao wrote:
On 2021/03/03 14:33, Masahiro Ikeda wrote:

On 2021-02-24 16:14, Fujii Masao wrote:

On 2021/02/15 11:59, Masahiro Ikeda wrote:

On 2021-02-10 00:51, David G. Johnston wrote:

On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

I pgindented the patches.

... <function>XLogWrite</function>, which is invoked during an
<function>XLogFlush</function> request (see ...). This is also
incremented by the WAL receiver during replication.

("which normally called" should be "which is normally called" or
"which normally is called" if you want to keep true to the original)
You missed the adding the space before an opening parenthesis here and
elsewhere (probably copy-paste)

is ether -> is either
"This parameter is off by default as it will repeatedly query the
operating system..."
", because" -> "as"

Thanks, I fixed them.

wal_write_time and the sync items also need the note: "This is also
incremented by the WAL receiver during replication."

I skipped changing it since I separated the stats for the WAL receiver
in pg_stat_wal_receiver.

"The number of times it happened..." -> " (the tally of this event is
reported in wal_buffers_full in....) This is undesirable because ..."

Thanks, I fixed it.

I notice that the patch for WAL receiver doesn't require explicitly
computing the sync statistics but does require computing the write
statistics. This is because of the presence of issue_xlog_fsync but
absence of an equivalent pg_xlog_pwrite. Additionally, I observe that
the XLogWrite code path calls pgstat_report_wait_*() while the WAL
receiver path does not. It seems technically straight-forward to
refactor here to avoid the almost-duplicated logic in the two places,
though I suspect there may be a trade-off for not adding another
function call to the stack given the importance of WAL processing
(though that seems marginalized compared to the cost of actually
writing the WAL). Or, as Fujii noted, go the other way and don't have
any shared code between the two but instead implement the WAL receiver
one to use pg_stat_wal_receiver instead. In either case, this
half-and-half implementation seems undesirable.

OK, as Fujii-san mentioned, I separated the WAL receiver stats.
(v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)

Thanks for updating the patches!

I added the infrastructure code to communicate the WAL receiver stats messages between the WAL receiver and the stats collector, and
the stats for WAL receiver is counted in pg_stat_wal_receiver.
What do you think?

On second thought, this idea seems not good. Because those stats are
collected between multiple walreceivers, but other values in
pg_stat_wal_receiver is only related to the walreceiver process running
at that moment. IOW, it seems strange that some values show dynamic
stats and the others show collected stats, even though they are in
the same view pg_stat_wal_receiver. Thought?

OK, I fixed it.
The stats collected in the WAL receiver is exposed in pg_stat_wal view in v11 patch.

Thanks for updating the patches! I'm now reading 001 patch.
+    /* Check whether the WAL file was synced to disk right now */
+    if (enableFsync &&
+        (sync_method == SYNC_METHOD_FSYNC ||
+         sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
+         sync_method == SYNC_METHOD_FDATASYNC))
+    {
Isn't it better to make issue_xlog_fsync() return immediately
if enableFsync is off, sync_method is open_sync or open_data_sync,
to simplify the code more?
Thanks for the comments.
I added the above code in v12 patch.
+        /*
+         * Send WAL statistics only if WalWriterDelay has elapsed to minimize
+         * the overhead in WAL-writing.
+         */
+        if (rc & WL_TIMEOUT)
+            pgstat_send_wal();
On second thought, this change means that it always takes wal_writer_delay
before walwriter's WAL stats is sent after XLogBackgroundFlush() is called.
For example, if wal_writer_delay is set to several seconds, some values in
pg_stat_wal would be not up-to-date meaninglessly for those seconds.
So I'm thinking to withdraw my previous comment and it's ok to send
the stats every after XLogBackgroundFlush() is called. Thought?
Thanks, I didn't notice that.

Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's
default value is 200msec and it may be set shorter time.

Yeah, if wal_writer_delay is set to very small value, there is a risk
that the WAL stats are sent too frequently. I agree that's a problem.

Why don't to make another way to check the timestamp?

+               /*
+                * Don't send a message unless it's been at least
PGSTAT_STAT_INTERVAL
+                * msec since we last sent one
+                */
+               now = GetCurrentTimestamp();
+               if (TimestampDifferenceExceeds(last_report, now,
PGSTAT_STAT_INTERVAL))
+               {
+                       pgstat_send_wal();
+                       last_report = now;
+               }
+

Although I worried that it's better to add the check code in pgstat_send_wal(),

Agreed.

I didn't do so because to avoid to double check PGSTAT_STAT_INTERVAL.
pgstat_send_wal() is invoked pg_report_stat() and it already checks the
PGSTAT_STAT_INTERVAL.

I think that we can do that. What about the attached patch?

I forgot to remove an unused variable.
The attached v13 patch is fixed.

Thanks for updating the patch!

+        w.wal_write,
+        w.wal_write_time,
+        w.wal_sync,
+        w.wal_sync_time,

It's more natural to put wal_write_time and wal_sync_time next to
each other? That is, what about the following order of columns?

wal_write
wal_sync
wal_write_time
wal_sync_time

- case SYNC_METHOD_OPEN:
- case SYNC_METHOD_OPEN_DSYNC:
- /* write synced it already */
- break;

IMO it's better to add Assert(false) here to ensure that we never reach
here, as follows. Thought?

+		case SYNC_METHOD_OPEN:
+		case SYNC_METHOD_OPEN_DSYNC:
+			/* not reachable */
+			Assert(false);

Even when a backend exits, it sends the stats via pgstat_beshutdown_hook().
On the other hand, walwriter doesn't do that. Walwriter also should send
the stats even at its exit? Otherwise some stats can fail to be collected.
But ISTM that this issue existed from before, for example checkpointer
doesn't call pgstat_send_bgwriter() at its exit, so it's overkill to fix
this issue in this patch?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachments:

v13-0001-Add-statistics-related-to-write-sync-wal-records_fujii.patchtext/plain; charset=UTF-8; name=v13-0001-Add-statistics-related-to-write-sync-wal-records_fujii.patch; x-mac-creator=0; x-mac-type=0Download

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 967de73596..56eb55bab7 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7450,7 +7450,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       <listitem>
        <para>
         Enables timing of database I/O calls.  This parameter is off by
-        default, because it will repeatedly query the operating system for
+        default, as it will repeatedly query the operating system for
         the current time, which may cause significant overhead on some
         platforms.  You can use the <xref linkend="pgtesttiming"/> tool to
         measure the overhead of timing on your system.
@@ -7464,6 +7464,27 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-track-wal-io-timing" xreflabel="track_wal_io_timing">
+      <term><varname>track_wal_io_timing</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>track_wal_io_timing</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables timing of WAL I/O calls. This parameter is off by default,
+        as it will repeatedly query the operating system for
+        the current time, which may cause significant overhead on some
+        platforms.  You can use the <xref linkend="pgtesttiming"/> tool to
+        measure the overhead of timing on your system.
+        I/O timing information is
+        displayed in <link linkend="monitoring-pg-stat-wal-view">
+        <structname>pg_stat_wal</structname></link>.  Only superusers can
+        change this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-track-functions" xreflabel="track_functions">
       <term><varname>track_functions</varname> (<type>enum</type>)
       <indexterm>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3513e127b7..a16be45a71 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3487,6 +3487,62 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </para></entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of times WAL buffers were written out to disk via
+       <function>XLogWrite</function>, which is invoked during an
+       <function>XLogFlush</function> request (see <xref linkend="wal-configuration"/>)
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time spent writing WAL buffers were written out to disk via
+       <function>XLogWrite</function>, which is invoked during an
+       <function>XLogFlush</function> request (see <xref linkend="wal-configuration"/>),
+       excluding sync time unless 
+       <xref linkend="guc-wal-sync-method"/> is either <literal>open_datasync</literal> or 
+       <literal>open_sync</literal>. Units are in milliseconds with microsecond resolution.
+       This is zero when <xref linkend="guc-track-wal-io-timing"/> is disabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of times WAL files were synced to disk via 
+       <function>issue_xlog_fsync</function>, which is invoked during an 
+       <function>XLogFlush</function> request (see <xref linkend="wal-configuration"/>)
+       while <xref linkend="guc-wal-sync-method"/> was set to one of the 
+       "sync at commit" options (i.e., <literal>fdatasync</literal>, 
+       <literal>fsync</literal>, or <literal>fsync_writethrough</literal>).
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time spent syncing WAL files to disk via
+       <function>issue_xlog_fsync</function>, which is invoked during an 
+       <function>XLogFlush</function> request (see <xref linkend="wal-configuration"/>)
+       while <xref linkend="guc-wal-sync-method"/> was set to one of the 
+       "sync at commit" options (i.e., <literal>fdatasync</literal>, 
+       <literal>fsync</literal>, or <literal>fsync_writethrough</literal>).
+       Units are in milliseconds with microsecond resolution.
+       This is zero when <xref linkend="guc-track-wal-io-timing"/> is disabled.
+      </para></entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index f75527f764..06e4b37012 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -663,7 +663,9 @@
    the <acronym>WAL</acronym> buffers in shared memory. If there is no
    space for the new record, <function>XLogInsertRecord</function> will have
    to write (move to kernel cache) a few filled <acronym>WAL</acronym>
-   buffers. This is undesirable because <function>XLogInsertRecord</function>
+   buffers (the tally of this event is reported in 
+   <literal>wal_buffers_full</literal> in <xref linkend="pg-stat-wal-view"/>).
+   This is undesirable because <function>XLogInsertRecord</function> 
    is used on every database low level modification (for example, row
    insertion) at a time when an exclusive lock is held on affected
    data pages, so the operation needs to be as fast as possible.  What
@@ -672,8 +674,12 @@
    time. Normally, <acronym>WAL</acronym> buffers should be written
    and flushed by an <function>XLogFlush</function> request, which is
    made, for the most part, at transaction commit time to ensure that
-   transaction records are flushed to permanent storage. On systems
-   with high log output, <function>XLogFlush</function> requests might
+   transaction records are flushed to permanent storage. 
+   <function>XLogFlush</function> calls <function>XLogWrite</function> to write 
+   and <function>issue_xlog_fsync</function> to flush them, which are counted as 
+   <literal>wal_write</literal> and <literal>wal_sync</literal> in 
+   <xref linkend="pg-stat-wal-view"/>. On systems with high log output, 
+   <function>XLogFlush</function> requests might
    not occur often enough to prevent <function>XLogInsertRecord</function>
    from having to do writes.  On such systems
    one should increase the number of <acronym>WAL</acronym> buffers by
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 377afb8732..1ad19b189a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -110,6 +110,7 @@ int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
 int			max_slot_wal_keep_size_mb = -1;
+bool		track_wal_io_timing = false;
 
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
@@ -2533,6 +2534,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			int			written;
+			instr_time	start;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
@@ -2541,9 +2543,30 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			do
 			{
 				errno = 0;
+
+				/* Measure I/O timing to write WAL data */
+				if (track_wal_io_timing)
+					INSTR_TIME_SET_CURRENT(start);
+
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 				written = pg_pwrite(openLogFile, from, nleft, startoffset);
 				pgstat_report_wait_end();
+
+				/*
+				 * Increment the I/O timing and the number of times WAL data
+				 * were written out to disk.
+				 */
+				if (track_wal_io_timing)
+				{
+					instr_time	duration;
+
+					INSTR_TIME_SET_CURRENT(duration);
+					INSTR_TIME_SUBTRACT(duration, start);
+					WalStats.m_wal_write_time += INSTR_TIME_GET_MICROSEC(duration);
+				}
+
+				WalStats.m_wal_write++;
+
 				if (written <= 0)
 				{
 					char		xlogfname[MAXFNAMELEN];
@@ -10524,6 +10547,20 @@ void
 issue_xlog_fsync(int fd, XLogSegNo segno)
 {
 	char	   *msg = NULL;
+	instr_time	start;
+
+	/*
+	 * Quick exit if fsync is disabled or write() has already synced the WAL
+	 * file.
+	 */
+	if (!enableFsync ||
+		sync_method == SYNC_METHOD_OPEN ||
+		sync_method == SYNC_METHOD_OPEN_DSYNC)
+		return;
+
+	/* Measure I/O timing to sync the WAL file */
+	if (track_wal_io_timing)
+		INSTR_TIME_SET_CURRENT(start);
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
 	switch (sync_method)
@@ -10546,8 +10583,10 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
 #endif
 		case SYNC_METHOD_OPEN:
 		case SYNC_METHOD_OPEN_DSYNC:
-			/* write synced it already */
+			/* not reachable */
+			Assert(false);
 			break;
+
 		default:
 			elog(PANIC, "unrecognized wal_sync_method: %d", sync_method);
 			break;
@@ -10568,6 +10607,20 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
 	}
 
 	pgstat_report_wait_end();
+
+	/*
+	 * Increment the I/O timing and the number of times WAL files were synced.
+	 */
+	if (track_wal_io_timing)
+	{
+		instr_time	duration;
+
+		INSTR_TIME_SET_CURRENT(duration);
+		INSTR_TIME_SUBTRACT(duration, start);
+		WalStats.m_wal_sync_time += INSTR_TIME_GET_MICROSEC(duration);
+	}
+
+	WalStats.m_wal_sync++;
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fc94a73a54..ba5158ba57 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1005,6 +1005,10 @@ CREATE VIEW pg_stat_wal AS
         w.wal_fpi,
         w.wal_bytes,
         w.wal_buffers_full,
+        w.wal_write,
+        w.wal_write_time,
+        w.wal_sync,
+        w.wal_sync_time,
         w.stats_reset
     FROM pg_stat_get_wal() w;
 
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 76f9f98ebb..57c4d5a5d9 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -505,7 +505,7 @@ CheckpointerMain(void)
 		pgstat_send_bgwriter();
 
 		/* Send WAL statistics to the stats collector. */
-		pgstat_send_wal();
+		pgstat_report_wal();
 
 		/*
 		 * If any checkpoint flags have been set, redo the loop to handle the
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f75b52719d..cf1d3ea366 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -146,8 +146,8 @@ PgStat_MsgWal WalStats;
 
 /*
  * WAL usage counters saved from pgWALUsage at the previous call to
- * pgstat_send_wal(). This is used to calculate how much WAL usage
- * happens between pgstat_send_wal() calls, by substracting
+ * pgstat_report_wal(). This is used to calculate how much WAL usage
+ * happens between pgstat_report_wal() calls, by substracting
  * the previous counters from the current ones.
  */
 static WalUsage prevWalUsage;
@@ -975,7 +975,7 @@ pgstat_report_stat(bool disconnect)
 	pgstat_send_funcstats();
 
 	/* Send WAL statistics */
-	pgstat_send_wal();
+	pgstat_report_wal();
 
 	/* Finally send SLRU statistics */
 	pgstat_send_slru();
@@ -3118,7 +3118,7 @@ pgstat_initialize(void)
 	}
 
 	/*
-	 * Initialize prevWalUsage with pgWalUsage so that pgstat_send_wal() can
+	 * Initialize prevWalUsage with pgWalUsage so that pgstat_report_wal() can
 	 * calculate how much pgWalUsage counters are increased by substracting
 	 * prevWalUsage from pgWalUsage.
 	 */
@@ -4667,17 +4667,17 @@ pgstat_send_bgwriter(void)
 }
 
 /* ----------
- * pgstat_send_wal() -
+ * pgstat_report_wal() -
  *
- *		Send WAL statistics to the collector
+ * Calculate how much WAL usage counters are increased and send
+ * WAL statistics to the collector.
+ *
+ * Must be called by processes that generate WAL.
  * ----------
  */
 void
-pgstat_send_wal(void)
+pgstat_report_wal(void)
 {
-	/* We assume this initializes to zeroes */
-	static const PgStat_MsgWal all_zeroes;
-
 	WalUsage	walusage;
 
 	/*
@@ -4692,6 +4692,33 @@ pgstat_send_wal(void)
 	WalStats.m_wal_fpi = walusage.wal_fpi;
 	WalStats.m_wal_bytes = walusage.wal_bytes;
 
+	/*
+	 * Send WAL stats message to the collector.
+	 */
+	pgstat_send_wal(true);
+
+	/*
+	 * Save the current counters for the subsequent calculation of WAL usage.
+	 */
+	prevWalUsage = pgWalUsage;
+}
+
+/* ----------
+ * pgstat_send_wal() -
+ *
+ *	Send WAL statistics to the collector.
+ *
+ * If 'force' is not set, WAL stats message is only sent if enough time has
+ * passed since last one was sent to reach PGSTAT_STAT_INTERVAL.
+ * ----------
+ */
+void
+pgstat_send_wal(bool force)
+{
+	/* We assume this initializes to zeroes */
+	static const PgStat_MsgWal all_zeroes;
+	static TimestampTz sendTime = 0;
+
 	/*
 	 * This function can be called even if nothing at all has happened. In
 	 * this case, avoid sending a completely empty message to the stats
@@ -4700,17 +4727,25 @@ pgstat_send_wal(void)
 	if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
 		return;
 
+	if (!force)
+	{
+		TimestampTz now = GetCurrentTimestamp();
+
+		/*
+		 * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
+		 * msec since we last sent one.
+		 */
+		if (!TimestampDifferenceExceeds(sendTime, now, PGSTAT_STAT_INTERVAL))
+			return;
+		sendTime = now;
+	}
+
 	/*
 	 * Prepare and send the message
 	 */
 	pgstat_setheader(&WalStats.m_hdr, PGSTAT_MTYPE_WAL);
 	pgstat_send(&WalStats, sizeof(WalStats));
 
-	/*
-	 * Save the current counters for the subsequent calculation of WAL usage.
-	 */
-	prevWalUsage = pgWalUsage;
-
 	/*
 	 * Clear out the statistics buffer, so it can be re-used.
 	 */
@@ -6892,6 +6927,10 @@ pgstat_recv_wal(PgStat_MsgWal *msg, int len)
 	walStats.wal_fpi += msg->m_wal_fpi;
 	walStats.wal_bytes += msg->m_wal_bytes;
 	walStats.wal_buffers_full += msg->m_wal_buffers_full;
+	walStats.wal_write += msg->m_wal_write;
+	walStats.wal_write_time += msg->m_wal_write_time;
+	walStats.wal_sync += msg->m_wal_sync;
+	walStats.wal_sync_time += msg->m_wal_sync_time;
 }
 
 /* ----------
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index 4f1a8e356b..43c9709b79 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -78,6 +78,11 @@ int			WalWriterFlushAfter = 128;
 #define LOOPS_UNTIL_HIBERNATE		50
 #define HIBERNATE_FACTOR			25
 
+/*
+ * Minimum time between stats file updates; in milliseconds.
+ */
+#define PGSTAT_STAT_INTERVAL	500
+
 /*
  * Main entry point for walwriter process
  *
@@ -253,6 +258,9 @@ WalWriterMain(void)
 		else if (left_till_hibernate > 0)
 			left_till_hibernate--;
 
+		/* Send WAL statistics to the stats collector */
+		pgstat_send_wal(false);
+
 		/*
 		 * Sleep until we are signaled or WalWriterDelay has elapsed.  If we
 		 * haven't done anything useful for quite some time, lengthen the
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 62bff52638..7296ef04ff 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1799,7 +1799,7 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_wal(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_COLS	5
+#define PG_STAT_GET_WAL_COLS	9
 	TupleDesc	tupdesc;
 	Datum		values[PG_STAT_GET_WAL_COLS];
 	bool		nulls[PG_STAT_GET_WAL_COLS];
@@ -1820,7 +1820,15 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 					   NUMERICOID, -1, 0);
 	TupleDescInitEntry(tupdesc, (AttrNumber) 4, "wal_buffers_full",
 					   INT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 5, "stats_reset",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 5, "wal_write",
+					   INT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 6, "wal_write_time",
+					   FLOAT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 7, "wal_sync",
+					   INT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 8, "wal_sync_time",
+					   FLOAT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 9, "stats_reset",
 					   TIMESTAMPTZOID, -1, 0);
 
 	BlessTupleDesc(tupdesc);
@@ -1840,7 +1848,17 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 									Int32GetDatum(-1));
 
 	values[3] = Int64GetDatum(wal_stats->wal_buffers_full);
-	values[4] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
+	values[4] = Int64GetDatum(wal_stats->wal_write);
+
+	/* convert counter from microsec to millisec for display */
+	values[5] = Float8GetDatum((double) wal_stats->wal_write_time / 1000.0);
+
+	values[6] = Int64GetDatum(wal_stats->wal_sync);
+
+	/* convert counter from microsec to millisec for display */
+	values[7] = Float8GetDatum((double) wal_stats->wal_sync_time / 1000.0);
+
+	values[8] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
 
 	/* Returns the record as Datum */
 	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 3fd1a5fbe2..e337df42cb 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1485,6 +1485,15 @@ static struct config_bool ConfigureNamesBool[] =
 		false,
 		NULL, NULL, NULL
 	},
+	{
+		{"track_wal_io_timing", PGC_SUSET, STATS_COLLECTOR,
+			gettext_noop("Collects timing statistics for WAL I/O activity."),
+			NULL
+		},
+		&track_wal_io_timing,
+		false,
+		NULL, NULL, NULL
+	},
 
 	{
 		{"update_process_title", PGC_SUSET, PROCESS_TITLE,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index ee06528bb0..c6483fa1ff 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -586,6 +586,7 @@
 #track_activities = on
 #track_counts = on
 #track_io_timing = off
+#track_wal_io_timing = off
 #track_functions = none			# none, pl, all
 #track_activity_query_size = 1024	# (change requires restart)
 #stats_temp_directory = 'pg_stat_tmp'
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 75ec1073bd..1e53d9d4ca 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -131,6 +131,7 @@ extern int	recovery_min_apply_delay;
 extern char *PrimaryConnInfo;
 extern char *PrimarySlotName;
 extern bool wal_receiver_create_temp_slot;
+extern bool track_wal_io_timing;
 
 /* indirectly set via GUC system */
 extern TransactionId recoveryTargetXid;
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 59d2b71ca9..71370962f4 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5545,9 +5545,9 @@
 { oid => '1136', descr => 'statistics: information about WAL activity',
   proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
   proparallel => 'r', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int8,int8,numeric,int8,timestamptz}',
-  proargmodes => '{o,o,o,o,o}',
-  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,stats_reset}',
+  proallargtypes => '{int8,int8,numeric,int8,int8,float8,int8,float8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o}',
+  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,wal_write,wal_write_time,wal_sync,wal_sync_time,stats_reset}',
   prosrc => 'pg_stat_get_wal' },
 
 { oid => '2306', descr => 'statistics: information about SLRU caches',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 724068cf87..0f16f0fef8 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -474,6 +474,12 @@ typedef struct PgStat_MsgWal
 	PgStat_Counter m_wal_fpi;
 	uint64		m_wal_bytes;
 	PgStat_Counter m_wal_buffers_full;
+	PgStat_Counter m_wal_write;
+	PgStat_Counter m_wal_write_time;	/* time spend writing wal records in
+										 * micro seconds */
+	PgStat_Counter m_wal_sync;
+	PgStat_Counter m_wal_sync_time; /* time spend syncing wal records in micro
+									 * seconds */
 } PgStat_MsgWal;
 
 /* ----------
@@ -839,6 +845,10 @@ typedef struct PgStat_WalStats
 	PgStat_Counter wal_fpi;
 	uint64		wal_bytes;
 	PgStat_Counter wal_buffers_full;
+	PgStat_Counter wal_write;
+	PgStat_Counter wal_write_time;
+	PgStat_Counter wal_sync;
+	PgStat_Counter wal_sync_time;
 	TimestampTz stat_reset_timestamp;
 } PgStat_WalStats;
 
@@ -1590,7 +1600,8 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
 
 extern void pgstat_send_archiver(const char *xlog, bool failed);
 extern void pgstat_send_bgwriter(void);
-extern void pgstat_send_wal(void);
+extern void pgstat_report_wal(void);
+extern void pgstat_send_wal(bool force);
 
 /* ----------
  * Support functions for the SQL-callable functions to
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index b1c9b7bdfe..fce6ab8338 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2160,8 +2160,12 @@ pg_stat_wal| SELECT w.wal_records,
     w.wal_fpi,
     w.wal_bytes,
     w.wal_buffers_full,
+    w.wal_write,
+    w.wal_write_time,
+    w.wal_sync,
+    w.wal_sync_time,
     w.stats_reset
-   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, stats_reset);
+   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, wal_write, wal_write_time, wal_sync, wal_sync_time, stats_reset);
 pg_stat_wal_receiver| SELECT s.pid,
     s.status,
     s.receive_start_lsn,

#42

Masahiro Ikeda

ikedamsh@oss.nttdata.com

almost 5 years ago

In reply to: Fujii Masao (#41)

2 attachment(s)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021-03-05 01:02, Fujii Masao wrote:

On 2021/03/04 16:14, Masahiro Ikeda wrote:
On 2021-03-03 20:27, Masahiro Ikeda wrote:
On 2021-03-03 16:30, Fujii Masao wrote:
On 2021/03/03 14:33, Masahiro Ikeda wrote:

On 2021-02-24 16:14, Fujii Masao wrote:

On 2021/02/15 11:59, Masahiro Ikeda wrote:

On 2021-02-10 00:51, David G. Johnston wrote:

On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

I pgindented the patches.

... <function>XLogWrite</function>, which is invoked during an
<function>XLogFlush</function> request (see ...). This is also
incremented by the WAL receiver during replication.

("which normally called" should be "which is normally called" or
"which normally is called" if you want to keep true to the
original)
You missed the adding the space before an opening parenthesis
here and
elsewhere (probably copy-paste)

is ether -> is either
"This parameter is off by default as it will repeatedly query
the
operating system..."
", because" -> "as"

Thanks, I fixed them.

wal_write_time and the sync items also need the note: "This is
also
incremented by the WAL receiver during replication."

I skipped changing it since I separated the stats for the WAL
receiver
in pg_stat_wal_receiver.

"The number of times it happened..." -> " (the tally of this
event is
reported in wal_buffers_full in....) This is undesirable because
..."

Thanks, I fixed it.

I notice that the patch for WAL receiver doesn't require
explicitly
computing the sync statistics but does require computing the
write
statistics. This is because of the presence of issue_xlog_fsync
but
absence of an equivalent pg_xlog_pwrite. Additionally, I
observe that
the XLogWrite code path calls pgstat_report_wait_*() while the
WAL
receiver path does not. It seems technically straight-forward
to
refactor here to avoid the almost-duplicated logic in the two
places,
though I suspect there may be a trade-off for not adding another
function call to the stack given the importance of WAL
processing
(though that seems marginalized compared to the cost of actually
writing the WAL). Or, as Fujii noted, go the other way and
don't have
any shared code between the two but instead implement the WAL
receiver
one to use pg_stat_wal_receiver instead. In either case, this
half-and-half implementation seems undesirable.

OK, as Fujii-san mentioned, I separated the WAL receiver stats.
(v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)

Thanks for updating the patches!

I added the infrastructure code to communicate the WAL receiver
stats messages between the WAL receiver and the stats collector,
and
the stats for WAL receiver is counted in pg_stat_wal_receiver.
What do you think?

On second thought, this idea seems not good. Because those stats
are
collected between multiple walreceivers, but other values in
pg_stat_wal_receiver is only related to the walreceiver process
running
at that moment. IOW, it seems strange that some values show
dynamic
stats and the others show collected stats, even though they are in
the same view pg_stat_wal_receiver. Thought?

OK, I fixed it.
The stats collected in the WAL receiver is exposed in pg_stat_wal
view in v11 patch.

Thanks for updating the patches! I'm now reading 001 patch.
+    /* Check whether the WAL file was synced to disk right now */
+    if (enableFsync &&
+        (sync_method == SYNC_METHOD_FSYNC ||
+         sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
+         sync_method == SYNC_METHOD_FDATASYNC))
+    {
Isn't it better to make issue_xlog_fsync() return immediately
if enableFsync is off, sync_method is open_sync or open_data_sync,
to simplify the code more?
Thanks for the comments.
I added the above code in v12 patch.
+        /*
+         * Send WAL statistics only if WalWriterDelay has elapsed 
to minimize
+         * the overhead in WAL-writing.
+         */
+        if (rc & WL_TIMEOUT)
+            pgstat_send_wal();
On second thought, this change means that it always takes
wal_writer_delay
before walwriter's WAL stats is sent after XLogBackgroundFlush() is
called.
For example, if wal_writer_delay is set to several seconds, some
values in
pg_stat_wal would be not up-to-date meaninglessly for those seconds.
So I'm thinking to withdraw my previous comment and it's ok to send
the stats every after XLogBackgroundFlush() is called. Thought?
Thanks, I didn't notice that.

Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's
default value is 200msec and it may be set shorter time.
Yeah, if wal_writer_delay is set to very small value, there is a risk
that the WAL stats are sent too frequently. I agree that's a problem.
Why don't to make another way to check the timestamp?
+               /*
+                * Don't send a message unless it's been at least
PGSTAT_STAT_INTERVAL
+                * msec since we last sent one
+                */
+               now = GetCurrentTimestamp();
+               if (TimestampDifferenceExceeds(last_report, now,
PGSTAT_STAT_INTERVAL))
+               {
+                       pgstat_send_wal();
+                       last_report = now;
+               }
+
Although I worried that it's better to add the check code in
pgstat_send_wal(),
Agreed.

I didn't do so because to avoid to double check PGSTAT_STAT_INTERVAL.
pgstat_send_wal() is invoked pg_report_stat() and it already checks
the
PGSTAT_STAT_INTERVAL.

I think that we can do that. What about the attached patch?

Thanks, I thought it's better.

I forgot to remove an unused variable.
The attached v13 patch is fixed.

Thanks for updating the patch!
+        w.wal_write,
+        w.wal_write_time,
+        w.wal_sync,
+        w.wal_sync_time,
It's more natural to put wal_write_time and wal_sync_time next to
each other? That is, what about the following order of columns?

wal_write
wal_sync
wal_write_time
wal_sync_time

Yes, I fixed it.

- case SYNC_METHOD_OPEN:
- case SYNC_METHOD_OPEN_DSYNC:
- /* write synced it already */
- break;

IMO it's better to add Assert(false) here to ensure that we never reach
here, as follows. Thought?
+		case SYNC_METHOD_OPEN:
+		case SYNC_METHOD_OPEN_DSYNC:
+			/* not reachable */
+			Assert(false);

I agree.

Even when a backend exits, it sends the stats via
pgstat_beshutdown_hook().
On the other hand, walwriter doesn't do that. Walwriter also should
send
the stats even at its exit? Otherwise some stats can fail to be
collected.
But ISTM that this issue existed from before, for example checkpointer
doesn't call pgstat_send_bgwriter() at its exit, so it's overkill to
fix
this issue in this patch?

Thanks, I thought it's better to do so.
I added the shutdown hook for the walwriter and the checkpointer in
v14-0003 patch.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachments:

v14-0001-Add-statistics-related-to-write-sync-wal-records.patchtext/x-diff; name=v14-0001-Add-statistics-related-to-write-sync-wal-records.patchDownload

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 967de73596..56eb55bab7 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7450,7 +7450,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       <listitem>
        <para>
         Enables timing of database I/O calls.  This parameter is off by
-        default, because it will repeatedly query the operating system for
+        default, as it will repeatedly query the operating system for
         the current time, which may cause significant overhead on some
         platforms.  You can use the <xref linkend="pgtesttiming"/> tool to
         measure the overhead of timing on your system.
@@ -7464,6 +7464,27 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-track-wal-io-timing" xreflabel="track_wal_io_timing">
+      <term><varname>track_wal_io_timing</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>track_wal_io_timing</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables timing of WAL I/O calls. This parameter is off by default,
+        as it will repeatedly query the operating system for
+        the current time, which may cause significant overhead on some
+        platforms.  You can use the <xref linkend="pgtesttiming"/> tool to
+        measure the overhead of timing on your system.
+        I/O timing information is
+        displayed in <link linkend="monitoring-pg-stat-wal-view">
+        <structname>pg_stat_wal</structname></link>.  Only superusers can
+        change this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-track-functions" xreflabel="track_functions">
       <term><varname>track_functions</varname> (<type>enum</type>)
       <indexterm>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3513e127b7..1520cef505 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3487,6 +3487,62 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </para></entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of times WAL buffers were written out to disk via
+       <function>XLogWrite</function>, which is invoked during an
+       <function>XLogFlush</function> request (see <xref linkend="wal-configuration"/>)
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of times WAL files were synced to disk via 
+       <function>issue_xlog_fsync</function>, which is invoked during an 
+       <function>XLogFlush</function> request (see <xref linkend="wal-configuration"/>)
+       while <xref linkend="guc-wal-sync-method"/> was set to one of the 
+       "sync at commit" options (i.e., <literal>fdatasync</literal>, 
+       <literal>fsync</literal>, or <literal>fsync_writethrough</literal>).
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time spent writing WAL buffers were written out to disk via
+       <function>XLogWrite</function>, which is invoked during an
+       <function>XLogFlush</function> request (see <xref linkend="wal-configuration"/>),
+       excluding sync time unless 
+       <xref linkend="guc-wal-sync-method"/> is either <literal>open_datasync</literal> or 
+       <literal>open_sync</literal>. Units are in milliseconds with microsecond resolution.
+       This is zero when <xref linkend="guc-track-wal-io-timing"/> is disabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time spent syncing WAL files to disk via
+       <function>issue_xlog_fsync</function>, which is invoked during an 
+       <function>XLogFlush</function> request (see <xref linkend="wal-configuration"/>)
+       while <xref linkend="guc-wal-sync-method"/> was set to one of the 
+       "sync at commit" options (i.e., <literal>fdatasync</literal>, 
+       <literal>fsync</literal>, or <literal>fsync_writethrough</literal>).
+       Units are in milliseconds with microsecond resolution.
+       This is zero when <xref linkend="guc-track-wal-io-timing"/> is disabled.
+      </para></entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index f75527f764..06e4b37012 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -663,7 +663,9 @@
    the <acronym>WAL</acronym> buffers in shared memory. If there is no
    space for the new record, <function>XLogInsertRecord</function> will have
    to write (move to kernel cache) a few filled <acronym>WAL</acronym>
-   buffers. This is undesirable because <function>XLogInsertRecord</function>
+   buffers (the tally of this event is reported in 
+   <literal>wal_buffers_full</literal> in <xref linkend="pg-stat-wal-view"/>).
+   This is undesirable because <function>XLogInsertRecord</function> 
    is used on every database low level modification (for example, row
    insertion) at a time when an exclusive lock is held on affected
    data pages, so the operation needs to be as fast as possible.  What
@@ -672,8 +674,12 @@
    time. Normally, <acronym>WAL</acronym> buffers should be written
    and flushed by an <function>XLogFlush</function> request, which is
    made, for the most part, at transaction commit time to ensure that
-   transaction records are flushed to permanent storage. On systems
-   with high log output, <function>XLogFlush</function> requests might
+   transaction records are flushed to permanent storage. 
+   <function>XLogFlush</function> calls <function>XLogWrite</function> to write 
+   and <function>issue_xlog_fsync</function> to flush them, which are counted as 
+   <literal>wal_write</literal> and <literal>wal_sync</literal> in 
+   <xref linkend="pg-stat-wal-view"/>. On systems with high log output, 
+   <function>XLogFlush</function> requests might
    not occur often enough to prevent <function>XLogInsertRecord</function>
    from having to do writes.  On such systems
    one should increase the number of <acronym>WAL</acronym> buffers by
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fe56324439..24c3dd32f8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -110,6 +110,7 @@ int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
 int			max_slot_wal_keep_size_mb = -1;
+bool		track_wal_io_timing = false;
 
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
@@ -2533,6 +2534,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			int			written;
+			instr_time	start;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
@@ -2541,9 +2543,30 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			do
 			{
 				errno = 0;
+
+				/* Measure I/O timing to write WAL data */
+				if (track_wal_io_timing)
+					INSTR_TIME_SET_CURRENT(start);
+
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 				written = pg_pwrite(openLogFile, from, nleft, startoffset);
 				pgstat_report_wait_end();
+
+				/*
+				 * Increment the I/O timing and the number of times WAL data
+				 * were written out to disk.
+				 */
+				if (track_wal_io_timing)
+				{
+					instr_time	duration;
+
+					INSTR_TIME_SET_CURRENT(duration);
+					INSTR_TIME_SUBTRACT(duration, start);
+					WalStats.m_wal_write_time += INSTR_TIME_GET_MICROSEC(duration);
+				}
+
+				WalStats.m_wal_write++;
+
 				if (written <= 0)
 				{
 					char		xlogfname[MAXFNAMELEN];
@@ -10526,6 +10549,20 @@ void
 issue_xlog_fsync(int fd, XLogSegNo segno)
 {
 	char	   *msg = NULL;
+	instr_time	start;
+
+	/*
+	 * Quick exit if fsync is disabled or write() has already synced the WAL
+	 * file.
+	 */
+	if (!enableFsync ||
+		sync_method == SYNC_METHOD_OPEN ||
+		sync_method == SYNC_METHOD_OPEN_DSYNC)
+		return;
+
+	/* Measure I/O timing to sync the WAL file */
+	if (track_wal_io_timing)
+		INSTR_TIME_SET_CURRENT(start);
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
 	switch (sync_method)
@@ -10548,7 +10585,8 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
 #endif
 		case SYNC_METHOD_OPEN:
 		case SYNC_METHOD_OPEN_DSYNC:
-			/* write synced it already */
+			/* not reachable */
+			Assert(false);
 			break;
 		default:
 			elog(PANIC, "unrecognized wal_sync_method: %d", sync_method);
@@ -10570,6 +10608,20 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
 	}
 
 	pgstat_report_wait_end();
+
+	/*
+	 * Increment the I/O timing and the number of times WAL files were synced.
+	 */
+	if (track_wal_io_timing)
+	{
+		instr_time	duration;
+
+		INSTR_TIME_SET_CURRENT(duration);
+		INSTR_TIME_SUBTRACT(duration, start);
+		WalStats.m_wal_sync_time += INSTR_TIME_GET_MICROSEC(duration);
+	}
+
+	WalStats.m_wal_sync++;
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fc94a73a54..51ba1b5826 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1005,6 +1005,10 @@ CREATE VIEW pg_stat_wal AS
         w.wal_fpi,
         w.wal_bytes,
         w.wal_buffers_full,
+        w.wal_write,
+        w.wal_sync,
+        w.wal_write_time,
+        w.wal_sync_time,
         w.stats_reset
     FROM pg_stat_get_wal() w;
 
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 0bbeece19d..3894f4a270 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -505,7 +505,7 @@ CheckpointerMain(void)
 		pgstat_send_bgwriter();
 
 		/* Send WAL statistics to the stats collector. */
-		pgstat_send_wal();
+		pgstat_report_wal();
 
 		/*
 		 * If any checkpoint flags have been set, redo the loop to handle the
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f75b52719d..cf1d3ea366 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -146,8 +146,8 @@ PgStat_MsgWal WalStats;
 
 /*
  * WAL usage counters saved from pgWALUsage at the previous call to
- * pgstat_send_wal(). This is used to calculate how much WAL usage
- * happens between pgstat_send_wal() calls, by substracting
+ * pgstat_report_wal(). This is used to calculate how much WAL usage
+ * happens between pgstat_report_wal() calls, by substracting
  * the previous counters from the current ones.
  */
 static WalUsage prevWalUsage;
@@ -975,7 +975,7 @@ pgstat_report_stat(bool disconnect)
 	pgstat_send_funcstats();
 
 	/* Send WAL statistics */
-	pgstat_send_wal();
+	pgstat_report_wal();
 
 	/* Finally send SLRU statistics */
 	pgstat_send_slru();
@@ -3118,7 +3118,7 @@ pgstat_initialize(void)
 	}
 
 	/*
-	 * Initialize prevWalUsage with pgWalUsage so that pgstat_send_wal() can
+	 * Initialize prevWalUsage with pgWalUsage so that pgstat_report_wal() can
 	 * calculate how much pgWalUsage counters are increased by substracting
 	 * prevWalUsage from pgWalUsage.
 	 */
@@ -4667,17 +4667,17 @@ pgstat_send_bgwriter(void)
 }
 
 /* ----------
- * pgstat_send_wal() -
+ * pgstat_report_wal() -
  *
- *		Send WAL statistics to the collector
+ * Calculate how much WAL usage counters are increased and send
+ * WAL statistics to the collector.
+ *
+ * Must be called by processes that generate WAL.
  * ----------
  */
 void
-pgstat_send_wal(void)
+pgstat_report_wal(void)
 {
-	/* We assume this initializes to zeroes */
-	static const PgStat_MsgWal all_zeroes;
-
 	WalUsage	walusage;
 
 	/*
@@ -4692,6 +4692,33 @@ pgstat_send_wal(void)
 	WalStats.m_wal_fpi = walusage.wal_fpi;
 	WalStats.m_wal_bytes = walusage.wal_bytes;
 
+	/*
+	 * Send WAL stats message to the collector.
+	 */
+	pgstat_send_wal(true);
+
+	/*
+	 * Save the current counters for the subsequent calculation of WAL usage.
+	 */
+	prevWalUsage = pgWalUsage;
+}
+
+/* ----------
+ * pgstat_send_wal() -
+ *
+ *	Send WAL statistics to the collector.
+ *
+ * If 'force' is not set, WAL stats message is only sent if enough time has
+ * passed since last one was sent to reach PGSTAT_STAT_INTERVAL.
+ * ----------
+ */
+void
+pgstat_send_wal(bool force)
+{
+	/* We assume this initializes to zeroes */
+	static const PgStat_MsgWal all_zeroes;
+	static TimestampTz sendTime = 0;
+
 	/*
 	 * This function can be called even if nothing at all has happened. In
 	 * this case, avoid sending a completely empty message to the stats
@@ -4700,17 +4727,25 @@ pgstat_send_wal(void)
 	if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
 		return;
 
+	if (!force)
+	{
+		TimestampTz now = GetCurrentTimestamp();
+
+		/*
+		 * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
+		 * msec since we last sent one.
+		 */
+		if (!TimestampDifferenceExceeds(sendTime, now, PGSTAT_STAT_INTERVAL))
+			return;
+		sendTime = now;
+	}
+
 	/*
 	 * Prepare and send the message
 	 */
 	pgstat_setheader(&WalStats.m_hdr, PGSTAT_MTYPE_WAL);
 	pgstat_send(&WalStats, sizeof(WalStats));
 
-	/*
-	 * Save the current counters for the subsequent calculation of WAL usage.
-	 */
-	prevWalUsage = pgWalUsage;
-
 	/*
 	 * Clear out the statistics buffer, so it can be re-used.
 	 */
@@ -6892,6 +6927,10 @@ pgstat_recv_wal(PgStat_MsgWal *msg, int len)
 	walStats.wal_fpi += msg->m_wal_fpi;
 	walStats.wal_bytes += msg->m_wal_bytes;
 	walStats.wal_buffers_full += msg->m_wal_buffers_full;
+	walStats.wal_write += msg->m_wal_write;
+	walStats.wal_write_time += msg->m_wal_write_time;
+	walStats.wal_sync += msg->m_wal_sync;
+	walStats.wal_sync_time += msg->m_wal_sync_time;
 }
 
 /* ----------
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index 4f1a8e356b..132df29aba 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -253,6 +253,9 @@ WalWriterMain(void)
 		else if (left_till_hibernate > 0)
 			left_till_hibernate--;
 
+		/* Send WAL statistics to the stats collector */
+		pgstat_send_wal(false);
+
 		/*
 		 * Sleep until we are signaled or WalWriterDelay has elapsed.  If we
 		 * haven't done anything useful for quite some time, lengthen the
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 62bff52638..c7540eec12 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1799,7 +1799,7 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_wal(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_COLS	5
+#define PG_STAT_GET_WAL_COLS	9
 	TupleDesc	tupdesc;
 	Datum		values[PG_STAT_GET_WAL_COLS];
 	bool		nulls[PG_STAT_GET_WAL_COLS];
@@ -1820,7 +1820,15 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 					   NUMERICOID, -1, 0);
 	TupleDescInitEntry(tupdesc, (AttrNumber) 4, "wal_buffers_full",
 					   INT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 5, "stats_reset",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 5, "wal_write",
+					   INT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 6, "wal_sync",
+					   INT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 7, "wal_write_time",
+					   FLOAT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 8, "wal_sync_time",
+					   FLOAT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 9, "stats_reset",
 					   TIMESTAMPTZOID, -1, 0);
 
 	BlessTupleDesc(tupdesc);
@@ -1840,7 +1848,13 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 									Int32GetDatum(-1));
 
 	values[3] = Int64GetDatum(wal_stats->wal_buffers_full);
-	values[4] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
+	values[4] = Int64GetDatum(wal_stats->wal_write);
+	values[5] = Int64GetDatum(wal_stats->wal_sync);
+
+	/* convert counter from microsec to millisec for display */
+	values[6] = Float8GetDatum((double) wal_stats->wal_write_time / 1000.0);
+	values[7] = Float8GetDatum((double) wal_stats->wal_sync_time / 1000.0);
+	values[8] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
 
 	/* Returns the record as Datum */
 	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 442850e8ad..da05b5acb1 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1485,6 +1485,15 @@ static struct config_bool ConfigureNamesBool[] =
 		false,
 		NULL, NULL, NULL
 	},
+	{
+		{"track_wal_io_timing", PGC_SUSET, STATS_COLLECTOR,
+			gettext_noop("Collects timing statistics for WAL I/O activity."),
+			NULL
+		},
+		&track_wal_io_timing,
+		false,
+		NULL, NULL, NULL
+	},
 
 	{
 		{"update_process_title", PGC_SUSET, PROCESS_TITLE,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index ee06528bb0..c6483fa1ff 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -586,6 +586,7 @@
 #track_activities = on
 #track_counts = on
 #track_io_timing = off
+#track_wal_io_timing = off
 #track_functions = none			# none, pl, all
 #track_activity_query_size = 1024	# (change requires restart)
 #stats_temp_directory = 'pg_stat_tmp'
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 75ec1073bd..1e53d9d4ca 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -131,6 +131,7 @@ extern int	recovery_min_apply_delay;
 extern char *PrimaryConnInfo;
 extern char *PrimarySlotName;
 extern bool wal_receiver_create_temp_slot;
+extern bool track_wal_io_timing;
 
 /* indirectly set via GUC system */
 extern TransactionId recoveryTargetXid;
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 7ba7c2ff8a..933659efc6 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5546,9 +5546,9 @@
 { oid => '1136', descr => 'statistics: information about WAL activity',
   proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
   proparallel => 'r', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int8,int8,numeric,int8,timestamptz}',
-  proargmodes => '{o,o,o,o,o}',
-  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,stats_reset}',
+  proallargtypes => '{int8,int8,numeric,int8,int8,int8,float8,float8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o}',
+  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,wal_write,wal_sync,wal_write_time,wal_sync_time,stats_reset}',
   prosrc => 'pg_stat_get_wal' },
 
 { oid => '2306', descr => 'statistics: information about SLRU caches',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 724068cf87..7dd375a114 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -474,6 +474,12 @@ typedef struct PgStat_MsgWal
 	PgStat_Counter m_wal_fpi;
 	uint64		m_wal_bytes;
 	PgStat_Counter m_wal_buffers_full;
+	PgStat_Counter m_wal_write;
+	PgStat_Counter m_wal_sync;
+	PgStat_Counter m_wal_write_time;	/* time spend writing wal records in
+										 * micro seconds */
+	PgStat_Counter m_wal_sync_time; /* time spend syncing wal records in micro
+									 * seconds */
 } PgStat_MsgWal;
 
 /* ----------
@@ -839,6 +845,10 @@ typedef struct PgStat_WalStats
 	PgStat_Counter wal_fpi;
 	uint64		wal_bytes;
 	PgStat_Counter wal_buffers_full;
+	PgStat_Counter wal_write;
+	PgStat_Counter wal_sync;
+	PgStat_Counter wal_write_time;
+	PgStat_Counter wal_sync_time;
 	TimestampTz stat_reset_timestamp;
 } PgStat_WalStats;
 
@@ -1590,7 +1600,8 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
 
 extern void pgstat_send_archiver(const char *xlog, bool failed);
 extern void pgstat_send_bgwriter(void);
-extern void pgstat_send_wal(void);
+extern void pgstat_report_wal(void);
+extern void pgstat_send_wal(bool force);
 
 /* ----------
  * Support functions for the SQL-callable functions to

v14-0003-Add-shutdown-hooks-to-send-statistics.patchtext/x-diff; name=v14-0003-Add-shutdown-hooks-to-send-statistics.patchDownload

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 3894f4a270..a9c0090eef 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -168,6 +168,8 @@ static bool IsCheckpointOnSchedule(double progress);
 static bool ImmediateCheckpointRequested(void);
 static bool CompactCheckpointerRequestQueue(void);
 static void UpdateSharedMemoryConfig(void);
+static void pgstat_beshutdown_hook(int code, Datum arg);
+static void pgstat_send_checkpointer(void);
 
 /* Signal handlers */
 static void ReqCheckpointHandler(SIGNAL_ARGS);
@@ -187,6 +189,9 @@ CheckpointerMain(void)
 
 	CheckpointerShmem->checkpointer_pid = MyProcPid;
 
+	/* Arrange to send statistics to the stats collector at checkpointer exit */
+	on_shmem_exit(pgstat_beshutdown_hook, 0);
+
 	/*
 	 * Properly accept or ignore signals the postmaster might send us
 	 *
@@ -495,17 +500,8 @@ CheckpointerMain(void)
 		/* Check for archive_timeout and switch xlog files if necessary. */
 		CheckArchiveTimeout();
 
-		/*
-		 * Send off activity statistics to the stats collector.  (The reason
-		 * why we re-use bgwriter-related code for this is that the bgwriter
-		 * and checkpointer used to be just one process.  It's probably not
-		 * worth the trouble to split the stats support into two independent
-		 * stats message types.)
-		 */
-		pgstat_send_bgwriter();
-
-		/* Send WAL statistics to the stats collector. */
-		pgstat_report_wal();
+		/* Send statistics to the stats collector */
+		pgstat_send_checkpointer();
 
 		/*
 		 * If any checkpoint flags have been set, redo the loop to handle the
@@ -1313,6 +1309,35 @@ UpdateSharedMemoryConfig(void)
 	elog(DEBUG2, "checkpointer updated shared memory configuration values");
 }
 
+/*
+ * Flush any remaining statistics counts for the checkpointer out to
+ * the collector at process exits
+ */
+static void
+pgstat_beshutdown_hook(int code, Datum arg)
+{
+	pgstat_send_checkpointer();
+}
+
+/*
+ * Send the statistics for the checkpointer to the stats collector
+ */
+static void
+pgstat_send_checkpointer(void)
+{
+	/*
+	 * Send off activity statistics to the stats collector.  (The reason why
+	 * we re-use bgwriter-related code for this is that the bgwriter and
+	 * checkpointer used to be just one process.  It's probably not worth the
+	 * trouble to split the stats support into two independent stats message
+	 * types.)
+	 */
+	pgstat_send_bgwriter();
+
+	/* Send WAL statistics to the stats collector. */
+	pgstat_report_wal();
+}
+
 /*
  * FirstCallSinceLastCheckpoint allows a process to take an action once
  * per checkpoint cycle by asynchronously checking for checkpoint completion.
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index 132df29aba..55cd0154bd 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -78,6 +78,9 @@ int			WalWriterFlushAfter = 128;
 #define LOOPS_UNTIL_HIBERNATE		50
 #define HIBERNATE_FACTOR			25
 
+/* Prototypes for private functions */
+static void pgstat_beshutdown_hook(int code, Datum arg);
+
 /*
  * Main entry point for walwriter process
  *
@@ -92,6 +95,9 @@ WalWriterMain(void)
 	int			left_till_hibernate;
 	bool		hibernating;
 
+	/* Arrange to send statistics to the stats collector at walwriter exit */
+	on_shmem_exit(pgstat_beshutdown_hook, 0);
+
 	/*
 	 * Properly accept or ignore signals the postmaster might send us
 	 *
@@ -272,3 +278,14 @@ WalWriterMain(void)
 						 WAIT_EVENT_WAL_WRITER_MAIN);
 	}
 }
+
+/*
+ * Flush any remaining statistics counts for the walwriter out to
+ * the collector at process exits
+ */
+static void
+pgstat_beshutdown_hook(int code, Datum arg)
+{
+	/* Send WAL statistics to the stats collector */
+	pgstat_send_wal(false);
+}

#43

Fujii Masao

masao.fujii@oss.nttdata.com

almost 5 years ago

In reply to: Masahiro Ikeda (#42)

1 attachment(s)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021/03/05 8:38, Masahiro Ikeda wrote:

On 2021-03-05 01:02, Fujii Masao wrote:
On 2021/03/04 16:14, Masahiro Ikeda wrote:
On 2021-03-03 20:27, Masahiro Ikeda wrote:
On 2021-03-03 16:30, Fujii Masao wrote:
On 2021/03/03 14:33, Masahiro Ikeda wrote:

On 2021-02-24 16:14, Fujii Masao wrote:

On 2021/02/15 11:59, Masahiro Ikeda wrote:

On 2021-02-10 00:51, David G. Johnston wrote:

On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

I pgindented the patches.

... <function>XLogWrite</function>, which is invoked during an
<function>XLogFlush</function> request (see ...). This is also
incremented by the WAL receiver during replication.

("which normally called" should be "which is normally called" or
"which normally is called" if you want to keep true to the original)
You missed the adding the space before an opening parenthesis here and
elsewhere (probably copy-paste)

is ether -> is either
"This parameter is off by default as it will repeatedly query the
operating system..."
", because" -> "as"

Thanks, I fixed them.

wal_write_time and the sync items also need the note: "This is also
incremented by the WAL receiver during replication."

I skipped changing it since I separated the stats for the WAL receiver
in pg_stat_wal_receiver.

"The number of times it happened..." -> " (the tally of this event is
reported in wal_buffers_full in....) This is undesirable because ..."

Thanks, I fixed it.

I notice that the patch for WAL receiver doesn't require explicitly
computing the sync statistics but does require computing the write
statistics. This is because of the presence of issue_xlog_fsync but
absence of an equivalent pg_xlog_pwrite. Additionally, I observe that
the XLogWrite code path calls pgstat_report_wait_*() while the WAL
receiver path does not. It seems technically straight-forward to
refactor here to avoid the almost-duplicated logic in the two places,
though I suspect there may be a trade-off for not adding another
function call to the stack given the importance of WAL processing
(though that seems marginalized compared to the cost of actually
writing the WAL). Or, as Fujii noted, go the other way and don't have
any shared code between the two but instead implement the WAL receiver
one to use pg_stat_wal_receiver instead. In either case, this
half-and-half implementation seems undesirable.

OK, as Fujii-san mentioned, I separated the WAL receiver stats.
(v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)

Thanks for updating the patches!

I added the infrastructure code to communicate the WAL receiver stats messages between the WAL receiver and the stats collector, and
the stats for WAL receiver is counted in pg_stat_wal_receiver.
What do you think?

On second thought, this idea seems not good. Because those stats are
collected between multiple walreceivers, but other values in
pg_stat_wal_receiver is only related to the walreceiver process running
at that moment. IOW, it seems strange that some values show dynamic
stats and the others show collected stats, even though they are in
the same view pg_stat_wal_receiver. Thought?

OK, I fixed it.
The stats collected in the WAL receiver is exposed in pg_stat_wal view in v11 patch.

Thanks for updating the patches! I'm now reading 001 patch.
+    /* Check whether the WAL file was synced to disk right now */
+    if (enableFsync &&
+        (sync_method == SYNC_METHOD_FSYNC ||
+         sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
+         sync_method == SYNC_METHOD_FDATASYNC))
+    {
Isn't it better to make issue_xlog_fsync() return immediately
if enableFsync is off, sync_method is open_sync or open_data_sync,
to simplify the code more?
Thanks for the comments.
I added the above code in v12 patch.
+        /*
+         * Send WAL statistics only if WalWriterDelay has elapsed to minimize
+         * the overhead in WAL-writing.
+         */
+        if (rc & WL_TIMEOUT)
+            pgstat_send_wal();
On second thought, this change means that it always takes wal_writer_delay
before walwriter's WAL stats is sent after XLogBackgroundFlush() is called.
For example, if wal_writer_delay is set to several seconds, some values in
pg_stat_wal would be not up-to-date meaninglessly for those seconds.
So I'm thinking to withdraw my previous comment and it's ok to send
the stats every after XLogBackgroundFlush() is called. Thought?
Thanks, I didn't notice that.

Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's
default value is 200msec and it may be set shorter time.
Yeah, if wal_writer_delay is set to very small value, there is a risk
that the WAL stats are sent too frequently. I agree that's a problem.
Why don't to make another way to check the timestamp?
+               /*
+                * Don't send a message unless it's been at least
PGSTAT_STAT_INTERVAL
+                * msec since we last sent one
+                */
+               now = GetCurrentTimestamp();
+               if (TimestampDifferenceExceeds(last_report, now,
PGSTAT_STAT_INTERVAL))
+               {
+                       pgstat_send_wal();
+                       last_report = now;
+               }
+
Although I worried that it's better to add the check code in pgstat_send_wal(),
Agreed.

I didn't do so because to avoid to double check PGSTAT_STAT_INTERVAL.
pgstat_send_wal() is invoked pg_report_stat() and it already checks the
PGSTAT_STAT_INTERVAL.

I think that we can do that. What about the attached patch?
Thanks, I thought it's better.
I forgot to remove an unused variable.
The attached v13 patch is fixed.

Thanks for updating the patch!
+        w.wal_write,
+        w.wal_write_time,
+        w.wal_sync,
+        w.wal_sync_time,
It's more natural to put wal_write_time and wal_sync_time next to
each other? That is, what about the following order of columns?

wal_write
wal_sync
wal_write_time
wal_sync_time
Yes, I fixed it.
-        case SYNC_METHOD_OPEN:
-        case SYNC_METHOD_OPEN_DSYNC:
-            /* write synced it already */
-            break;

IMO it's better to add Assert(false) here to ensure that we never reach
here, as follows. Thought?
+        case SYNC_METHOD_OPEN:
+        case SYNC_METHOD_OPEN_DSYNC:
+            /* not reachable */
+            Assert(false);
I agree.

Even when a backend exits, it sends the stats via pgstat_beshutdown_hook().
On the other hand, walwriter doesn't do that. Walwriter also should send
the stats even at its exit? Otherwise some stats can fail to be collected.
But ISTM that this issue existed from before, for example checkpointer
doesn't call pgstat_send_bgwriter() at its exit, so it's overkill to fix
this issue in this patch?

Thanks, I thought it's better to do so.
I added the shutdown hook for the walwriter and the checkpointer in v14-0003 patch.

Thanks!

Seems you forgot to include the changes of expected/rules.out in 0001 patch,
and which caused the regression test to fail. Attached is the updated version
of the patch. I included expected/rules.out in it.

+	PgStat_Counter m_wal_write_time;	/* time spend writing wal records in
+										 * micro seconds */
+	PgStat_Counter m_wal_sync_time; /* time spend syncing wal records in micro
+									 * seconds */

IMO "spend" should be "spent". Also "micro seconds" should be "microseconds"
in sake of consistent with other comments in pgstat.h. I fixed them.

Regarding pgstat_report_wal() and pgstat_send_wal(), I found one bug. Even
when pgstat_send_wal() returned without sending any message,
pgstat_report_wal() saved current pgWalUsage and that counter was used for
the subsequent calculation of WAL usage. This caused some counters not to
be sent to the collector. This is a bug that I added. I fixed this bug.

+	walStats.wal_write += msg->m_wal_write;
+	walStats.wal_write_time += msg->m_wal_write_time;
+	walStats.wal_sync += msg->m_wal_sync;
+	walStats.wal_sync_time += msg->m_wal_sync_time;

I changed the order of the above in pgstat.c so that wal_write_time and
wal_sync_time are placed in next to each other.

The followings are the comments for the docs part. I've not updated this
in the patch yet because I'm not sure how to change them for now.

+       Number of times WAL buffers were written out to disk via
+       <function>XLogWrite</function>, which is invoked during an
+       <function>XLogFlush</function> request (see <xref linkend="wal-configuration"/>)
+      </para></entry>

XLogWrite() can be invoked during the functions other than XLogFlush().
For example, XLogBackgroundFlush(). So the above description might be
confusing?

+       Number of times WAL files were synced to disk via
+       <function>issue_xlog_fsync</function>, which is invoked during an
+       <function>XLogFlush</function> request (see <xref linkend="wal-configuration"/>)

Same as above.

+       while <xref linkend="guc-wal-sync-method"/> was set to one of the
+       "sync at commit" options (i.e., <literal>fdatasync</literal>,
+       <literal>fsync</literal>, or <literal>fsync_writethrough</literal>).

Even open_sync and open_datasync do the sync at commit. No? I'm not sure
if "sync at commit" is right term to indicate fdatasync, fsync and
fsync_writethrough.

+ <literal>open_sync</literal>. Units are in milliseconds with microsecond resolution.

"with microsecond resolution" part is really necessary?

+   transaction records are flushed to permanent storage.
+   <function>XLogFlush</function> calls <function>XLogWrite</function> to write
+   and <function>issue_xlog_fsync</function> to flush them, which are counted as
+   <literal>wal_write</literal> and <literal>wal_sync</literal> in
+   <xref linkend="pg-stat-wal-view"/>. On systems with high log output,

This description might cause users to misread that XLogFlush() calls
issue_xlog_fsync(). Since issue_xlog_fsync() is called by XLogWrite(),
ISTM that this description needs to be updated.

Each line in the above seems to end with a space character.
This space character should be removed.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachments:

v14-0001-Add-statistics-related-to-write-sync-wal-records_fujii.patchtext/plain; charset=UTF-8; name=v14-0001-Add-statistics-related-to-write-sync-wal-records_fujii.patch; x-mac-creator=0; x-mac-type=0Download

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 967de73596..56eb55bab7 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7450,7 +7450,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       <listitem>
        <para>
         Enables timing of database I/O calls.  This parameter is off by
-        default, because it will repeatedly query the operating system for
+        default, as it will repeatedly query the operating system for
         the current time, which may cause significant overhead on some
         platforms.  You can use the <xref linkend="pgtesttiming"/> tool to
         measure the overhead of timing on your system.
@@ -7464,6 +7464,27 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-track-wal-io-timing" xreflabel="track_wal_io_timing">
+      <term><varname>track_wal_io_timing</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>track_wal_io_timing</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables timing of WAL I/O calls. This parameter is off by default,
+        as it will repeatedly query the operating system for
+        the current time, which may cause significant overhead on some
+        platforms.  You can use the <xref linkend="pgtesttiming"/> tool to
+        measure the overhead of timing on your system.
+        I/O timing information is
+        displayed in <link linkend="monitoring-pg-stat-wal-view">
+        <structname>pg_stat_wal</structname></link>.  Only superusers can
+        change this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-track-functions" xreflabel="track_functions">
       <term><varname>track_functions</varname> (<type>enum</type>)
       <indexterm>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3513e127b7..1520cef505 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3487,6 +3487,62 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </para></entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of times WAL buffers were written out to disk via
+       <function>XLogWrite</function>, which is invoked during an
+       <function>XLogFlush</function> request (see <xref linkend="wal-configuration"/>)
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of times WAL files were synced to disk via 
+       <function>issue_xlog_fsync</function>, which is invoked during an 
+       <function>XLogFlush</function> request (see <xref linkend="wal-configuration"/>)
+       while <xref linkend="guc-wal-sync-method"/> was set to one of the 
+       "sync at commit" options (i.e., <literal>fdatasync</literal>, 
+       <literal>fsync</literal>, or <literal>fsync_writethrough</literal>).
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time spent writing WAL buffers were written out to disk via
+       <function>XLogWrite</function>, which is invoked during an
+       <function>XLogFlush</function> request (see <xref linkend="wal-configuration"/>),
+       excluding sync time unless 
+       <xref linkend="guc-wal-sync-method"/> is either <literal>open_datasync</literal> or 
+       <literal>open_sync</literal>. Units are in milliseconds with microsecond resolution.
+       This is zero when <xref linkend="guc-track-wal-io-timing"/> is disabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time spent syncing WAL files to disk via
+       <function>issue_xlog_fsync</function>, which is invoked during an 
+       <function>XLogFlush</function> request (see <xref linkend="wal-configuration"/>)
+       while <xref linkend="guc-wal-sync-method"/> was set to one of the 
+       "sync at commit" options (i.e., <literal>fdatasync</literal>, 
+       <literal>fsync</literal>, or <literal>fsync_writethrough</literal>).
+       Units are in milliseconds with microsecond resolution.
+       This is zero when <xref linkend="guc-track-wal-io-timing"/> is disabled.
+      </para></entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index f75527f764..06e4b37012 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -663,7 +663,9 @@
    the <acronym>WAL</acronym> buffers in shared memory. If there is no
    space for the new record, <function>XLogInsertRecord</function> will have
    to write (move to kernel cache) a few filled <acronym>WAL</acronym>
-   buffers. This is undesirable because <function>XLogInsertRecord</function>
+   buffers (the tally of this event is reported in 
+   <literal>wal_buffers_full</literal> in <xref linkend="pg-stat-wal-view"/>).
+   This is undesirable because <function>XLogInsertRecord</function> 
    is used on every database low level modification (for example, row
    insertion) at a time when an exclusive lock is held on affected
    data pages, so the operation needs to be as fast as possible.  What
@@ -672,8 +674,12 @@
    time. Normally, <acronym>WAL</acronym> buffers should be written
    and flushed by an <function>XLogFlush</function> request, which is
    made, for the most part, at transaction commit time to ensure that
-   transaction records are flushed to permanent storage. On systems
-   with high log output, <function>XLogFlush</function> requests might
+   transaction records are flushed to permanent storage. 
+   <function>XLogFlush</function> calls <function>XLogWrite</function> to write 
+   and <function>issue_xlog_fsync</function> to flush them, which are counted as 
+   <literal>wal_write</literal> and <literal>wal_sync</literal> in 
+   <xref linkend="pg-stat-wal-view"/>. On systems with high log output, 
+   <function>XLogFlush</function> requests might
    not occur often enough to prevent <function>XLogInsertRecord</function>
    from having to do writes.  On such systems
    one should increase the number of <acronym>WAL</acronym> buffers by
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 377afb8732..18af3d4120 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -110,6 +110,7 @@ int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
 int			max_slot_wal_keep_size_mb = -1;
+bool		track_wal_io_timing = false;
 
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
@@ -2533,6 +2534,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			int			written;
+			instr_time	start;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
@@ -2541,9 +2543,30 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			do
 			{
 				errno = 0;
+
+				/* Measure I/O timing to write WAL data */
+				if (track_wal_io_timing)
+					INSTR_TIME_SET_CURRENT(start);
+
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 				written = pg_pwrite(openLogFile, from, nleft, startoffset);
 				pgstat_report_wait_end();
+
+				/*
+				 * Increment the I/O timing and the number of times WAL data
+				 * were written out to disk.
+				 */
+				if (track_wal_io_timing)
+				{
+					instr_time	duration;
+
+					INSTR_TIME_SET_CURRENT(duration);
+					INSTR_TIME_SUBTRACT(duration, start);
+					WalStats.m_wal_write_time += INSTR_TIME_GET_MICROSEC(duration);
+				}
+
+				WalStats.m_wal_write++;
+
 				if (written <= 0)
 				{
 					char		xlogfname[MAXFNAMELEN];
@@ -10524,6 +10547,20 @@ void
 issue_xlog_fsync(int fd, XLogSegNo segno)
 {
 	char	   *msg = NULL;
+	instr_time	start;
+
+	/*
+	 * Quick exit if fsync is disabled or write() has already synced the WAL
+	 * file.
+	 */
+	if (!enableFsync ||
+		sync_method == SYNC_METHOD_OPEN ||
+		sync_method == SYNC_METHOD_OPEN_DSYNC)
+		return;
+
+	/* Measure I/O timing to sync the WAL file */
+	if (track_wal_io_timing)
+		INSTR_TIME_SET_CURRENT(start);
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
 	switch (sync_method)
@@ -10546,7 +10583,8 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
 #endif
 		case SYNC_METHOD_OPEN:
 		case SYNC_METHOD_OPEN_DSYNC:
-			/* write synced it already */
+			/* not reachable */
+			Assert(false);
 			break;
 		default:
 			elog(PANIC, "unrecognized wal_sync_method: %d", sync_method);
@@ -10568,6 +10606,20 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
 	}
 
 	pgstat_report_wait_end();
+
+	/*
+	 * Increment the I/O timing and the number of times WAL files were synced.
+	 */
+	if (track_wal_io_timing)
+	{
+		instr_time	duration;
+
+		INSTR_TIME_SET_CURRENT(duration);
+		INSTR_TIME_SUBTRACT(duration, start);
+		WalStats.m_wal_sync_time += INSTR_TIME_GET_MICROSEC(duration);
+	}
+
+	WalStats.m_wal_sync++;
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fc94a73a54..51ba1b5826 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1005,6 +1005,10 @@ CREATE VIEW pg_stat_wal AS
         w.wal_fpi,
         w.wal_bytes,
         w.wal_buffers_full,
+        w.wal_write,
+        w.wal_sync,
+        w.wal_write_time,
+        w.wal_sync_time,
         w.stats_reset
     FROM pg_stat_get_wal() w;
 
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 76f9f98ebb..57c4d5a5d9 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -505,7 +505,7 @@ CheckpointerMain(void)
 		pgstat_send_bgwriter();
 
 		/* Send WAL statistics to the stats collector. */
-		pgstat_send_wal();
+		pgstat_report_wal();
 
 		/*
 		 * If any checkpoint flags have been set, redo the loop to handle the
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f75b52719d..6a51c39396 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -146,8 +146,8 @@ PgStat_MsgWal WalStats;
 
 /*
  * WAL usage counters saved from pgWALUsage at the previous call to
- * pgstat_send_wal(). This is used to calculate how much WAL usage
- * happens between pgstat_send_wal() calls, by substracting
+ * pgstat_report_wal(). This is used to calculate how much WAL usage
+ * happens between pgstat_report_wal() calls, by substracting
  * the previous counters from the current ones.
  */
 static WalUsage prevWalUsage;
@@ -975,7 +975,7 @@ pgstat_report_stat(bool disconnect)
 	pgstat_send_funcstats();
 
 	/* Send WAL statistics */
-	pgstat_send_wal();
+	pgstat_report_wal();
 
 	/* Finally send SLRU statistics */
 	pgstat_send_slru();
@@ -3118,7 +3118,7 @@ pgstat_initialize(void)
 	}
 
 	/*
-	 * Initialize prevWalUsage with pgWalUsage so that pgstat_send_wal() can
+	 * Initialize prevWalUsage with pgWalUsage so that pgstat_report_wal() can
 	 * calculate how much pgWalUsage counters are increased by substracting
 	 * prevWalUsage from pgWalUsage.
 	 */
@@ -4667,17 +4667,17 @@ pgstat_send_bgwriter(void)
 }
 
 /* ----------
- * pgstat_send_wal() -
+ * pgstat_report_wal() -
  *
- *		Send WAL statistics to the collector
+ * Calculate how much WAL usage counters are increased and send
+ * WAL statistics to the collector.
+ *
+ * Must be called by processes that generate WAL.
  * ----------
  */
 void
-pgstat_send_wal(void)
+pgstat_report_wal(void)
 {
-	/* We assume this initializes to zeroes */
-	static const PgStat_MsgWal all_zeroes;
-
 	WalUsage	walusage;
 
 	/*
@@ -4692,13 +4692,56 @@ pgstat_send_wal(void)
 	WalStats.m_wal_fpi = walusage.wal_fpi;
 	WalStats.m_wal_bytes = walusage.wal_bytes;
 
+	/*
+	 * Send WAL stats message to the collector.
+	 */
+	if (!pgstat_send_wal(true))
+		return;
+
+	/*
+	 * Save the current counters for the subsequent calculation of WAL usage.
+	 */
+	prevWalUsage = pgWalUsage;
+}
+
+/* ----------
+ * pgstat_send_wal() -
+ *
+ *	Send WAL statistics to the collector.
+ *
+ * If 'force' is not set, WAL stats message is only sent if enough time has
+ * passed since last one was sent to reach PGSTAT_STAT_INTERVAL.
+ *
+ * Return true if the message is sent, and false otherwise.
+ * ----------
+ */
+bool
+pgstat_send_wal(bool force)
+{
+	/* We assume this initializes to zeroes */
+	static const PgStat_MsgWal all_zeroes;
+	static TimestampTz sendTime = 0;
+
 	/*
 	 * This function can be called even if nothing at all has happened. In
 	 * this case, avoid sending a completely empty message to the stats
 	 * collector.
 	 */
 	if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
-		return;
+		return false;
+
+	if (!force)
+	{
+		TimestampTz now = GetCurrentTimestamp();
+
+		/*
+		 * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
+		 * msec since we last sent one.
+		 */
+		if (!TimestampDifferenceExceeds(sendTime, now, PGSTAT_STAT_INTERVAL))
+			return false;
+		sendTime = now;
+	}
 
 	/*
 	 * Prepare and send the message
@@ -4706,15 +4749,12 @@ pgstat_send_wal(void)
 	pgstat_setheader(&WalStats.m_hdr, PGSTAT_MTYPE_WAL);
 	pgstat_send(&WalStats, sizeof(WalStats));
 
-	/*
-	 * Save the current counters for the subsequent calculation of WAL usage.
-	 */
-	prevWalUsage = pgWalUsage;
-
 	/*
 	 * Clear out the statistics buffer, so it can be re-used.
 	 */
 	MemSet(&WalStats, 0, sizeof(WalStats));
+
+	return true;
 }
 
 /* ----------
@@ -6892,6 +6932,10 @@ pgstat_recv_wal(PgStat_MsgWal *msg, int len)
 	walStats.wal_fpi += msg->m_wal_fpi;
 	walStats.wal_bytes += msg->m_wal_bytes;
 	walStats.wal_buffers_full += msg->m_wal_buffers_full;
+	walStats.wal_write += msg->m_wal_write;
+	walStats.wal_sync += msg->m_wal_sync;
+	walStats.wal_write_time += msg->m_wal_write_time;
+	walStats.wal_sync_time += msg->m_wal_sync_time;
 }
 
 /* ----------
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index 4f1a8e356b..132df29aba 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -253,6 +253,9 @@ WalWriterMain(void)
 		else if (left_till_hibernate > 0)
 			left_till_hibernate--;
 
+		/* Send WAL statistics to the stats collector */
+		pgstat_send_wal(false);
+
 		/*
 		 * Sleep until we are signaled or WalWriterDelay has elapsed.  If we
 		 * haven't done anything useful for quite some time, lengthen the
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 62bff52638..04e1a7e8b9 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1799,7 +1799,7 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_wal(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_COLS	5
+#define PG_STAT_GET_WAL_COLS	9
 	TupleDesc	tupdesc;
 	Datum		values[PG_STAT_GET_WAL_COLS];
 	bool		nulls[PG_STAT_GET_WAL_COLS];
@@ -1820,7 +1820,15 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 					   NUMERICOID, -1, 0);
 	TupleDescInitEntry(tupdesc, (AttrNumber) 4, "wal_buffers_full",
 					   INT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 5, "stats_reset",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 5, "wal_write",
+					   INT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 6, "wal_sync",
+					   INT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 7, "wal_write_time",
+					   FLOAT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 8, "wal_sync_time",
+					   FLOAT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 9, "stats_reset",
 					   TIMESTAMPTZOID, -1, 0);
 
 	BlessTupleDesc(tupdesc);
@@ -1840,7 +1848,14 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 									Int32GetDatum(-1));
 
 	values[3] = Int64GetDatum(wal_stats->wal_buffers_full);
-	values[4] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
+	values[4] = Int64GetDatum(wal_stats->wal_write);
+	values[5] = Int64GetDatum(wal_stats->wal_sync);
+
+	/* Convert counters from microsec to millisec for display */
+	values[6] = Float8GetDatum(((double) wal_stats->wal_write_time) / 1000.0);
+	values[7] = Float8GetDatum(((double) wal_stats->wal_sync_time) / 1000.0);
+
+	values[8] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
 
 	/* Returns the record as Datum */
 	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 3fd1a5fbe2..e337df42cb 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1485,6 +1485,15 @@ static struct config_bool ConfigureNamesBool[] =
 		false,
 		NULL, NULL, NULL
 	},
+	{
+		{"track_wal_io_timing", PGC_SUSET, STATS_COLLECTOR,
+			gettext_noop("Collects timing statistics for WAL I/O activity."),
+			NULL
+		},
+		&track_wal_io_timing,
+		false,
+		NULL, NULL, NULL
+	},
 
 	{
 		{"update_process_title", PGC_SUSET, PROCESS_TITLE,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index ee06528bb0..c6483fa1ff 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -586,6 +586,7 @@
 #track_activities = on
 #track_counts = on
 #track_io_timing = off
+#track_wal_io_timing = off
 #track_functions = none			# none, pl, all
 #track_activity_query_size = 1024	# (change requires restart)
 #stats_temp_directory = 'pg_stat_tmp'
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 75ec1073bd..1e53d9d4ca 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -131,6 +131,7 @@ extern int	recovery_min_apply_delay;
 extern char *PrimaryConnInfo;
 extern char *PrimarySlotName;
 extern bool wal_receiver_create_temp_slot;
+extern bool track_wal_io_timing;
 
 /* indirectly set via GUC system */
 extern TransactionId recoveryTargetXid;
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 59d2b71ca9..4b0aadb425 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5545,9 +5545,9 @@
 { oid => '1136', descr => 'statistics: information about WAL activity',
   proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
   proparallel => 'r', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int8,int8,numeric,int8,timestamptz}',
-  proargmodes => '{o,o,o,o,o}',
-  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,stats_reset}',
+  proallargtypes => '{int8,int8,numeric,int8,int8,int8,float8,float8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o}',
+  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,wal_write,wal_sync,wal_write_time,wal_sync_time,stats_reset}',
   prosrc => 'pg_stat_get_wal' },
 
 { oid => '2306', descr => 'statistics: information about SLRU caches',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 724068cf87..5c6c8efc5a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -474,6 +474,12 @@ typedef struct PgStat_MsgWal
 	PgStat_Counter m_wal_fpi;
 	uint64		m_wal_bytes;
 	PgStat_Counter m_wal_buffers_full;
+	PgStat_Counter m_wal_write;
+	PgStat_Counter m_wal_sync;
+	PgStat_Counter m_wal_write_time;	/* time spent writing wal records in
+										 * microseconds */
+	PgStat_Counter m_wal_sync_time; /* time spent syncing wal records in
+									 * microseconds */
 } PgStat_MsgWal;
 
 /* ----------
@@ -839,6 +845,10 @@ typedef struct PgStat_WalStats
 	PgStat_Counter wal_fpi;
 	uint64		wal_bytes;
 	PgStat_Counter wal_buffers_full;
+	PgStat_Counter wal_write;
+	PgStat_Counter wal_sync;
+	PgStat_Counter wal_write_time;
+	PgStat_Counter wal_sync_time;
 	TimestampTz stat_reset_timestamp;
 } PgStat_WalStats;
 
@@ -1590,7 +1600,8 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
 
 extern void pgstat_send_archiver(const char *xlog, bool failed);
 extern void pgstat_send_bgwriter(void);
-extern void pgstat_send_wal(void);
+extern void pgstat_report_wal(void);
+extern bool pgstat_send_wal(bool force);
 
 /* ----------
  * Support functions for the SQL-callable functions to
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index b1c9b7bdfe..e03ef0555a 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2160,8 +2160,12 @@ pg_stat_wal| SELECT w.wal_records,
     w.wal_fpi,
     w.wal_bytes,
     w.wal_buffers_full,
+    w.wal_write,
+    w.wal_sync,
+    w.wal_write_time,
+    w.wal_sync_time,
     w.stats_reset
-   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, stats_reset);
+   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, wal_write, wal_sync, wal_write_time, wal_sync_time, stats_reset);
 pg_stat_wal_receiver| SELECT s.pid,
     s.status,
     s.receive_start_lsn,

#44

Masahiro Ikeda

ikedamsh@oss.nttdata.com

almost 5 years ago

In reply to: Fujii Masao (#43)

1 attachment(s)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021-03-05 12:47, Fujii Masao wrote:

On 2021/03/05 8:38, Masahiro Ikeda wrote:
On 2021-03-05 01:02, Fujii Masao wrote:
On 2021/03/04 16:14, Masahiro Ikeda wrote:
On 2021-03-03 20:27, Masahiro Ikeda wrote:
On 2021-03-03 16:30, Fujii Masao wrote:
On 2021/03/03 14:33, Masahiro Ikeda wrote:

On 2021-02-24 16:14, Fujii Masao wrote:

On 2021/02/15 11:59, Masahiro Ikeda wrote:

On 2021-02-10 00:51, David G. Johnston wrote:

On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

I pgindented the patches.

... <function>XLogWrite</function>, which is invoked during an
<function>XLogFlush</function> request (see ...). This is
also
incremented by the WAL receiver during replication.

("which normally called" should be "which is normally called"
or
"which normally is called" if you want to keep true to the
original)
You missed the adding the space before an opening parenthesis
here and
elsewhere (probably copy-paste)

is ether -> is either
"This parameter is off by default as it will repeatedly query
the
operating system..."
", because" -> "as"

Thanks, I fixed them.

wal_write_time and the sync items also need the note: "This is
also
incremented by the WAL receiver during replication."

I skipped changing it since I separated the stats for the WAL
receiver
in pg_stat_wal_receiver.

"The number of times it happened..." -> " (the tally of this
event is
reported in wal_buffers_full in....) This is undesirable
because ..."

Thanks, I fixed it.

I notice that the patch for WAL receiver doesn't require
explicitly
computing the sync statistics but does require computing the
write
statistics. This is because of the presence of
issue_xlog_fsync but
absence of an equivalent pg_xlog_pwrite. Additionally, I
observe that
the XLogWrite code path calls pgstat_report_wait_*() while the
WAL
receiver path does not. It seems technically straight-forward
to
refactor here to avoid the almost-duplicated logic in the two
places,
though I suspect there may be a trade-off for not adding
another
function call to the stack given the importance of WAL
processing
(though that seems marginalized compared to the cost of
actually
writing the WAL). Or, as Fujii noted, go the other way and
don't have
any shared code between the two but instead implement the WAL
receiver
one to use pg_stat_wal_receiver instead. In either case, this
half-and-half implementation seems undesirable.

OK, as Fujii-san mentioned, I separated the WAL receiver stats.
(v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)

Thanks for updating the patches!

I added the infrastructure code to communicate the WAL receiver
stats messages between the WAL receiver and the stats
collector, and
the stats for WAL receiver is counted in pg_stat_wal_receiver.
What do you think?

On second thought, this idea seems not good. Because those stats
are
collected between multiple walreceivers, but other values in
pg_stat_wal_receiver is only related to the walreceiver process
running
at that moment. IOW, it seems strange that some values show
dynamic
stats and the others show collected stats, even though they are
in
the same view pg_stat_wal_receiver. Thought?

OK, I fixed it.
The stats collected in the WAL receiver is exposed in pg_stat_wal
view in v11 patch.

Thanks for updating the patches! I'm now reading 001 patch.
+    /* Check whether the WAL file was synced to disk right now */
+    if (enableFsync &&
+        (sync_method == SYNC_METHOD_FSYNC ||
+         sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
+         sync_method == SYNC_METHOD_FDATASYNC))
+    {
Isn't it better to make issue_xlog_fsync() return immediately
if enableFsync is off, sync_method is open_sync or open_data_sync,
to simplify the code more?
Thanks for the comments.
I added the above code in v12 patch.
+        /*
+         * Send WAL statistics only if WalWriterDelay has elapsed 
to minimize
+         * the overhead in WAL-writing.
+         */
+        if (rc & WL_TIMEOUT)
+            pgstat_send_wal();
On second thought, this change means that it always takes
wal_writer_delay
before walwriter's WAL stats is sent after XLogBackgroundFlush()
is called.
For example, if wal_writer_delay is set to several seconds, some
values in
pg_stat_wal would be not up-to-date meaninglessly for those
seconds.
So I'm thinking to withdraw my previous comment and it's ok to
send
the stats every after XLogBackgroundFlush() is called. Thought?
Thanks, I didn't notice that.

Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's
default value is 200msec and it may be set shorter time.
Yeah, if wal_writer_delay is set to very small value, there is a risk
that the WAL stats are sent too frequently. I agree that's a problem.
Why don't to make another way to check the timestamp?
+               /*
+                * Don't send a message unless it's been at least
PGSTAT_STAT_INTERVAL
+                * msec since we last sent one
+                */
+               now = GetCurrentTimestamp();
+               if (TimestampDifferenceExceeds(last_report, now,
PGSTAT_STAT_INTERVAL))
+               {
+                       pgstat_send_wal();
+                       last_report = now;
+               }
+
Although I worried that it's better to add the check code in
pgstat_send_wal(),
Agreed.

I didn't do so because to avoid to double check
PGSTAT_STAT_INTERVAL.
pgstat_send_wal() is invoked pg_report_stat() and it already checks
the
PGSTAT_STAT_INTERVAL.

I think that we can do that. What about the attached patch?
Thanks, I thought it's better.
I forgot to remove an unused variable.
The attached v13 patch is fixed.

Thanks for updating the patch!
+        w.wal_write,
+        w.wal_write_time,
+        w.wal_sync,
+        w.wal_sync_time,
It's more natural to put wal_write_time and wal_sync_time next to
each other? That is, what about the following order of columns?

wal_write
wal_sync
wal_write_time
wal_sync_time
Yes, I fixed it.
-        case SYNC_METHOD_OPEN:
-        case SYNC_METHOD_OPEN_DSYNC:
-            /* write synced it already */
-            break;

IMO it's better to add Assert(false) here to ensure that we never
reach
here, as follows. Thought?
+        case SYNC_METHOD_OPEN:
+        case SYNC_METHOD_OPEN_DSYNC:
+            /* not reachable */
+            Assert(false);
I agree.

Even when a backend exits, it sends the stats via
pgstat_beshutdown_hook().
On the other hand, walwriter doesn't do that. Walwriter also should
send
the stats even at its exit? Otherwise some stats can fail to be
collected.
But ISTM that this issue existed from before, for example
checkpointer
doesn't call pgstat_send_bgwriter() at its exit, so it's overkill to
fix
this issue in this patch?

Thanks, I thought it's better to do so.
I added the shutdown hook for the walwriter and the checkpointer in
v14-0003 patch.
Thanks!

Seems you forgot to include the changes of expected/rules.out in 0001
patch,
and which caused the regression test to fail. Attached is the updated
version
of the patch. I included expected/rules.out in it.

Sorry.

+	PgStat_Counter m_wal_write_time;	/* time spend writing wal records in
+										 * micro seconds */
+	PgStat_Counter m_wal_sync_time; /* time spend syncing wal records in 
micro
+									 * seconds */
IMO "spend" should be "spent". Also "micro seconds" should be
"microseconds"
in sake of consistent with other comments in pgstat.h. I fixed them.

Thanks.

Regarding pgstat_report_wal() and pgstat_send_wal(), I found one bug.
Even
when pgstat_send_wal() returned without sending any message,
pgstat_report_wal() saved current pgWalUsage and that counter was used
for
the subsequent calculation of WAL usage. This caused some counters not
to
be sent to the collector. This is a bug that I added. I fixed this bug.

Thanks.

+	walStats.wal_write += msg->m_wal_write;
+	walStats.wal_write_time += msg->m_wal_write_time;
+	walStats.wal_sync += msg->m_wal_sync;
+	walStats.wal_sync_time += msg->m_wal_sync_time;
I changed the order of the above in pgstat.c so that wal_write_time and
wal_sync_time are placed in next to each other.

I forgot to fix them, thanks.

The followings are the comments for the docs part. I've not updated 
this
in the patch yet because I'm not sure how to change them for now.
+       Number of times WAL buffers were written out to disk via
+       <function>XLogWrite</function>, which is invoked during an
+       <function>XLogFlush</function> request (see <xref
linkend="wal-configuration"/>)
+      </para></entry>

XLogWrite() can be invoked during the functions other than XLogFlush().
For example, XLogBackgroundFlush(). So the above description might be
confusing?

+       Number of times WAL files were synced to disk via
+       <function>issue_xlog_fsync</function>, which is invoked during 
an
+       <function>XLogFlush</function> request (see <xref
linkend="wal-configuration"/>)

Same as above.

Yes, why don't you remove "XLogFlush" in the above comments
because XLogWrite() description is covered in wal.sgml?

But, now it's mentioned only for backend,
I added the comments for the wal writer in the attached patch.

+       while <xref linkend="guc-wal-sync-method"/> was set to one of 
the
+       "sync at commit" options (i.e., <literal>fdatasync</literal>,
+       <literal>fsync</literal>, or 
<literal>fsync_writethrough</literal>).
Even open_sync and open_datasync do the sync at commit. No? I'm not
sure
if "sync at commit" is right term to indicate fdatasync, fsync and
fsync_writethrough.

Yes, why don't you change to the following comments?

```
while <xref linkend="guc-wal-sync-method"/> was set to one of the
options which specific fsync method is called (i.e.,
<literal>fdatasync</literal>,
<literal>fsync</literal>, or
<literal>fsync_writethrough</literal>)
```

+ <literal>open_sync</literal>. Units are in milliseconds with
microsecond resolution.

"with microsecond resolution" part is really necessary?

I removed it because blk_read_time in pg_stat_database is the same
above,
but it doesn't mention it.

+   transaction records are flushed to permanent storage.
+   <function>XLogFlush</function> calls <function>XLogWrite</function> 
to write
+   and <function>issue_xlog_fsync</function> to flush them, which are
counted as
+   <literal>wal_write</literal> and <literal>wal_sync</literal> in
+   <xref linkend="pg-stat-wal-view"/>. On systems with high log 
output,
This description might cause users to misread that XLogFlush() calls
issue_xlog_fsync(). Since issue_xlog_fsync() is called by XLogWrite(),
ISTM that this description needs to be updated.

I understood. I fixed to mention that XLogWrite()
calls issue_xlog_fsync().

Each line in the above seems to end with a space character.
This space character should be removed.

Sorry for that. I removed it.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachments:

v15-0001-Add-statistics-related-to-write-sync-wal-records.patchtext/x-diff; name=v15-0001-Add-statistics-related-to-write-sync-wal-records.patchDownload

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 967de73596..56eb55bab7 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7450,7 +7450,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       <listitem>
        <para>
         Enables timing of database I/O calls.  This parameter is off by
-        default, because it will repeatedly query the operating system for
+        default, as it will repeatedly query the operating system for
         the current time, which may cause significant overhead on some
         platforms.  You can use the <xref linkend="pgtesttiming"/> tool to
         measure the overhead of timing on your system.
@@ -7464,6 +7464,27 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-track-wal-io-timing" xreflabel="track_wal_io_timing">
+      <term><varname>track_wal_io_timing</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>track_wal_io_timing</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables timing of WAL I/O calls. This parameter is off by default,
+        as it will repeatedly query the operating system for
+        the current time, which may cause significant overhead on some
+        platforms.  You can use the <xref linkend="pgtesttiming"/> tool to
+        measure the overhead of timing on your system.
+        I/O timing information is
+        displayed in <link linkend="monitoring-pg-stat-wal-view">
+        <structname>pg_stat_wal</structname></link>.  Only superusers can
+        change this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-track-functions" xreflabel="track_functions">
       <term><varname>track_functions</varname> (<type>enum</type>)
       <indexterm>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3513e127b7..9d207756b9 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2216,7 +2216,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
 <programlisting>
 SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event is NOT NULL;
- pid  | wait_event_type | wait_event 
+ pid  | wait_event_type | wait_event
 ------+-----------------+------------
  2540 | Lock            | relation
  6644 | LWLock          | ProcArray
@@ -3487,6 +3487,58 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </para></entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of times WAL buffers were written out to disk via
+       <function>XLogWrite</function> request (see <xref linkend="wal-configuration"/>)
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of times WAL files were synced to disk via
+       <function>issue_xlog_fsync</function> (see <xref linkend="wal-configuration"/>)
+       while <xref linkend="guc-wal-sync-method"/> was set to one of the
+       options which specific fsync method is called (i.e., <literal>fdatasync</literal>,
+       <literal>fsync</literal>, or <literal>fsync_writethrough</literal>).
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time spent writing WAL buffers were written out to disk via
+       <function>XLogWrite</function> request (see <xref linkend="wal-configuration"/>),
+       excluding sync time unless
+       <xref linkend="guc-wal-sync-method"/> is either <literal>open_datasync</literal> or
+       <literal>open_sync</literal>. Units are in milliseconds.
+       This is zero when <xref linkend="guc-track-wal-io-timing"/> is disabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time spent syncing WAL files to disk via
+       <function>issue_xlog_fsync</function> request (see <xref linkend="wal-configuration"/>)
+       while <xref linkend="guc-wal-sync-method"/> was set to one of the
+       options which specific fsync method is called (i.e., <literal>fdatasync</literal>,
+       <literal>fsync</literal>, or <literal>fsync_writethrough</literal>).
+       Units are in milliseconds.
+       This is zero when <xref linkend="guc-track-wal-io-timing"/> is disabled.
+      </para></entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index f75527f764..0ce84ed733 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -270,7 +270,7 @@
 
    <para>
     The <link linkend="app-pgchecksums"><application>pg_checksums</application></link>
-    application can be used to enable or disable data checksums, as well as 
+    application can be used to enable or disable data checksums, as well as
     verify checksums, on an offline cluster.
    </para>
 
@@ -437,7 +437,11 @@
    The duration of the
    risk window is limited because a background process (the <quote>WAL
    writer</quote>) flushes unwritten <acronym>WAL</acronym> records to disk
-   every <xref linkend="guc-wal-writer-delay"/> milliseconds.
+   every <xref linkend="guc-wal-writer-delay"/> milliseconds, which calls
+   <function>XLogWrite</function> to write and <function>XLogWrite</function>
+   <function>issue_xlog_fsync</function> to flush them. They are counted as
+   <literal>wal_write</literal> and <literal>wal_sync</literal> in
+   <xref linkend="pg-stat-wal-view"/>.
    The actual maximum duration of the risk window is three times
    <varname>wal_writer_delay</varname> because the WAL writer is
    designed to favor writing whole pages at a time during busy periods.
@@ -662,8 +666,10 @@
    <function>XLogInsertRecord</function> is used to place a new record into
    the <acronym>WAL</acronym> buffers in shared memory. If there is no
    space for the new record, <function>XLogInsertRecord</function> will have
-   to write (move to kernel cache) a few filled <acronym>WAL</acronym>
-   buffers. This is undesirable because <function>XLogInsertRecord</function>
+   to call <function>XLogWrite</function> to write (move to kernel cache) a
+   few filled <acronym>WAL</acronym> buffers (the tally of this event is reported in
+   <literal>wal_buffers_full</literal> in <xref linkend="pg-stat-wal-view"/>).
+   This is undesirable because <function>XLogInsertRecord</function>
    is used on every database low level modification (for example, row
    insertion) at a time when an exclusive lock is held on affected
    data pages, so the operation needs to be as fast as possible.  What
@@ -672,7 +678,11 @@
    time. Normally, <acronym>WAL</acronym> buffers should be written
    and flushed by an <function>XLogFlush</function> request, which is
    made, for the most part, at transaction commit time to ensure that
-   transaction records are flushed to permanent storage. On systems
+   transaction records are flushed to permanent storage.
+   <function>XLogFlush</function> calls <function>XLogWrite</function> to write
+   and <function>XLogWrite</function> calls <function>issue_xlog_fsync</function>
+   to flush them. They are counted as <literal>wal_write</literal> and
+   <literal>wal_sync</literal> in <xref linkend="pg-stat-wal-view"/>. On systems
    with high log output, <function>XLogFlush</function> requests might
    not occur often enough to prevent <function>XLogInsertRecord</function>
    from having to do writes.  On such systems
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fe56324439..24c3dd32f8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -110,6 +110,7 @@ int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
 int			max_slot_wal_keep_size_mb = -1;
+bool		track_wal_io_timing = false;
 
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
@@ -2533,6 +2534,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			int			written;
+			instr_time	start;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
@@ -2541,9 +2543,30 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			do
 			{
 				errno = 0;
+
+				/* Measure I/O timing to write WAL data */
+				if (track_wal_io_timing)
+					INSTR_TIME_SET_CURRENT(start);
+
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 				written = pg_pwrite(openLogFile, from, nleft, startoffset);
 				pgstat_report_wait_end();
+
+				/*
+				 * Increment the I/O timing and the number of times WAL data
+				 * were written out to disk.
+				 */
+				if (track_wal_io_timing)
+				{
+					instr_time	duration;
+
+					INSTR_TIME_SET_CURRENT(duration);
+					INSTR_TIME_SUBTRACT(duration, start);
+					WalStats.m_wal_write_time += INSTR_TIME_GET_MICROSEC(duration);
+				}
+
+				WalStats.m_wal_write++;
+
 				if (written <= 0)
 				{
 					char		xlogfname[MAXFNAMELEN];
@@ -10526,6 +10549,20 @@ void
 issue_xlog_fsync(int fd, XLogSegNo segno)
 {
 	char	   *msg = NULL;
+	instr_time	start;
+
+	/*
+	 * Quick exit if fsync is disabled or write() has already synced the WAL
+	 * file.
+	 */
+	if (!enableFsync ||
+		sync_method == SYNC_METHOD_OPEN ||
+		sync_method == SYNC_METHOD_OPEN_DSYNC)
+		return;
+
+	/* Measure I/O timing to sync the WAL file */
+	if (track_wal_io_timing)
+		INSTR_TIME_SET_CURRENT(start);
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
 	switch (sync_method)
@@ -10548,7 +10585,8 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
 #endif
 		case SYNC_METHOD_OPEN:
 		case SYNC_METHOD_OPEN_DSYNC:
-			/* write synced it already */
+			/* not reachable */
+			Assert(false);
 			break;
 		default:
 			elog(PANIC, "unrecognized wal_sync_method: %d", sync_method);
@@ -10570,6 +10608,20 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
 	}
 
 	pgstat_report_wait_end();
+
+	/*
+	 * Increment the I/O timing and the number of times WAL files were synced.
+	 */
+	if (track_wal_io_timing)
+	{
+		instr_time	duration;
+
+		INSTR_TIME_SET_CURRENT(duration);
+		INSTR_TIME_SUBTRACT(duration, start);
+		WalStats.m_wal_sync_time += INSTR_TIME_GET_MICROSEC(duration);
+	}
+
+	WalStats.m_wal_sync++;
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fc94a73a54..51ba1b5826 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1005,6 +1005,10 @@ CREATE VIEW pg_stat_wal AS
         w.wal_fpi,
         w.wal_bytes,
         w.wal_buffers_full,
+        w.wal_write,
+        w.wal_sync,
+        w.wal_write_time,
+        w.wal_sync_time,
         w.stats_reset
     FROM pg_stat_get_wal() w;
 
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 0bbeece19d..3894f4a270 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -505,7 +505,7 @@ CheckpointerMain(void)
 		pgstat_send_bgwriter();
 
 		/* Send WAL statistics to the stats collector. */
-		pgstat_send_wal();
+		pgstat_report_wal();
 
 		/*
 		 * If any checkpoint flags have been set, redo the loop to handle the
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f75b52719d..6a51c39396 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -146,8 +146,8 @@ PgStat_MsgWal WalStats;
 
 /*
  * WAL usage counters saved from pgWALUsage at the previous call to
- * pgstat_send_wal(). This is used to calculate how much WAL usage
- * happens between pgstat_send_wal() calls, by substracting
+ * pgstat_report_wal(). This is used to calculate how much WAL usage
+ * happens between pgstat_report_wal() calls, by substracting
  * the previous counters from the current ones.
  */
 static WalUsage prevWalUsage;
@@ -975,7 +975,7 @@ pgstat_report_stat(bool disconnect)
 	pgstat_send_funcstats();
 
 	/* Send WAL statistics */
-	pgstat_send_wal();
+	pgstat_report_wal();
 
 	/* Finally send SLRU statistics */
 	pgstat_send_slru();
@@ -3118,7 +3118,7 @@ pgstat_initialize(void)
 	}
 
 	/*
-	 * Initialize prevWalUsage with pgWalUsage so that pgstat_send_wal() can
+	 * Initialize prevWalUsage with pgWalUsage so that pgstat_report_wal() can
 	 * calculate how much pgWalUsage counters are increased by substracting
 	 * prevWalUsage from pgWalUsage.
 	 */
@@ -4667,17 +4667,17 @@ pgstat_send_bgwriter(void)
 }
 
 /* ----------
- * pgstat_send_wal() -
+ * pgstat_report_wal() -
  *
- *		Send WAL statistics to the collector
+ * Calculate how much WAL usage counters are increased and send
+ * WAL statistics to the collector.
+ *
+ * Must be called by processes that generate WAL.
  * ----------
  */
 void
-pgstat_send_wal(void)
+pgstat_report_wal(void)
 {
-	/* We assume this initializes to zeroes */
-	static const PgStat_MsgWal all_zeroes;
-
 	WalUsage	walusage;
 
 	/*
@@ -4692,13 +4692,56 @@ pgstat_send_wal(void)
 	WalStats.m_wal_fpi = walusage.wal_fpi;
 	WalStats.m_wal_bytes = walusage.wal_bytes;
 
+	/*
+	 * Send WAL stats message to the collector.
+	 */
+	if (!pgstat_send_wal(true))
+		return;
+
+	/*
+	 * Save the current counters for the subsequent calculation of WAL usage.
+	 */
+	prevWalUsage = pgWalUsage;
+}
+
+/* ----------
+ * pgstat_send_wal() -
+ *
+ *	Send WAL statistics to the collector.
+ *
+ * If 'force' is not set, WAL stats message is only sent if enough time has
+ * passed since last one was sent to reach PGSTAT_STAT_INTERVAL.
+ *
+ * Return true if the message is sent, and false otherwise.
+ * ----------
+ */
+bool
+pgstat_send_wal(bool force)
+{
+	/* We assume this initializes to zeroes */
+	static const PgStat_MsgWal all_zeroes;
+	static TimestampTz sendTime = 0;
+
 	/*
 	 * This function can be called even if nothing at all has happened. In
 	 * this case, avoid sending a completely empty message to the stats
 	 * collector.
 	 */
 	if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
-		return;
+		return false;
+
+	if (!force)
+	{
+		TimestampTz now = GetCurrentTimestamp();
+
+		/*
+		 * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
+		 * msec since we last sent one.
+		 */
+		if (!TimestampDifferenceExceeds(sendTime, now, PGSTAT_STAT_INTERVAL))
+			return false;
+		sendTime = now;
+	}
 
 	/*
 	 * Prepare and send the message
@@ -4706,15 +4749,12 @@ pgstat_send_wal(void)
 	pgstat_setheader(&WalStats.m_hdr, PGSTAT_MTYPE_WAL);
 	pgstat_send(&WalStats, sizeof(WalStats));
 
-	/*
-	 * Save the current counters for the subsequent calculation of WAL usage.
-	 */
-	prevWalUsage = pgWalUsage;
-
 	/*
 	 * Clear out the statistics buffer, so it can be re-used.
 	 */
 	MemSet(&WalStats, 0, sizeof(WalStats));
+
+	return true;
 }
 
 /* ----------
@@ -6892,6 +6932,10 @@ pgstat_recv_wal(PgStat_MsgWal *msg, int len)
 	walStats.wal_fpi += msg->m_wal_fpi;
 	walStats.wal_bytes += msg->m_wal_bytes;
 	walStats.wal_buffers_full += msg->m_wal_buffers_full;
+	walStats.wal_write += msg->m_wal_write;
+	walStats.wal_sync += msg->m_wal_sync;
+	walStats.wal_write_time += msg->m_wal_write_time;
+	walStats.wal_sync_time += msg->m_wal_sync_time;
 }
 
 /* ----------
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index 4f1a8e356b..132df29aba 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -253,6 +253,9 @@ WalWriterMain(void)
 		else if (left_till_hibernate > 0)
 			left_till_hibernate--;
 
+		/* Send WAL statistics to the stats collector */
+		pgstat_send_wal(false);
+
 		/*
 		 * Sleep until we are signaled or WalWriterDelay has elapsed.  If we
 		 * haven't done anything useful for quite some time, lengthen the
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 62bff52638..04e1a7e8b9 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1799,7 +1799,7 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_wal(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_COLS	5
+#define PG_STAT_GET_WAL_COLS	9
 	TupleDesc	tupdesc;
 	Datum		values[PG_STAT_GET_WAL_COLS];
 	bool		nulls[PG_STAT_GET_WAL_COLS];
@@ -1820,7 +1820,15 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 					   NUMERICOID, -1, 0);
 	TupleDescInitEntry(tupdesc, (AttrNumber) 4, "wal_buffers_full",
 					   INT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 5, "stats_reset",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 5, "wal_write",
+					   INT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 6, "wal_sync",
+					   INT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 7, "wal_write_time",
+					   FLOAT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 8, "wal_sync_time",
+					   FLOAT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 9, "stats_reset",
 					   TIMESTAMPTZOID, -1, 0);
 
 	BlessTupleDesc(tupdesc);
@@ -1840,7 +1848,14 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 									Int32GetDatum(-1));
 
 	values[3] = Int64GetDatum(wal_stats->wal_buffers_full);
-	values[4] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
+	values[4] = Int64GetDatum(wal_stats->wal_write);
+	values[5] = Int64GetDatum(wal_stats->wal_sync);
+
+	/* Convert counters from microsec to millisec for display */
+	values[6] = Float8GetDatum(((double) wal_stats->wal_write_time) / 1000.0);
+	values[7] = Float8GetDatum(((double) wal_stats->wal_sync_time) / 1000.0);
+
+	values[8] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
 
 	/* Returns the record as Datum */
 	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 442850e8ad..da05b5acb1 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1485,6 +1485,15 @@ static struct config_bool ConfigureNamesBool[] =
 		false,
 		NULL, NULL, NULL
 	},
+	{
+		{"track_wal_io_timing", PGC_SUSET, STATS_COLLECTOR,
+			gettext_noop("Collects timing statistics for WAL I/O activity."),
+			NULL
+		},
+		&track_wal_io_timing,
+		false,
+		NULL, NULL, NULL
+	},
 
 	{
 		{"update_process_title", PGC_SUSET, PROCESS_TITLE,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index ee06528bb0..c6483fa1ff 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -586,6 +586,7 @@
 #track_activities = on
 #track_counts = on
 #track_io_timing = off
+#track_wal_io_timing = off
 #track_functions = none			# none, pl, all
 #track_activity_query_size = 1024	# (change requires restart)
 #stats_temp_directory = 'pg_stat_tmp'
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 75ec1073bd..1e53d9d4ca 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -131,6 +131,7 @@ extern int	recovery_min_apply_delay;
 extern char *PrimaryConnInfo;
 extern char *PrimarySlotName;
 extern bool wal_receiver_create_temp_slot;
+extern bool track_wal_io_timing;
 
 /* indirectly set via GUC system */
 extern TransactionId recoveryTargetXid;
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 7ba7c2ff8a..933659efc6 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5546,9 +5546,9 @@
 { oid => '1136', descr => 'statistics: information about WAL activity',
   proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
   proparallel => 'r', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int8,int8,numeric,int8,timestamptz}',
-  proargmodes => '{o,o,o,o,o}',
-  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,stats_reset}',
+  proallargtypes => '{int8,int8,numeric,int8,int8,int8,float8,float8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o}',
+  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,wal_write,wal_sync,wal_write_time,wal_sync_time,stats_reset}',
   prosrc => 'pg_stat_get_wal' },
 
 { oid => '2306', descr => 'statistics: information about SLRU caches',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 724068cf87..5c6c8efc5a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -474,6 +474,12 @@ typedef struct PgStat_MsgWal
 	PgStat_Counter m_wal_fpi;
 	uint64		m_wal_bytes;
 	PgStat_Counter m_wal_buffers_full;
+	PgStat_Counter m_wal_write;
+	PgStat_Counter m_wal_sync;
+	PgStat_Counter m_wal_write_time;	/* time spent writing wal records in
+										 * microseconds */
+	PgStat_Counter m_wal_sync_time; /* time spent syncing wal records in
+									 * microseconds */
 } PgStat_MsgWal;
 
 /* ----------
@@ -839,6 +845,10 @@ typedef struct PgStat_WalStats
 	PgStat_Counter wal_fpi;
 	uint64		wal_bytes;
 	PgStat_Counter wal_buffers_full;
+	PgStat_Counter wal_write;
+	PgStat_Counter wal_sync;
+	PgStat_Counter wal_write_time;
+	PgStat_Counter wal_sync_time;
 	TimestampTz stat_reset_timestamp;
 } PgStat_WalStats;
 
@@ -1590,7 +1600,8 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
 
 extern void pgstat_send_archiver(const char *xlog, bool failed);
 extern void pgstat_send_bgwriter(void);
-extern void pgstat_send_wal(void);
+extern void pgstat_report_wal(void);
+extern bool pgstat_send_wal(bool force);
 
 /* ----------
  * Support functions for the SQL-callable functions to
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index b1c9b7bdfe..e03ef0555a 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2160,8 +2160,12 @@ pg_stat_wal| SELECT w.wal_records,
     w.wal_fpi,
     w.wal_bytes,
     w.wal_buffers_full,
+    w.wal_write,
+    w.wal_sync,
+    w.wal_write_time,
+    w.wal_sync_time,
     w.stats_reset
-   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, stats_reset);
+   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, wal_write, wal_sync, wal_write_time, wal_sync_time, stats_reset);
 pg_stat_wal_receiver| SELECT s.pid,
     s.status,
     s.receive_start_lsn,

#45

Fujii Masao

masao.fujii@oss.nttdata.com

almost 5 years ago

In reply to: Masahiro Ikeda (#44)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021/03/05 19:54, Masahiro Ikeda wrote:

On 2021-03-05 12:47, Fujii Masao wrote:
On 2021/03/05 8:38, Masahiro Ikeda wrote:
On 2021-03-05 01:02, Fujii Masao wrote:
On 2021/03/04 16:14, Masahiro Ikeda wrote:
On 2021-03-03 20:27, Masahiro Ikeda wrote:
On 2021-03-03 16:30, Fujii Masao wrote:
On 2021/03/03 14:33, Masahiro Ikeda wrote:

On 2021-02-24 16:14, Fujii Masao wrote:

On 2021/02/15 11:59, Masahiro Ikeda wrote:

On 2021-02-10 00:51, David G. Johnston wrote:

On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

I pgindented the patches.

... <function>XLogWrite</function>, which is invoked during an
<function>XLogFlush</function> request (see ...). This is also
incremented by the WAL receiver during replication.

("which normally called" should be "which is normally called" or
"which normally is called" if you want to keep true to the original)
You missed the adding the space before an opening parenthesis here and
elsewhere (probably copy-paste)

is ether -> is either
"This parameter is off by default as it will repeatedly query the
operating system..."
", because" -> "as"

Thanks, I fixed them.

wal_write_time and the sync items also need the note: "This is also
incremented by the WAL receiver during replication."

I skipped changing it since I separated the stats for the WAL receiver
in pg_stat_wal_receiver.

"The number of times it happened..." -> " (the tally of this event is
reported in wal_buffers_full in....) This is undesirable because ..."

Thanks, I fixed it.

I notice that the patch for WAL receiver doesn't require explicitly
computing the sync statistics but does require computing the write
statistics. This is because of the presence of issue_xlog_fsync but
absence of an equivalent pg_xlog_pwrite. Additionally, I observe that
the XLogWrite code path calls pgstat_report_wait_*() while the WAL
receiver path does not. It seems technically straight-forward to
refactor here to avoid the almost-duplicated logic in the two places,
though I suspect there may be a trade-off for not adding another
function call to the stack given the importance of WAL processing
(though that seems marginalized compared to the cost of actually
writing the WAL). Or, as Fujii noted, go the other way and don't have
any shared code between the two but instead implement the WAL receiver
one to use pg_stat_wal_receiver instead. In either case, this
half-and-half implementation seems undesirable.

OK, as Fujii-san mentioned, I separated the WAL receiver stats.
(v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)

Thanks for updating the patches!

I added the infrastructure code to communicate the WAL receiver stats messages between the WAL receiver and the stats collector, and
the stats for WAL receiver is counted in pg_stat_wal_receiver.
What do you think?

On second thought, this idea seems not good. Because those stats are
collected between multiple walreceivers, but other values in
pg_stat_wal_receiver is only related to the walreceiver process running
at that moment. IOW, it seems strange that some values show dynamic
stats and the others show collected stats, even though they are in
the same view pg_stat_wal_receiver. Thought?

OK, I fixed it.
The stats collected in the WAL receiver is exposed in pg_stat_wal view in v11 patch.

Thanks for updating the patches! I'm now reading 001 patch.
+    /* Check whether the WAL file was synced to disk right now */
+    if (enableFsync &&
+        (sync_method == SYNC_METHOD_FSYNC ||
+         sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
+         sync_method == SYNC_METHOD_FDATASYNC))
+    {
Isn't it better to make issue_xlog_fsync() return immediately
if enableFsync is off, sync_method is open_sync or open_data_sync,
to simplify the code more?
Thanks for the comments.
I added the above code in v12 patch.
+        /*
+         * Send WAL statistics only if WalWriterDelay has elapsed to minimize
+         * the overhead in WAL-writing.
+         */
+        if (rc & WL_TIMEOUT)
+            pgstat_send_wal();
On second thought, this change means that it always takes wal_writer_delay
before walwriter's WAL stats is sent after XLogBackgroundFlush() is called.
For example, if wal_writer_delay is set to several seconds, some values in
pg_stat_wal would be not up-to-date meaninglessly for those seconds.
So I'm thinking to withdraw my previous comment and it's ok to send
the stats every after XLogBackgroundFlush() is called. Thought?
Thanks, I didn't notice that.

Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's
default value is 200msec and it may be set shorter time.
Yeah, if wal_writer_delay is set to very small value, there is a risk
that the WAL stats are sent too frequently. I agree that's a problem.
Why don't to make another way to check the timestamp?
+               /*
+                * Don't send a message unless it's been at least
PGSTAT_STAT_INTERVAL
+                * msec since we last sent one
+                */
+               now = GetCurrentTimestamp();
+               if (TimestampDifferenceExceeds(last_report, now,
PGSTAT_STAT_INTERVAL))
+               {
+                       pgstat_send_wal();
+                       last_report = now;
+               }
+
Although I worried that it's better to add the check code in pgstat_send_wal(),
Agreed.

I didn't do so because to avoid to double check PGSTAT_STAT_INTERVAL.
pgstat_send_wal() is invoked pg_report_stat() and it already checks the
PGSTAT_STAT_INTERVAL.

I think that we can do that. What about the attached patch?
Thanks, I thought it's better.
I forgot to remove an unused variable.
The attached v13 patch is fixed.

Thanks for updating the patch!
+        w.wal_write,
+        w.wal_write_time,
+        w.wal_sync,
+        w.wal_sync_time,
It's more natural to put wal_write_time and wal_sync_time next to
each other? That is, what about the following order of columns?

wal_write
wal_sync
wal_write_time
wal_sync_time
Yes, I fixed it.
-        case SYNC_METHOD_OPEN:
-        case SYNC_METHOD_OPEN_DSYNC:
-            /* write synced it already */
-            break;

IMO it's better to add Assert(false) here to ensure that we never reach
here, as follows. Thought?
+        case SYNC_METHOD_OPEN:
+        case SYNC_METHOD_OPEN_DSYNC:
+            /* not reachable */
+            Assert(false);
I agree.

Even when a backend exits, it sends the stats via pgstat_beshutdown_hook().
On the other hand, walwriter doesn't do that. Walwriter also should send
the stats even at its exit? Otherwise some stats can fail to be collected.
But ISTM that this issue existed from before, for example checkpointer
doesn't call pgstat_send_bgwriter() at its exit, so it's overkill to fix
this issue in this patch?

Thanks, I thought it's better to do so.
I added the shutdown hook for the walwriter and the checkpointer in v14-0003 patch.
Thanks!

Seems you forgot to include the changes of expected/rules.out in 0001 patch,
and which caused the regression test to fail. Attached is the updated version
of the patch. I included expected/rules.out in it.
Sorry.
+    PgStat_Counter m_wal_write_time;    /* time spend writing wal records in
+                                         * micro seconds */
+    PgStat_Counter m_wal_sync_time; /* time spend syncing wal records in micro
+                                     * seconds */
IMO "spend" should be "spent". Also "micro seconds" should be "microseconds"
in sake of consistent with other comments in pgstat.h. I fixed them.
Thanks.

Regarding pgstat_report_wal() and pgstat_send_wal(), I found one bug. Even
when pgstat_send_wal() returned without sending any message,
pgstat_report_wal() saved current pgWalUsage and that counter was used for
the subsequent calculation of WAL usage. This caused some counters not to
be sent to the collector. This is a bug that I added. I fixed this bug.

Thanks.
+    walStats.wal_write += msg->m_wal_write;
+    walStats.wal_write_time += msg->m_wal_write_time;
+    walStats.wal_sync += msg->m_wal_sync;
+    walStats.wal_sync_time += msg->m_wal_sync_time;
I changed the order of the above in pgstat.c so that wal_write_time and
wal_sync_time are placed in next to each other.
I forgot to fix them, thanks.
The followings are the comments for the docs part. I've not updated this
in the patch yet because I'm not sure how to change them for now.
+       Number of times WAL buffers were written out to disk via
+       <function>XLogWrite</function>, which is invoked during an
+       <function>XLogFlush</function> request (see <xref
linkend="wal-configuration"/>)
+      </para></entry>
XLogWrite() can be invoked during the functions other than XLogFlush().
For example, XLogBackgroundFlush(). So the above description might be
confusing?
+       Number of times WAL files were synced to disk via
+       <function>issue_xlog_fsync</function>, which is invoked during an
+       <function>XLogFlush</function> request (see <xref
linkend="wal-configuration"/>)
Same as above.
Yes, why don't you remove "XLogFlush" in the above comments
because XLogWrite() description is covered in wal.sgml?

But, now it's mentioned only for backend,
I added the comments for the wal writer in the attached patch.
+       while <xref linkend="guc-wal-sync-method"/> was set to one of the
+       "sync at commit" options (i.e., <literal>fdatasync</literal>,
+       <literal>fsync</literal>, or <literal>fsync_writethrough</literal>).
Even open_sync and open_datasync do the sync at commit. No? I'm not sure
if "sync at commit" is right term to indicate fdatasync, fsync and
fsync_writethrough.
Yes, why don't you change to the following comments?

```
       while <xref linkend="guc-wal-sync-method"/> was set to one of the
       options which specific fsync method is called (i.e., <literal>fdatasync</literal>,
       <literal>fsync</literal>, or <literal>fsync_writethrough</literal>)
```

+       <literal>open_sync</literal>. Units are in milliseconds with
microsecond resolution.

"with microsecond resolution" part is really necessary?

I removed it because blk_read_time in pg_stat_database is the same above,
but it doesn't mention it.
+   transaction records are flushed to permanent storage.
+   <function>XLogFlush</function> calls <function>XLogWrite</function> to write
+   and <function>issue_xlog_fsync</function> to flush them, which are
counted as
+   <literal>wal_write</literal> and <literal>wal_sync</literal> in
+   <xref linkend="pg-stat-wal-view"/>. On systems with high log output,
This description might cause users to misread that XLogFlush() calls
issue_xlog_fsync(). Since issue_xlog_fsync() is called by XLogWrite(),
ISTM that this description needs to be updated.
I understood. I fixed to mention that XLogWrite()
calls issue_xlog_fsync().

Each line in the above seems to end with a space character.
This space character should be removed.

Sorry for that. I removed it.

Thanks for updating the patch! I think it's getting good shape!

- pid  | wait_event_type | wait_event
+ pid  | wait_event_type | wait_event

This change is not necessary?

-   every <xref linkend="guc-wal-writer-delay"/> milliseconds.
+   every <xref linkend="guc-wal-writer-delay"/> milliseconds, which calls
+   <function>XLogWrite</function> to write and <function>XLogWrite</function>
+   <function>issue_xlog_fsync</function> to flush them. They are counted as
+   <literal>wal_write</literal> and <literal>wal_sync</literal> in
+   <xref linkend="pg-stat-wal-view"/>.

Isn't it better to avoid using the terms like XLogWrite or issue_xlog_fsync
before explaining what they are? They are explained later. At least for me
I'm ok without this change.

-   to write (move to kernel cache) a few filled <acronym>WAL</acronym>
-   buffers. This is undesirable because <function>XLogInsertRecord</function>
+   to call <function>XLogWrite</function> to write (move to kernel cache) a
+   few filled <acronym>WAL</acronym> buffers (the tally of this event is reported in
+   <literal>wal_buffers_full</literal> in <xref linkend="pg-stat-wal-view"/>).
+   This is undesirable because <function>XLogInsertRecord</function>

This paragraph explains the relationshp between WAL writes and WAL buffers. I don't think it's good to add different context to this paragraph. Instead, what about adding new paragraph like the follwing?

----------------------------------
When track_wal_io_timing is enabled, the total amounts of time XLogWrite writes and issue_xlog_fsync syncs WAL data to disk are counted as wal_write_time and wal_sync_time in pg_stat_wal view, respectively. XLogWrite is normally called by XLogInsertRecord (when there is no space for the new record in WAL buffers), XLogFlush and the WAL writer, to write WAL buffers to disk and call issue_xlog_fsync. If wal_sync_method is either open_datasync or open_sync, a write operation in XLogWrite guarantees to sync written WAL data to disk and issue_xlog_fsync does nothing. If wal_sync_method is either fdatasync, fsync, or fsync_writethrough, the write operation moves WAL buffer to kernel cache and issue_xlog_fsync syncs WAL files to disk. Regardless of the setting of track_wal_io_timing, the numbers of times XLogWrite writes and issue_xlog_fsync syncs WAL data to disk are also counted as wal_write and wal_sync in pg_stat_wal, respectively.
----------------------------------

+ <function>issue_xlog_fsync</function> (see <xref linkend="wal-configuration"/>)

"request" should be place just before "(see"?

+       Number of times WAL files were synced to disk via
+       <function>issue_xlog_fsync</function> (see <xref linkend="wal-configuration"/>)
+       while <xref linkend="guc-wal-sync-method"/> was set to one of the
+       options which specific fsync method is called (i.e., <literal>fdatasync</literal>,
+       <literal>fsync</literal>, or <literal>fsync_writethrough</literal>).

Isn't it better to mention the case of fsync=off? What about the following?

----------------------------------
Number of times WAL files were synced to disk via issue_xlog_fsync (see ...). This is zero when fsync is off or wal_sync_method is either open_datasync or open_sync.
----------------------------------

+ Total amount of time spent writing WAL buffers were written out to disk via

"were written out" is not necessary?

+       Total amount of time spent syncing WAL files to disk via
+       <function>issue_xlog_fsync</function> request (see <xref linkend="wal-configuration"/>)
+       while <xref linkend="guc-wal-sync-method"/> was set to one of the
+       options which specific fsync method is called (i.e., <literal>fdatasync</literal>,
+       <literal>fsync</literal>, or <literal>fsync_writethrough</literal>).
+       Units are in milliseconds.
+       This is zero when <xref linkend="guc-track-wal-io-timing"/> is disabled.

Isn't it better to explain the case where this counter is zero a bit more clearly as follows?

---------------------
This is zero when track_wal_io_timing is disabled, fsync is off, or wal_sync_method is either open_datasync or open_sync.
---------------------

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#46

Masahiro Ikeda

ikedamsh@oss.nttdata.com

almost 5 years ago

In reply to: Fujii Masao (#45)

1 attachment(s)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021-03-08 13:44, Fujii Masao wrote:

On 2021/03/05 19:54, Masahiro Ikeda wrote:
On 2021-03-05 12:47, Fujii Masao wrote:
On 2021/03/05 8:38, Masahiro Ikeda wrote:
On 2021-03-05 01:02, Fujii Masao wrote:
On 2021/03/04 16:14, Masahiro Ikeda wrote:
On 2021-03-03 20:27, Masahiro Ikeda wrote:
On 2021-03-03 16:30, Fujii Masao wrote:
On 2021/03/03 14:33, Masahiro Ikeda wrote:

On 2021-02-24 16:14, Fujii Masao wrote:

On 2021/02/15 11:59, Masahiro Ikeda wrote:

On 2021-02-10 00:51, David G. Johnston wrote:

On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

I pgindented the patches.

... <function>XLogWrite</function>, which is invoked during
an
<function>XLogFlush</function> request (see ...). This is
also
incremented by the WAL receiver during replication.

("which normally called" should be "which is normally
called" or
"which normally is called" if you want to keep true to the
original)
You missed the adding the space before an opening
parenthesis here and
elsewhere (probably copy-paste)

is ether -> is either
"This parameter is off by default as it will repeatedly
query the
operating system..."
", because" -> "as"

Thanks, I fixed them.

wal_write_time and the sync items also need the note: "This
is also
incremented by the WAL receiver during replication."

I skipped changing it since I separated the stats for the WAL
receiver
in pg_stat_wal_receiver.

"The number of times it happened..." -> " (the tally of this
event is
reported in wal_buffers_full in....) This is undesirable
because ..."

Thanks, I fixed it.

I notice that the patch for WAL receiver doesn't require
explicitly
computing the sync statistics but does require computing the
write
statistics. This is because of the presence of
issue_xlog_fsync but
absence of an equivalent pg_xlog_pwrite. Additionally, I
observe that
the XLogWrite code path calls pgstat_report_wait_*() while
the WAL
receiver path does not. It seems technically
straight-forward to
refactor here to avoid the almost-duplicated logic in the
two places,
though I suspect there may be a trade-off for not adding
another
function call to the stack given the importance of WAL
processing
(though that seems marginalized compared to the cost of
actually
writing the WAL). Or, as Fujii noted, go the other way and
don't have
any shared code between the two but instead implement the
WAL receiver
one to use pg_stat_wal_receiver instead. In either case,
this
half-and-half implementation seems undesirable.

OK, as Fujii-san mentioned, I separated the WAL receiver
stats.
(v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)

Thanks for updating the patches!

I added the infrastructure code to communicate the WAL
receiver stats messages between the WAL receiver and the
stats collector, and
the stats for WAL receiver is counted in
pg_stat_wal_receiver.
What do you think?

On second thought, this idea seems not good. Because those
stats are
collected between multiple walreceivers, but other values in
pg_stat_wal_receiver is only related to the walreceiver
process running
at that moment. IOW, it seems strange that some values show
dynamic
stats and the others show collected stats, even though they
are in
the same view pg_stat_wal_receiver. Thought?

OK, I fixed it.
The stats collected in the WAL receiver is exposed in
pg_stat_wal view in v11 patch.

Thanks for updating the patches! I'm now reading 001 patch.
+    /* Check whether the WAL file was synced to disk right now 
*/
+    if (enableFsync &&
+        (sync_method == SYNC_METHOD_FSYNC ||
+         sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
+         sync_method == SYNC_METHOD_FDATASYNC))
+    {
Isn't it better to make issue_xlog_fsync() return immediately
if enableFsync is off, sync_method is open_sync or
open_data_sync,
to simplify the code more?
Thanks for the comments.
I added the above code in v12 patch.
+        /*
+         * Send WAL statistics only if WalWriterDelay has 
elapsed to minimize
+         * the overhead in WAL-writing.
+         */
+        if (rc & WL_TIMEOUT)
+            pgstat_send_wal();
On second thought, this change means that it always takes
wal_writer_delay
before walwriter's WAL stats is sent after XLogBackgroundFlush()
is called.
For example, if wal_writer_delay is set to several seconds, some
values in
pg_stat_wal would be not up-to-date meaninglessly for those
seconds.
So I'm thinking to withdraw my previous comment and it's ok to
send
the stats every after XLogBackgroundFlush() is called. Thought?
Thanks, I didn't notice that.

Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's
default value is 200msec and it may be set shorter time.
Yeah, if wal_writer_delay is set to very small value, there is a
risk
that the WAL stats are sent too frequently. I agree that's a
problem.
Why don't to make another way to check the timestamp?
+               /*
+                * Don't send a message unless it's been at least
PGSTAT_STAT_INTERVAL
+                * msec since we last sent one
+                */
+               now = GetCurrentTimestamp();
+               if (TimestampDifferenceExceeds(last_report, now,
PGSTAT_STAT_INTERVAL))
+               {
+                       pgstat_send_wal();
+                       last_report = now;
+               }
+
Although I worried that it's better to add the check code in
pgstat_send_wal(),
Agreed.

I didn't do so because to avoid to double check
PGSTAT_STAT_INTERVAL.
pgstat_send_wal() is invoked pg_report_stat() and it already
checks the
PGSTAT_STAT_INTERVAL.

I think that we can do that. What about the attached patch?
Thanks, I thought it's better.
I forgot to remove an unused variable.
The attached v13 patch is fixed.

Thanks for updating the patch!
+        w.wal_write,
+        w.wal_write_time,
+        w.wal_sync,
+        w.wal_sync_time,
It's more natural to put wal_write_time and wal_sync_time next to
each other? That is, what about the following order of columns?

wal_write
wal_sync
wal_write_time
wal_sync_time
Yes, I fixed it.
-        case SYNC_METHOD_OPEN:
-        case SYNC_METHOD_OPEN_DSYNC:
-            /* write synced it already */
-            break;

IMO it's better to add Assert(false) here to ensure that we never
reach
here, as follows. Thought?
+        case SYNC_METHOD_OPEN:
+        case SYNC_METHOD_OPEN_DSYNC:
+            /* not reachable */
+            Assert(false);
I agree.

Even when a backend exits, it sends the stats via
pgstat_beshutdown_hook().
On the other hand, walwriter doesn't do that. Walwriter also should
send
the stats even at its exit? Otherwise some stats can fail to be
collected.
But ISTM that this issue existed from before, for example
checkpointer
doesn't call pgstat_send_bgwriter() at its exit, so it's overkill
to fix
this issue in this patch?

Thanks, I thought it's better to do so.
I added the shutdown hook for the walwriter and the checkpointer in
v14-0003 patch.
Thanks!

Seems you forgot to include the changes of expected/rules.out in 0001
patch,
and which caused the regression test to fail. Attached is the updated
version
of the patch. I included expected/rules.out in it.
Sorry.
+    PgStat_Counter m_wal_write_time;    /* time spend writing wal 
records in
+                                         * micro seconds */
+    PgStat_Counter m_wal_sync_time; /* time spend syncing wal 
records in micro
+                                     * seconds */
IMO "spend" should be "spent". Also "micro seconds" should be
"microseconds"
in sake of consistent with other comments in pgstat.h. I fixed them.
Thanks.

Regarding pgstat_report_wal() and pgstat_send_wal(), I found one bug.
Even
when pgstat_send_wal() returned without sending any message,
pgstat_report_wal() saved current pgWalUsage and that counter was
used for
the subsequent calculation of WAL usage. This caused some counters
not to
be sent to the collector. This is a bug that I added. I fixed this
bug.

Thanks.
+    walStats.wal_write += msg->m_wal_write;
+    walStats.wal_write_time += msg->m_wal_write_time;
+    walStats.wal_sync += msg->m_wal_sync;
+    walStats.wal_sync_time += msg->m_wal_sync_time;
I changed the order of the above in pgstat.c so that wal_write_time
and
wal_sync_time are placed in next to each other.
I forgot to fix them, thanks.
The followings are the comments for the docs part. I've not updated 
this
in the patch yet because I'm not sure how to change them for now.
+       Number of times WAL buffers were written out to disk via
+       <function>XLogWrite</function>, which is invoked during an
+       <function>XLogFlush</function> request (see <xref
linkend="wal-configuration"/>)
+      </para></entry>
XLogWrite() can be invoked during the functions other than
XLogFlush().
For example, XLogBackgroundFlush(). So the above description might be
confusing?
+       Number of times WAL files were synced to disk via
+       <function>issue_xlog_fsync</function>, which is invoked 
during an
+       <function>XLogFlush</function> request (see <xref
linkend="wal-configuration"/>)
Same as above.
Yes, why don't you remove "XLogFlush" in the above comments
because XLogWrite() description is covered in wal.sgml?

But, now it's mentioned only for backend,
I added the comments for the wal writer in the attached patch.
+       while <xref linkend="guc-wal-sync-method"/> was set to one of 
the
+       "sync at commit" options (i.e., <literal>fdatasync</literal>,
+       <literal>fsync</literal>, or 
<literal>fsync_writethrough</literal>).
Even open_sync and open_datasync do the sync at commit. No? I'm not
sure
if "sync at commit" is right term to indicate fdatasync, fsync and
fsync_writethrough.
Yes, why don't you change to the following comments?

```
       while <xref linkend="guc-wal-sync-method"/> was set to one of
the
       options which specific fsync method is called (i.e.,
<literal>fdatasync</literal>,
       <literal>fsync</literal>, or
<literal>fsync_writethrough</literal>)
```

+       <literal>open_sync</literal>. Units are in milliseconds with
microsecond resolution.

"with microsecond resolution" part is really necessary?

I removed it because blk_read_time in pg_stat_database is the same
above,
but it doesn't mention it.
+   transaction records are flushed to permanent storage.
+   <function>XLogFlush</function> calls 
<function>XLogWrite</function> to write
+   and <function>issue_xlog_fsync</function> to flush them, which 
are
counted as
+   <literal>wal_write</literal> and <literal>wal_sync</literal> in
+   <xref linkend="pg-stat-wal-view"/>. On systems with high log 
output,
This description might cause users to misread that XLogFlush() calls
issue_xlog_fsync(). Since issue_xlog_fsync() is called by
XLogWrite(),
ISTM that this description needs to be updated.
I understood. I fixed to mention that XLogWrite()
calls issue_xlog_fsync().

Each line in the above seems to end with a space character.
This space character should be removed.

Sorry for that. I removed it.
Thanks for updating the patch! I think it's getting good shape!
- pid  | wait_event_type | wait_event
+ pid  | wait_event_type | wait_event
This change is not necessary?

No, sorry.
I removed it by mistake when I remove trailing space characters.

-   every <xref linkend="guc-wal-writer-delay"/> milliseconds.
+   every <xref linkend="guc-wal-writer-delay"/> milliseconds, which 
calls
+   <function>XLogWrite</function> to write and 
<function>XLogWrite</function>
+   <function>issue_xlog_fsync</function> to flush them. They are 
counted as
+   <literal>wal_write</literal> and <literal>wal_sync</literal> in
+   <xref linkend="pg-stat-wal-view"/>.
Isn't it better to avoid using the terms like XLogWrite or
issue_xlog_fsync
before explaining what they are? They are explained later. At least for
me
I'm ok without this change.

OK. I removed them and add a new paragraph.

-   to write (move to kernel cache) a few filled <acronym>WAL</acronym>
-   buffers. This is undesirable because 
<function>XLogInsertRecord</function>
+   to call <function>XLogWrite</function> to write (move to kernel 
cache) a
+   few filled <acronym>WAL</acronym> buffers (the tally of this event
is reported in
+   <literal>wal_buffers_full</literal> in <xref 
linkend="pg-stat-wal-view"/>).
+   This is undesirable because <function>XLogInsertRecord</function>
This paragraph explains the relationshp between WAL writes and WAL
buffers. I don't think it's good to add different context to this
paragraph. Instead, what about adding new paragraph like the follwing?

----------------------------------
When track_wal_io_timing is enabled, the total amounts of time
XLogWrite writes and issue_xlog_fsync syncs WAL data to disk are
counted as wal_write_time and wal_sync_time in pg_stat_wal view,
respectively. XLogWrite is normally called by XLogInsertRecord (when
there is no space for the new record in WAL buffers), XLogFlush and
the WAL writer, to write WAL buffers to disk and call
issue_xlog_fsync. If wal_sync_method is either open_datasync or
open_sync, a write operation in XLogWrite guarantees to sync written
WAL data to disk and issue_xlog_fsync does nothing. If wal_sync_method
is either fdatasync, fsync, or fsync_writethrough, the write operation
moves WAL buffer to kernel cache and issue_xlog_fsync syncs WAL files
to disk. Regardless of the setting of track_wal_io_timing, the numbers
of times XLogWrite writes and issue_xlog_fsync syncs WAL data to disk
are also counted as wal_write and wal_sync in pg_stat_wal,
respectively.
----------------------------------

Thanks, I agree it's better.

+ <function>issue_xlog_fsync</function> (see <xref
linkend="wal-configuration"/>)

"request" should be place just before "(see"?

Yes, thanks.

+       Number of times WAL files were synced to disk via
+       <function>issue_xlog_fsync</function> (see <xref
linkend="wal-configuration"/>)
+       while <xref linkend="guc-wal-sync-method"/> was set to one of 
the
+       options which specific fsync method is called (i.e.,
<literal>fdatasync</literal>,
+       <literal>fsync</literal>, or 
<literal>fsync_writethrough</literal>).
Isn't it better to mention the case of fsync=off? What about the
following?

----------------------------------
Number of times WAL files were synced to disk via issue_xlog_fsync
(see ...). This is zero when fsync is off or wal_sync_method is either
open_datasync or open_sync.
----------------------------------

Yes.

+ Total amount of time spent writing WAL buffers were written
out to disk via

"were written out" is not necessary?

Yes, removed it.

+       Total amount of time spent syncing WAL files to disk via
+       <function>issue_xlog_fsync</function> request (see <xref
linkend="wal-configuration"/>)
+       while <xref linkend="guc-wal-sync-method"/> was set to one of 
the
+       options which specific fsync method is called (i.e.,
<literal>fdatasync</literal>,
+       <literal>fsync</literal>, or 
<literal>fsync_writethrough</literal>).
+       Units are in milliseconds.
+       This is zero when <xref linkend="guc-track-wal-io-timing"/> is 
disabled.
Isn't it better to explain the case where this counter is zero a bit
more clearly as follows?

---------------------
This is zero when track_wal_io_timing is disabled, fsync is off, or
wal_sync_method is either open_datasync or open_sync.
---------------------

Yes, thanks.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachments:

v16-0001-Add-statistics-related-to-write-sync-wal-records.patchtext/x-diff; name=v16-0001-Add-statistics-related-to-write-sync-wal-records.patchDownload

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 967de73596..56eb55bab7 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7450,7 +7450,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       <listitem>
        <para>
         Enables timing of database I/O calls.  This parameter is off by
-        default, because it will repeatedly query the operating system for
+        default, as it will repeatedly query the operating system for
         the current time, which may cause significant overhead on some
         platforms.  You can use the <xref linkend="pgtesttiming"/> tool to
         measure the overhead of timing on your system.
@@ -7464,6 +7464,27 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-track-wal-io-timing" xreflabel="track_wal_io_timing">
+      <term><varname>track_wal_io_timing</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>track_wal_io_timing</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables timing of WAL I/O calls. This parameter is off by default,
+        as it will repeatedly query the operating system for
+        the current time, which may cause significant overhead on some
+        platforms.  You can use the <xref linkend="pgtesttiming"/> tool to
+        measure the overhead of timing on your system.
+        I/O timing information is
+        displayed in <link linkend="monitoring-pg-stat-wal-view">
+        <structname>pg_stat_wal</structname></link>.  Only superusers can
+        change this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-track-functions" xreflabel="track_functions">
       <term><varname>track_functions</varname> (<type>enum</type>)
       <indexterm>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3513e127b7..a8506d0486 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3487,6 +3487,57 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </para></entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of times WAL buffers were written out to disk via
+       <function>XLogWrite</function> request (see <xref linkend="wal-configuration"/>)
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of times WAL files were synced to disk via
+       <function>issue_xlog_fsync</function> request (see <xref linkend="wal-configuration"/>).
+       This is zero when <xref linkend="guc-fsync"/> is off or 
+       <xref linkend="guc-wal-sync-method"/> is either <literal>open_datasync</literal>
+       or <literal>open_sync</literal>.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time spent writing WAL buffers to disk via
+       <function>XLogWrite</function> request (see <xref linkend="wal-configuration"/>),
+       excluding sync time unless
+       <xref linkend="guc-wal-sync-method"/> is either <literal>open_datasync</literal> or
+       <literal>open_sync</literal>. Units are in milliseconds.
+       This is zero when <xref linkend="guc-track-wal-io-timing"/> is disabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time spent syncing WAL files to disk via
+       <function>issue_xlog_fsync</function> request (see <xref linkend="wal-configuration"/>).
+       Units are in milliseconds.
+       This is zero when <xref linkend="guc-track-wal-io-timing"/> is disabled,
+       <xref linkend="guc-fsync"/> is off, or <xref linkend="guc-wal-sync-method"/> is
+       either <literal>open_datasync</literal> or <literal>open_sync</literal>.
+      </para></entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index f75527f764..a3c7e0d26c 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -767,6 +767,32 @@
    <acronym>WAL</acronym> call being logged to the server log. This
    option might be replaced by a more general mechanism in the future.
   </para>
+
+  <para>
+   When <xref linkend="guc-track-wal-io-timing"/> is enabled, the total
+   amounts of time <function>XLogWrite</function> writes and
+   <function>issue_xlog_fsync</function> syncs WAL data to disk are
+   counted as <literal>wal_write_time</literal> and
+   <literal>wal_sync_time</literal> in 
+   <xref linkend="pg-stat-wal-view"/>, respectively.
+   <function>XLogWrite</function> is normally called by 
+   <function>XLogInsertRecord</function> (when there is no space for 
+   the new record in WAL buffers), <function>XLogFlush</function> and
+   the WAL writer, to write WAL buffers to disk and call
+   <function>issue_xlog_fsync</function>. If <xref linkend="guc-wal-sync-method"/>
+   is either <literal>open_datasync</literal> or <literal>open_sync</literal>,
+   a write operation in <function>XLogWrite</function> guarantees to sync written
+   WAL data to disk and <function>issue_xlog_fsync</function> does nothing.
+   If <xref linkend="guc-wal-sync-method"/> is either <literal>fdatasync</literal>,
+   <literal>fsync</literal>, or <literal>fsync_writethrough</literal>,
+   the write operation moves WAL buffer to kernel cache and
+   <function>issue_xlog_fsync</function> syncs WAL files to disk. Regardless
+   of the setting of <xref linkend="guc-track-wal-io-timing"/>, the numbers
+   of times <function>XLogWrite</function> writes and
+   <function>issue_xlog_fsync</function> syncs WAL data to disk are also
+   counted as <literal>wal_write</literal> and <literal>wal_sync</literal>
+   in <xref linkend="pg-stat-wal-view"/>, respectively.
+  </para>
  </sect1>
 
  <sect1 id="wal-internals">
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fe56324439..24c3dd32f8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -110,6 +110,7 @@ int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
 int			max_slot_wal_keep_size_mb = -1;
+bool		track_wal_io_timing = false;
 
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
@@ -2533,6 +2534,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			int			written;
+			instr_time	start;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
@@ -2541,9 +2543,30 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			do
 			{
 				errno = 0;
+
+				/* Measure I/O timing to write WAL data */
+				if (track_wal_io_timing)
+					INSTR_TIME_SET_CURRENT(start);
+
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 				written = pg_pwrite(openLogFile, from, nleft, startoffset);
 				pgstat_report_wait_end();
+
+				/*
+				 * Increment the I/O timing and the number of times WAL data
+				 * were written out to disk.
+				 */
+				if (track_wal_io_timing)
+				{
+					instr_time	duration;
+
+					INSTR_TIME_SET_CURRENT(duration);
+					INSTR_TIME_SUBTRACT(duration, start);
+					WalStats.m_wal_write_time += INSTR_TIME_GET_MICROSEC(duration);
+				}
+
+				WalStats.m_wal_write++;
+
 				if (written <= 0)
 				{
 					char		xlogfname[MAXFNAMELEN];
@@ -10526,6 +10549,20 @@ void
 issue_xlog_fsync(int fd, XLogSegNo segno)
 {
 	char	   *msg = NULL;
+	instr_time	start;
+
+	/*
+	 * Quick exit if fsync is disabled or write() has already synced the WAL
+	 * file.
+	 */
+	if (!enableFsync ||
+		sync_method == SYNC_METHOD_OPEN ||
+		sync_method == SYNC_METHOD_OPEN_DSYNC)
+		return;
+
+	/* Measure I/O timing to sync the WAL file */
+	if (track_wal_io_timing)
+		INSTR_TIME_SET_CURRENT(start);
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
 	switch (sync_method)
@@ -10548,7 +10585,8 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
 #endif
 		case SYNC_METHOD_OPEN:
 		case SYNC_METHOD_OPEN_DSYNC:
-			/* write synced it already */
+			/* not reachable */
+			Assert(false);
 			break;
 		default:
 			elog(PANIC, "unrecognized wal_sync_method: %d", sync_method);
@@ -10570,6 +10608,20 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
 	}
 
 	pgstat_report_wait_end();
+
+	/*
+	 * Increment the I/O timing and the number of times WAL files were synced.
+	 */
+	if (track_wal_io_timing)
+	{
+		instr_time	duration;
+
+		INSTR_TIME_SET_CURRENT(duration);
+		INSTR_TIME_SUBTRACT(duration, start);
+		WalStats.m_wal_sync_time += INSTR_TIME_GET_MICROSEC(duration);
+	}
+
+	WalStats.m_wal_sync++;
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fc94a73a54..51ba1b5826 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1005,6 +1005,10 @@ CREATE VIEW pg_stat_wal AS
         w.wal_fpi,
         w.wal_bytes,
         w.wal_buffers_full,
+        w.wal_write,
+        w.wal_sync,
+        w.wal_write_time,
+        w.wal_sync_time,
         w.stats_reset
     FROM pg_stat_get_wal() w;
 
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 0bbeece19d..3894f4a270 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -505,7 +505,7 @@ CheckpointerMain(void)
 		pgstat_send_bgwriter();
 
 		/* Send WAL statistics to the stats collector. */
-		pgstat_send_wal();
+		pgstat_report_wal();
 
 		/*
 		 * If any checkpoint flags have been set, redo the loop to handle the
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f75b52719d..6a51c39396 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -146,8 +146,8 @@ PgStat_MsgWal WalStats;
 
 /*
  * WAL usage counters saved from pgWALUsage at the previous call to
- * pgstat_send_wal(). This is used to calculate how much WAL usage
- * happens between pgstat_send_wal() calls, by substracting
+ * pgstat_report_wal(). This is used to calculate how much WAL usage
+ * happens between pgstat_report_wal() calls, by substracting
  * the previous counters from the current ones.
  */
 static WalUsage prevWalUsage;
@@ -975,7 +975,7 @@ pgstat_report_stat(bool disconnect)
 	pgstat_send_funcstats();
 
 	/* Send WAL statistics */
-	pgstat_send_wal();
+	pgstat_report_wal();
 
 	/* Finally send SLRU statistics */
 	pgstat_send_slru();
@@ -3118,7 +3118,7 @@ pgstat_initialize(void)
 	}
 
 	/*
-	 * Initialize prevWalUsage with pgWalUsage so that pgstat_send_wal() can
+	 * Initialize prevWalUsage with pgWalUsage so that pgstat_report_wal() can
 	 * calculate how much pgWalUsage counters are increased by substracting
 	 * prevWalUsage from pgWalUsage.
 	 */
@@ -4667,17 +4667,17 @@ pgstat_send_bgwriter(void)
 }
 
 /* ----------
- * pgstat_send_wal() -
+ * pgstat_report_wal() -
  *
- *		Send WAL statistics to the collector
+ * Calculate how much WAL usage counters are increased and send
+ * WAL statistics to the collector.
+ *
+ * Must be called by processes that generate WAL.
  * ----------
  */
 void
-pgstat_send_wal(void)
+pgstat_report_wal(void)
 {
-	/* We assume this initializes to zeroes */
-	static const PgStat_MsgWal all_zeroes;
-
 	WalUsage	walusage;
 
 	/*
@@ -4692,13 +4692,56 @@ pgstat_send_wal(void)
 	WalStats.m_wal_fpi = walusage.wal_fpi;
 	WalStats.m_wal_bytes = walusage.wal_bytes;
 
+	/*
+	 * Send WAL stats message to the collector.
+	 */
+	if (!pgstat_send_wal(true))
+		return;
+
+	/*
+	 * Save the current counters for the subsequent calculation of WAL usage.
+	 */
+	prevWalUsage = pgWalUsage;
+}
+
+/* ----------
+ * pgstat_send_wal() -
+ *
+ *	Send WAL statistics to the collector.
+ *
+ * If 'force' is not set, WAL stats message is only sent if enough time has
+ * passed since last one was sent to reach PGSTAT_STAT_INTERVAL.
+ *
+ * Return true if the message is sent, and false otherwise.
+ * ----------
+ */
+bool
+pgstat_send_wal(bool force)
+{
+	/* We assume this initializes to zeroes */
+	static const PgStat_MsgWal all_zeroes;
+	static TimestampTz sendTime = 0;
+
 	/*
 	 * This function can be called even if nothing at all has happened. In
 	 * this case, avoid sending a completely empty message to the stats
 	 * collector.
 	 */
 	if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
-		return;
+		return false;
+
+	if (!force)
+	{
+		TimestampTz now = GetCurrentTimestamp();
+
+		/*
+		 * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
+		 * msec since we last sent one.
+		 */
+		if (!TimestampDifferenceExceeds(sendTime, now, PGSTAT_STAT_INTERVAL))
+			return false;
+		sendTime = now;
+	}
 
 	/*
 	 * Prepare and send the message
@@ -4706,15 +4749,12 @@ pgstat_send_wal(void)
 	pgstat_setheader(&WalStats.m_hdr, PGSTAT_MTYPE_WAL);
 	pgstat_send(&WalStats, sizeof(WalStats));
 
-	/*
-	 * Save the current counters for the subsequent calculation of WAL usage.
-	 */
-	prevWalUsage = pgWalUsage;
-
 	/*
 	 * Clear out the statistics buffer, so it can be re-used.
 	 */
 	MemSet(&WalStats, 0, sizeof(WalStats));
+
+	return true;
 }
 
 /* ----------
@@ -6892,6 +6932,10 @@ pgstat_recv_wal(PgStat_MsgWal *msg, int len)
 	walStats.wal_fpi += msg->m_wal_fpi;
 	walStats.wal_bytes += msg->m_wal_bytes;
 	walStats.wal_buffers_full += msg->m_wal_buffers_full;
+	walStats.wal_write += msg->m_wal_write;
+	walStats.wal_sync += msg->m_wal_sync;
+	walStats.wal_write_time += msg->m_wal_write_time;
+	walStats.wal_sync_time += msg->m_wal_sync_time;
 }
 
 /* ----------
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index 4f1a8e356b..132df29aba 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -253,6 +253,9 @@ WalWriterMain(void)
 		else if (left_till_hibernate > 0)
 			left_till_hibernate--;
 
+		/* Send WAL statistics to the stats collector */
+		pgstat_send_wal(false);
+
 		/*
 		 * Sleep until we are signaled or WalWriterDelay has elapsed.  If we
 		 * haven't done anything useful for quite some time, lengthen the
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 62bff52638..04e1a7e8b9 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1799,7 +1799,7 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_wal(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_COLS	5
+#define PG_STAT_GET_WAL_COLS	9
 	TupleDesc	tupdesc;
 	Datum		values[PG_STAT_GET_WAL_COLS];
 	bool		nulls[PG_STAT_GET_WAL_COLS];
@@ -1820,7 +1820,15 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 					   NUMERICOID, -1, 0);
 	TupleDescInitEntry(tupdesc, (AttrNumber) 4, "wal_buffers_full",
 					   INT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 5, "stats_reset",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 5, "wal_write",
+					   INT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 6, "wal_sync",
+					   INT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 7, "wal_write_time",
+					   FLOAT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 8, "wal_sync_time",
+					   FLOAT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 9, "stats_reset",
 					   TIMESTAMPTZOID, -1, 0);
 
 	BlessTupleDesc(tupdesc);
@@ -1840,7 +1848,14 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 									Int32GetDatum(-1));
 
 	values[3] = Int64GetDatum(wal_stats->wal_buffers_full);
-	values[4] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
+	values[4] = Int64GetDatum(wal_stats->wal_write);
+	values[5] = Int64GetDatum(wal_stats->wal_sync);
+
+	/* Convert counters from microsec to millisec for display */
+	values[6] = Float8GetDatum(((double) wal_stats->wal_write_time) / 1000.0);
+	values[7] = Float8GetDatum(((double) wal_stats->wal_sync_time) / 1000.0);
+
+	values[8] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
 
 	/* Returns the record as Datum */
 	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 442850e8ad..da05b5acb1 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1485,6 +1485,15 @@ static struct config_bool ConfigureNamesBool[] =
 		false,
 		NULL, NULL, NULL
 	},
+	{
+		{"track_wal_io_timing", PGC_SUSET, STATS_COLLECTOR,
+			gettext_noop("Collects timing statistics for WAL I/O activity."),
+			NULL
+		},
+		&track_wal_io_timing,
+		false,
+		NULL, NULL, NULL
+	},
 
 	{
 		{"update_process_title", PGC_SUSET, PROCESS_TITLE,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index ee06528bb0..c6483fa1ff 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -586,6 +586,7 @@
 #track_activities = on
 #track_counts = on
 #track_io_timing = off
+#track_wal_io_timing = off
 #track_functions = none			# none, pl, all
 #track_activity_query_size = 1024	# (change requires restart)
 #stats_temp_directory = 'pg_stat_tmp'
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 75ec1073bd..1e53d9d4ca 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -131,6 +131,7 @@ extern int	recovery_min_apply_delay;
 extern char *PrimaryConnInfo;
 extern char *PrimarySlotName;
 extern bool wal_receiver_create_temp_slot;
+extern bool track_wal_io_timing;
 
 /* indirectly set via GUC system */
 extern TransactionId recoveryTargetXid;
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index d2cfe9b6a3..41ceaf1ee7 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5546,9 +5546,9 @@
 { oid => '1136', descr => 'statistics: information about WAL activity',
   proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
   proparallel => 'r', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int8,int8,numeric,int8,timestamptz}',
-  proargmodes => '{o,o,o,o,o}',
-  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,stats_reset}',
+  proallargtypes => '{int8,int8,numeric,int8,int8,int8,float8,float8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o}',
+  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,wal_write,wal_sync,wal_write_time,wal_sync_time,stats_reset}',
   prosrc => 'pg_stat_get_wal' },
 
 { oid => '2306', descr => 'statistics: information about SLRU caches',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 724068cf87..5c6c8efc5a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -474,6 +474,12 @@ typedef struct PgStat_MsgWal
 	PgStat_Counter m_wal_fpi;
 	uint64		m_wal_bytes;
 	PgStat_Counter m_wal_buffers_full;
+	PgStat_Counter m_wal_write;
+	PgStat_Counter m_wal_sync;
+	PgStat_Counter m_wal_write_time;	/* time spent writing wal records in
+										 * microseconds */
+	PgStat_Counter m_wal_sync_time; /* time spent syncing wal records in
+									 * microseconds */
 } PgStat_MsgWal;
 
 /* ----------
@@ -839,6 +845,10 @@ typedef struct PgStat_WalStats
 	PgStat_Counter wal_fpi;
 	uint64		wal_bytes;
 	PgStat_Counter wal_buffers_full;
+	PgStat_Counter wal_write;
+	PgStat_Counter wal_sync;
+	PgStat_Counter wal_write_time;
+	PgStat_Counter wal_sync_time;
 	TimestampTz stat_reset_timestamp;
 } PgStat_WalStats;
 
@@ -1590,7 +1600,8 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
 
 extern void pgstat_send_archiver(const char *xlog, bool failed);
 extern void pgstat_send_bgwriter(void);
-extern void pgstat_send_wal(void);
+extern void pgstat_report_wal(void);
+extern bool pgstat_send_wal(bool force);
 
 /* ----------
  * Support functions for the SQL-callable functions to
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index b1c9b7bdfe..e03ef0555a 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2160,8 +2160,12 @@ pg_stat_wal| SELECT w.wal_records,
     w.wal_fpi,
     w.wal_bytes,
     w.wal_buffers_full,
+    w.wal_write,
+    w.wal_sync,
+    w.wal_write_time,
+    w.wal_sync_time,
     w.stats_reset
-   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, stats_reset);
+   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, wal_write, wal_sync, wal_write_time, wal_sync_time, stats_reset);
 pg_stat_wal_receiver| SELECT s.pid,
     s.status,
     s.receive_start_lsn,

#47

Fujii Masao

masao.fujii@oss.nttdata.com

almost 5 years ago

In reply to: Masahiro Ikeda (#46)

1 attachment(s)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021/03/08 19:42, Masahiro Ikeda wrote:

On 2021-03-08 13:44, Fujii Masao wrote:
On 2021/03/05 19:54, Masahiro Ikeda wrote:
On 2021-03-05 12:47, Fujii Masao wrote:
On 2021/03/05 8:38, Masahiro Ikeda wrote:
On 2021-03-05 01:02, Fujii Masao wrote:
On 2021/03/04 16:14, Masahiro Ikeda wrote:
On 2021-03-03 20:27, Masahiro Ikeda wrote:
On 2021-03-03 16:30, Fujii Masao wrote:
On 2021/03/03 14:33, Masahiro Ikeda wrote:

On 2021-02-24 16:14, Fujii Masao wrote:

On 2021/02/15 11:59, Masahiro Ikeda wrote:

On 2021-02-10 00:51, David G. Johnston wrote:

On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

I pgindented the patches.

... <function>XLogWrite</function>, which is invoked during an
<function>XLogFlush</function> request (see ...). This is also
incremented by the WAL receiver during replication.

("which normally called" should be "which is normally called" or
"which normally is called" if you want to keep true to the original)
You missed the adding the space before an opening parenthesis here and
elsewhere (probably copy-paste)

is ether -> is either
"This parameter is off by default as it will repeatedly query the
operating system..."
", because" -> "as"

Thanks, I fixed them.

wal_write_time and the sync items also need the note: "This is also
incremented by the WAL receiver during replication."

I skipped changing it since I separated the stats for the WAL receiver
in pg_stat_wal_receiver.

"The number of times it happened..." -> " (the tally of this event is
reported in wal_buffers_full in....) This is undesirable because ..."

Thanks, I fixed it.

I notice that the patch for WAL receiver doesn't require explicitly
computing the sync statistics but does require computing the write
statistics. This is because of the presence of issue_xlog_fsync but
absence of an equivalent pg_xlog_pwrite. Additionally, I observe that
the XLogWrite code path calls pgstat_report_wait_*() while the WAL
receiver path does not. It seems technically straight-forward to
refactor here to avoid the almost-duplicated logic in the two places,
though I suspect there may be a trade-off for not adding another
function call to the stack given the importance of WAL processing
(though that seems marginalized compared to the cost of actually
writing the WAL). Or, as Fujii noted, go the other way and don't have
any shared code between the two but instead implement the WAL receiver
one to use pg_stat_wal_receiver instead. In either case, this
half-and-half implementation seems undesirable.

OK, as Fujii-san mentioned, I separated the WAL receiver stats.
(v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)

Thanks for updating the patches!

I added the infrastructure code to communicate the WAL receiver stats messages between the WAL receiver and the stats collector, and
the stats for WAL receiver is counted in pg_stat_wal_receiver.
What do you think?

On second thought, this idea seems not good. Because those stats are
collected between multiple walreceivers, but other values in
pg_stat_wal_receiver is only related to the walreceiver process running
at that moment. IOW, it seems strange that some values show dynamic
stats and the others show collected stats, even though they are in
the same view pg_stat_wal_receiver. Thought?

OK, I fixed it.
The stats collected in the WAL receiver is exposed in pg_stat_wal view in v11 patch.

Thanks for updating the patches! I'm now reading 001 patch.
+    /* Check whether the WAL file was synced to disk right now */
+    if (enableFsync &&
+        (sync_method == SYNC_METHOD_FSYNC ||
+         sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
+         sync_method == SYNC_METHOD_FDATASYNC))
+    {
Isn't it better to make issue_xlog_fsync() return immediately
if enableFsync is off, sync_method is open_sync or open_data_sync,
to simplify the code more?
Thanks for the comments.
I added the above code in v12 patch.
+        /*
+         * Send WAL statistics only if WalWriterDelay has elapsed to minimize
+         * the overhead in WAL-writing.
+         */
+        if (rc & WL_TIMEOUT)
+            pgstat_send_wal();
On second thought, this change means that it always takes wal_writer_delay
before walwriter's WAL stats is sent after XLogBackgroundFlush() is called.
For example, if wal_writer_delay is set to several seconds, some values in
pg_stat_wal would be not up-to-date meaninglessly for those seconds.
So I'm thinking to withdraw my previous comment and it's ok to send
the stats every after XLogBackgroundFlush() is called. Thought?
Thanks, I didn't notice that.

Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's
default value is 200msec and it may be set shorter time.
Yeah, if wal_writer_delay is set to very small value, there is a risk
that the WAL stats are sent too frequently. I agree that's a problem.
Why don't to make another way to check the timestamp?
+               /*
+                * Don't send a message unless it's been at least
PGSTAT_STAT_INTERVAL
+                * msec since we last sent one
+                */
+               now = GetCurrentTimestamp();
+               if (TimestampDifferenceExceeds(last_report, now,
PGSTAT_STAT_INTERVAL))
+               {
+                       pgstat_send_wal();
+                       last_report = now;
+               }
+
Although I worried that it's better to add the check code in pgstat_send_wal(),
Agreed.

I didn't do so because to avoid to double check PGSTAT_STAT_INTERVAL.
pgstat_send_wal() is invoked pg_report_stat() and it already checks the
PGSTAT_STAT_INTERVAL.

I think that we can do that. What about the attached patch?
Thanks, I thought it's better.
I forgot to remove an unused variable.
The attached v13 patch is fixed.

Thanks for updating the patch!
+        w.wal_write,
+        w.wal_write_time,
+        w.wal_sync,
+        w.wal_sync_time,
It's more natural to put wal_write_time and wal_sync_time next to
each other? That is, what about the following order of columns?

wal_write
wal_sync
wal_write_time
wal_sync_time
Yes, I fixed it.
-        case SYNC_METHOD_OPEN:
-        case SYNC_METHOD_OPEN_DSYNC:
-            /* write synced it already */
-            break;

IMO it's better to add Assert(false) here to ensure that we never reach
here, as follows. Thought?
+        case SYNC_METHOD_OPEN:
+        case SYNC_METHOD_OPEN_DSYNC:
+            /* not reachable */
+            Assert(false);
I agree.

Even when a backend exits, it sends the stats via pgstat_beshutdown_hook().
On the other hand, walwriter doesn't do that. Walwriter also should send
the stats even at its exit? Otherwise some stats can fail to be collected.
But ISTM that this issue existed from before, for example checkpointer
doesn't call pgstat_send_bgwriter() at its exit, so it's overkill to fix
this issue in this patch?

Thanks, I thought it's better to do so.
I added the shutdown hook for the walwriter and the checkpointer in v14-0003 patch.
Thanks!

Seems you forgot to include the changes of expected/rules.out in 0001 patch,
and which caused the regression test to fail. Attached is the updated version
of the patch. I included expected/rules.out in it.
Sorry.
+    PgStat_Counter m_wal_write_time;    /* time spend writing wal records in
+                                         * micro seconds */
+    PgStat_Counter m_wal_sync_time; /* time spend syncing wal records in micro
+                                     * seconds */
IMO "spend" should be "spent". Also "micro seconds" should be "microseconds"
in sake of consistent with other comments in pgstat.h. I fixed them.
Thanks.

Regarding pgstat_report_wal() and pgstat_send_wal(), I found one bug. Even
when pgstat_send_wal() returned without sending any message,
pgstat_report_wal() saved current pgWalUsage and that counter was used for
the subsequent calculation of WAL usage. This caused some counters not to
be sent to the collector. This is a bug that I added. I fixed this bug.

Thanks.
+    walStats.wal_write += msg->m_wal_write;
+    walStats.wal_write_time += msg->m_wal_write_time;
+    walStats.wal_sync += msg->m_wal_sync;
+    walStats.wal_sync_time += msg->m_wal_sync_time;
I changed the order of the above in pgstat.c so that wal_write_time and
wal_sync_time are placed in next to each other.
I forgot to fix them, thanks.
The followings are the comments for the docs part. I've not updated this
in the patch yet because I'm not sure how to change them for now.
+       Number of times WAL buffers were written out to disk via
+       <function>XLogWrite</function>, which is invoked during an
+       <function>XLogFlush</function> request (see <xref
linkend="wal-configuration"/>)
+      </para></entry>
XLogWrite() can be invoked during the functions other than XLogFlush().
For example, XLogBackgroundFlush(). So the above description might be
confusing?
+       Number of times WAL files were synced to disk via
+       <function>issue_xlog_fsync</function>, which is invoked during an
+       <function>XLogFlush</function> request (see <xref
linkend="wal-configuration"/>)
Same as above.
Yes, why don't you remove "XLogFlush" in the above comments
because XLogWrite() description is covered in wal.sgml?

But, now it's mentioned only for backend,
I added the comments for the wal writer in the attached patch.
+       while <xref linkend="guc-wal-sync-method"/> was set to one of the
+       "sync at commit" options (i.e., <literal>fdatasync</literal>,
+       <literal>fsync</literal>, or <literal>fsync_writethrough</literal>).
Even open_sync and open_datasync do the sync at commit. No? I'm not sure
if "sync at commit" is right term to indicate fdatasync, fsync and
fsync_writethrough.
Yes, why don't you change to the following comments?

```
        while <xref linkend="guc-wal-sync-method"/> was set to one of the
        options which specific fsync method is called (i.e., <literal>fdatasync</literal>,
        <literal>fsync</literal>, or <literal>fsync_writethrough</literal>)
```

+       <literal>open_sync</literal>. Units are in milliseconds with
microsecond resolution.

"with microsecond resolution" part is really necessary?

I removed it because blk_read_time in pg_stat_database is the same above,
but it doesn't mention it.
+   transaction records are flushed to permanent storage.
+   <function>XLogFlush</function> calls <function>XLogWrite</function> to write
+   and <function>issue_xlog_fsync</function> to flush them, which are
counted as
+   <literal>wal_write</literal> and <literal>wal_sync</literal> in
+   <xref linkend="pg-stat-wal-view"/>. On systems with high log output,
This description might cause users to misread that XLogFlush() calls
issue_xlog_fsync(). Since issue_xlog_fsync() is called by XLogWrite(),
ISTM that this description needs to be updated.
I understood. I fixed to mention that XLogWrite()
calls issue_xlog_fsync().

Each line in the above seems to end with a space character.
This space character should be removed.

Sorry for that. I removed it.
Thanks for updating the patch! I think it's getting good shape!
- pid  | wait_event_type | wait_event
+ pid  | wait_event_type | wait_event
This change is not necessary?
No, sorry.
I removed it by mistake when I remove trailing space characters.
-   every <xref linkend="guc-wal-writer-delay"/> milliseconds.
+   every <xref linkend="guc-wal-writer-delay"/> milliseconds, which calls
+   <function>XLogWrite</function> to write and <function>XLogWrite</function>
+   <function>issue_xlog_fsync</function> to flush them. They are counted as
+   <literal>wal_write</literal> and <literal>wal_sync</literal> in
+   <xref linkend="pg-stat-wal-view"/>.
Isn't it better to avoid using the terms like XLogWrite or issue_xlog_fsync
before explaining what they are? They are explained later. At least for me
I'm ok without this change.
OK. I removed them and add a new paragraph.
-   to write (move to kernel cache) a few filled <acronym>WAL</acronym>
-   buffers. This is undesirable because <function>XLogInsertRecord</function>
+   to call <function>XLogWrite</function> to write (move to kernel cache) a
+   few filled <acronym>WAL</acronym> buffers (the tally of this event
is reported in
+   <literal>wal_buffers_full</literal> in <xref linkend="pg-stat-wal-view"/>).
+   This is undesirable because <function>XLogInsertRecord</function>
This paragraph explains the relationshp between WAL writes and WAL
buffers. I don't think it's good to add different context to this
paragraph. Instead, what about adding new paragraph like the follwing?

----------------------------------
When track_wal_io_timing is enabled, the total amounts of time
XLogWrite writes and issue_xlog_fsync syncs WAL data to disk are
counted as wal_write_time and wal_sync_time in pg_stat_wal view,
respectively. XLogWrite is normally called by XLogInsertRecord (when
there is no space for the new record in WAL buffers), XLogFlush and
the WAL writer, to write WAL buffers to disk and call
issue_xlog_fsync. If wal_sync_method is either open_datasync or
open_sync, a write operation in XLogWrite guarantees to sync written
WAL data to disk and issue_xlog_fsync does nothing. If wal_sync_method
is either fdatasync, fsync, or fsync_writethrough, the write operation
moves WAL buffer to kernel cache and issue_xlog_fsync syncs WAL files
to disk. Regardless of the setting of track_wal_io_timing, the numbers
of times XLogWrite writes and issue_xlog_fsync syncs WAL data to disk
are also counted as wal_write and wal_sync in pg_stat_wal,
respectively.
----------------------------------
Thanks, I agree it's better.

+ <function>issue_xlog_fsync</function> (see <xref
linkend="wal-configuration"/>)

"request" should be place just before "(see"?

Yes, thanks.
+       Number of times WAL files were synced to disk via
+       <function>issue_xlog_fsync</function> (see <xref
linkend="wal-configuration"/>)
+       while <xref linkend="guc-wal-sync-method"/> was set to one of the
+       options which specific fsync method is called (i.e.,
<literal>fdatasync</literal>,
+       <literal>fsync</literal>, or <literal>fsync_writethrough</literal>).
Isn't it better to mention the case of fsync=off? What about the following?

----------------------------------
Number of times WAL files were synced to disk via issue_xlog_fsync
(see ...). This is zero when fsync is off or wal_sync_method is either
open_datasync or open_sync.
----------------------------------
Yes.

+ Total amount of time spent writing WAL buffers were written
out to disk via

"were written out" is not necessary?

Yes, removed it.
+       Total amount of time spent syncing WAL files to disk via
+       <function>issue_xlog_fsync</function> request (see <xref
linkend="wal-configuration"/>)
+       while <xref linkend="guc-wal-sync-method"/> was set to one of the
+       options which specific fsync method is called (i.e.,
<literal>fdatasync</literal>,
+       <literal>fsync</literal>, or <literal>fsync_writethrough</literal>).
+       Units are in milliseconds.
+       This is zero when <xref linkend="guc-track-wal-io-timing"/> is disabled.
Isn't it better to explain the case where this counter is zero a bit
more clearly as follows?

---------------------
This is zero when track_wal_io_timing is disabled, fsync is off, or
wal_sync_method is either open_datasync or open_sync.
---------------------
Yes, thanks.

Thanks for updating the patch! I applied cosmetic changes to that.
Patch attached. Barring any objection, I will commit this version.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachments:

v16-0001-Add-statistics-related-to-write-sync-wal-records_fujii.patchtext/plain; charset=UTF-8; name=v16-0001-Add-statistics-related-to-write-sync-wal-records_fujii.patch; x-mac-creator=0; x-mac-type=0Download

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 967de73596..529876895b 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7450,7 +7450,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       <listitem>
        <para>
         Enables timing of database I/O calls.  This parameter is off by
-        default, because it will repeatedly query the operating system for
+        default, as it will repeatedly query the operating system for
         the current time, which may cause significant overhead on some
         platforms.  You can use the <xref linkend="pgtesttiming"/> tool to
         measure the overhead of timing on your system.
@@ -7464,6 +7464,27 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-track-wal-io-timing" xreflabel="track_wal_io_timing">
+      <term><varname>track_wal_io_timing</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>track_wal_io_timing</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables timing of WAL I/O calls. This parameter is off by default,
+        as it will repeatedly query the operating system for the current time,
+        which may cause significant overhead on some platforms.
+        You can use the <application>pg_test_timing</application> tool to
+        measure the overhead of timing on your system.
+        I/O timing information is
+        displayed in <link linkend="monitoring-pg-stat-wal-view">
+        <structname>pg_stat_wal</structname></link>.  Only superusers can
+        change this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-track-functions" xreflabel="track_functions">
       <term><varname>track_functions</varname> (<type>enum</type>)
       <indexterm>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3513e127b7..cb57d8e262 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -185,6 +185,11 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    of block read and write times.
   </para>
 
+  <para>
+   The parameter <xref linkend="guc-track-wal-io-timing"/> enables monitoring
+   of WAL write times.
+  </para>
+
   <para>
    Normally these parameters are set in <filename>postgresql.conf</filename> so
    that they apply to all server processes, but it is possible to turn
@@ -3487,6 +3492,63 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </para></entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of times WAL buffers were written out to disk via
+       <function>XLogWrite</function> request.
+       See <xref linkend="wal-configuration"/> for more information about
+       internal WAL function <function>XLogWrite</function>.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of times WAL files were synced to disk via
+       <function>issue_xlog_fsync</function> request
+       (if <xref linkend="guc-fsync"/> is <literal>on</literal> and
+       <xref linkend="guc-wal-sync-method"/> is either
+       <literal>fdatasync</literal>, <literal>fsync</literal> or
+       <literal>fsync_writethrough</literal>, otherwise zero).
+       See <xref linkend="wal-configuration"/> for more information about
+       internal WAL function <function>issue_xlog_fsync</function>.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time spent writing WAL buffers to disk via
+       <function>XLogWrite</function> request, in milliseconds
+       (if <xref linkend="guc-track-wal-io-timing"/> is enabled,
+       otherwise zero).  This includes the sync time when
+       <varname>wal_sync_method</varname> is either
+       <literal>open_datasync</literal> or <literal>open_sync</literal>.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time spent syncing WAL files to disk via
+       <function>issue_xlog_fsync</function> request, in milliseconds
+       (if <varname>track_wal_io_timing</varname> is enabled,
+       <varname>fsync</varname> is <literal>on</literal>, and
+       <varname>wal_sync_method</varname> is either
+       <literal>fdatasync</literal>, <literal>fsync</literal> or
+       <literal>fsync_writethrough</literal>, otherwise zero).
+      </para></entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index f75527f764..ae4a3c1293 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -767,6 +767,35 @@
    <acronym>WAL</acronym> call being logged to the server log. This
    option might be replaced by a more general mechanism in the future.
   </para>
+
+  <para>
+   There are two internal functions to write WAL data to disk:
+   <function>XLogWrite</function> and <function>issue_xlog_fsync</function>.
+   When <xref linkend="guc-track-wal-io-timing"/> is enabled, the total
+   amounts of time <function>XLogWrite</function> writes and
+   <function>issue_xlog_fsync</function> syncs WAL data to disk are counted as
+   <literal>wal_write_time</literal> and <literal>wal_sync_time</literal> in
+   <xref linkend="pg-stat-wal-view"/>, respectively.
+   <function>XLogWrite</function> is normally called by 
+   <function>XLogInsertRecord</function> (when there is no space for the new
+   record in WAL buffers), <function>XLogFlush</function> and the WAL writer,
+   to write WAL buffers to disk and call <function>issue_xlog_fsync</function>.
+   <function>issue_xlog_fsync</function> is normally called by
+   <function>XLogWrite</function> to sync WAL files to disk.
+   If <varname>wal_sync_method</varname> is either
+   <literal>open_datasync</literal> or <literal>open_sync</literal>,
+   a write operation in <function>XLogWrite</function> guarantees to sync written
+   WAL data to disk and <function>issue_xlog_fsync</function> does nothing.
+   If <varname>wal_sync_method</varname> is either <literal>fdatasync</literal>,
+   <literal>fsync</literal>, or <literal>fsync_writethrough</literal>,
+   the write operation moves WAL buffers to kernel cache and
+   <function>issue_xlog_fsync</function> syncs them to disk. Regardless
+   of the setting of <varname>track_wal_io_timing</varname>, the numbers
+   of times <function>XLogWrite</function> writes and
+   <function>issue_xlog_fsync</function> syncs WAL data to disk are also
+   counted as <literal>wal_write</literal> and <literal>wal_sync</literal>
+   in <structname>pg_stat_wal</structname>, respectively.
+  </para>
  </sect1>
 
  <sect1 id="wal-internals">
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 377afb8732..18af3d4120 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -110,6 +110,7 @@ int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
 int			max_slot_wal_keep_size_mb = -1;
+bool		track_wal_io_timing = false;
 
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
@@ -2533,6 +2534,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			int			written;
+			instr_time	start;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
@@ -2541,9 +2543,30 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			do
 			{
 				errno = 0;
+
+				/* Measure I/O timing to write WAL data */
+				if (track_wal_io_timing)
+					INSTR_TIME_SET_CURRENT(start);
+
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 				written = pg_pwrite(openLogFile, from, nleft, startoffset);
 				pgstat_report_wait_end();
+
+				/*
+				 * Increment the I/O timing and the number of times WAL data
+				 * were written out to disk.
+				 */
+				if (track_wal_io_timing)
+				{
+					instr_time	duration;
+
+					INSTR_TIME_SET_CURRENT(duration);
+					INSTR_TIME_SUBTRACT(duration, start);
+					WalStats.m_wal_write_time += INSTR_TIME_GET_MICROSEC(duration);
+				}
+
+				WalStats.m_wal_write++;
+
 				if (written <= 0)
 				{
 					char		xlogfname[MAXFNAMELEN];
@@ -10524,6 +10547,20 @@ void
 issue_xlog_fsync(int fd, XLogSegNo segno)
 {
 	char	   *msg = NULL;
+	instr_time	start;
+
+	/*
+	 * Quick exit if fsync is disabled or write() has already synced the WAL
+	 * file.
+	 */
+	if (!enableFsync ||
+		sync_method == SYNC_METHOD_OPEN ||
+		sync_method == SYNC_METHOD_OPEN_DSYNC)
+		return;
+
+	/* Measure I/O timing to sync the WAL file */
+	if (track_wal_io_timing)
+		INSTR_TIME_SET_CURRENT(start);
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
 	switch (sync_method)
@@ -10546,7 +10583,8 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
 #endif
 		case SYNC_METHOD_OPEN:
 		case SYNC_METHOD_OPEN_DSYNC:
-			/* write synced it already */
+			/* not reachable */
+			Assert(false);
 			break;
 		default:
 			elog(PANIC, "unrecognized wal_sync_method: %d", sync_method);
@@ -10568,6 +10606,20 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
 	}
 
 	pgstat_report_wait_end();
+
+	/*
+	 * Increment the I/O timing and the number of times WAL files were synced.
+	 */
+	if (track_wal_io_timing)
+	{
+		instr_time	duration;
+
+		INSTR_TIME_SET_CURRENT(duration);
+		INSTR_TIME_SUBTRACT(duration, start);
+		WalStats.m_wal_sync_time += INSTR_TIME_GET_MICROSEC(duration);
+	}
+
+	WalStats.m_wal_sync++;
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fc94a73a54..51ba1b5826 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1005,6 +1005,10 @@ CREATE VIEW pg_stat_wal AS
         w.wal_fpi,
         w.wal_bytes,
         w.wal_buffers_full,
+        w.wal_write,
+        w.wal_sync,
+        w.wal_write_time,
+        w.wal_sync_time,
         w.stats_reset
     FROM pg_stat_get_wal() w;
 
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 76f9f98ebb..57c4d5a5d9 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -505,7 +505,7 @@ CheckpointerMain(void)
 		pgstat_send_bgwriter();
 
 		/* Send WAL statistics to the stats collector. */
-		pgstat_send_wal();
+		pgstat_report_wal();
 
 		/*
 		 * If any checkpoint flags have been set, redo the loop to handle the
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f75b52719d..6a51c39396 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -146,8 +146,8 @@ PgStat_MsgWal WalStats;
 
 /*
  * WAL usage counters saved from pgWALUsage at the previous call to
- * pgstat_send_wal(). This is used to calculate how much WAL usage
- * happens between pgstat_send_wal() calls, by substracting
+ * pgstat_report_wal(). This is used to calculate how much WAL usage
+ * happens between pgstat_report_wal() calls, by substracting
  * the previous counters from the current ones.
  */
 static WalUsage prevWalUsage;
@@ -975,7 +975,7 @@ pgstat_report_stat(bool disconnect)
 	pgstat_send_funcstats();
 
 	/* Send WAL statistics */
-	pgstat_send_wal();
+	pgstat_report_wal();
 
 	/* Finally send SLRU statistics */
 	pgstat_send_slru();
@@ -3118,7 +3118,7 @@ pgstat_initialize(void)
 	}
 
 	/*
-	 * Initialize prevWalUsage with pgWalUsage so that pgstat_send_wal() can
+	 * Initialize prevWalUsage with pgWalUsage so that pgstat_report_wal() can
 	 * calculate how much pgWalUsage counters are increased by substracting
 	 * prevWalUsage from pgWalUsage.
 	 */
@@ -4667,17 +4667,17 @@ pgstat_send_bgwriter(void)
 }
 
 /* ----------
- * pgstat_send_wal() -
+ * pgstat_report_wal() -
  *
- *		Send WAL statistics to the collector
+ * Calculate how much WAL usage counters are increased and send
+ * WAL statistics to the collector.
+ *
+ * Must be called by processes that generate WAL.
  * ----------
  */
 void
-pgstat_send_wal(void)
+pgstat_report_wal(void)
 {
-	/* We assume this initializes to zeroes */
-	static const PgStat_MsgWal all_zeroes;
-
 	WalUsage	walusage;
 
 	/*
@@ -4692,13 +4692,56 @@ pgstat_send_wal(void)
 	WalStats.m_wal_fpi = walusage.wal_fpi;
 	WalStats.m_wal_bytes = walusage.wal_bytes;
 
+	/*
+	 * Send WAL stats message to the collector.
+	 */
+	if (!pgstat_send_wal(true))
+		return;
+
+	/*
+	 * Save the current counters for the subsequent calculation of WAL usage.
+	 */
+	prevWalUsage = pgWalUsage;
+}
+
+/* ----------
+ * pgstat_send_wal() -
+ *
+ *	Send WAL statistics to the collector.
+ *
+ * If 'force' is not set, WAL stats message is only sent if enough time has
+ * passed since last one was sent to reach PGSTAT_STAT_INTERVAL.
+ *
+ * Return true if the message is sent, and false otherwise.
+ * ----------
+ */
+bool
+pgstat_send_wal(bool force)
+{
+	/* We assume this initializes to zeroes */
+	static const PgStat_MsgWal all_zeroes;
+	static TimestampTz sendTime = 0;
+
 	/*
 	 * This function can be called even if nothing at all has happened. In
 	 * this case, avoid sending a completely empty message to the stats
 	 * collector.
 	 */
 	if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
-		return;
+		return false;
+
+	if (!force)
+	{
+		TimestampTz now = GetCurrentTimestamp();
+
+		/*
+		 * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
+		 * msec since we last sent one.
+		 */
+		if (!TimestampDifferenceExceeds(sendTime, now, PGSTAT_STAT_INTERVAL))
+			return false;
+		sendTime = now;
+	}
 
 	/*
 	 * Prepare and send the message
@@ -4706,15 +4749,12 @@ pgstat_send_wal(void)
 	pgstat_setheader(&WalStats.m_hdr, PGSTAT_MTYPE_WAL);
 	pgstat_send(&WalStats, sizeof(WalStats));
 
-	/*
-	 * Save the current counters for the subsequent calculation of WAL usage.
-	 */
-	prevWalUsage = pgWalUsage;
-
 	/*
 	 * Clear out the statistics buffer, so it can be re-used.
 	 */
 	MemSet(&WalStats, 0, sizeof(WalStats));
+
+	return true;
 }
 
 /* ----------
@@ -6892,6 +6932,10 @@ pgstat_recv_wal(PgStat_MsgWal *msg, int len)
 	walStats.wal_fpi += msg->m_wal_fpi;
 	walStats.wal_bytes += msg->m_wal_bytes;
 	walStats.wal_buffers_full += msg->m_wal_buffers_full;
+	walStats.wal_write += msg->m_wal_write;
+	walStats.wal_sync += msg->m_wal_sync;
+	walStats.wal_write_time += msg->m_wal_write_time;
+	walStats.wal_sync_time += msg->m_wal_sync_time;
 }
 
 /* ----------
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index 4f1a8e356b..132df29aba 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -253,6 +253,9 @@ WalWriterMain(void)
 		else if (left_till_hibernate > 0)
 			left_till_hibernate--;
 
+		/* Send WAL statistics to the stats collector */
+		pgstat_send_wal(false);
+
 		/*
 		 * Sleep until we are signaled or WalWriterDelay has elapsed.  If we
 		 * haven't done anything useful for quite some time, lengthen the
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 62bff52638..04e1a7e8b9 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1799,7 +1799,7 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_wal(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_COLS	5
+#define PG_STAT_GET_WAL_COLS	9
 	TupleDesc	tupdesc;
 	Datum		values[PG_STAT_GET_WAL_COLS];
 	bool		nulls[PG_STAT_GET_WAL_COLS];
@@ -1820,7 +1820,15 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 					   NUMERICOID, -1, 0);
 	TupleDescInitEntry(tupdesc, (AttrNumber) 4, "wal_buffers_full",
 					   INT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 5, "stats_reset",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 5, "wal_write",
+					   INT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 6, "wal_sync",
+					   INT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 7, "wal_write_time",
+					   FLOAT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 8, "wal_sync_time",
+					   FLOAT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 9, "stats_reset",
 					   TIMESTAMPTZOID, -1, 0);
 
 	BlessTupleDesc(tupdesc);
@@ -1840,7 +1848,14 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 									Int32GetDatum(-1));
 
 	values[3] = Int64GetDatum(wal_stats->wal_buffers_full);
-	values[4] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
+	values[4] = Int64GetDatum(wal_stats->wal_write);
+	values[5] = Int64GetDatum(wal_stats->wal_sync);
+
+	/* Convert counters from microsec to millisec for display */
+	values[6] = Float8GetDatum(((double) wal_stats->wal_write_time) / 1000.0);
+	values[7] = Float8GetDatum(((double) wal_stats->wal_sync_time) / 1000.0);
+
+	values[8] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
 
 	/* Returns the record as Datum */
 	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 3fd1a5fbe2..e337df42cb 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1485,6 +1485,15 @@ static struct config_bool ConfigureNamesBool[] =
 		false,
 		NULL, NULL, NULL
 	},
+	{
+		{"track_wal_io_timing", PGC_SUSET, STATS_COLLECTOR,
+			gettext_noop("Collects timing statistics for WAL I/O activity."),
+			NULL
+		},
+		&track_wal_io_timing,
+		false,
+		NULL, NULL, NULL
+	},
 
 	{
 		{"update_process_title", PGC_SUSET, PROCESS_TITLE,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index ee06528bb0..c6483fa1ff 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -586,6 +586,7 @@
 #track_activities = on
 #track_counts = on
 #track_io_timing = off
+#track_wal_io_timing = off
 #track_functions = none			# none, pl, all
 #track_activity_query_size = 1024	# (change requires restart)
 #stats_temp_directory = 'pg_stat_tmp'
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 75ec1073bd..1e53d9d4ca 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -131,6 +131,7 @@ extern int	recovery_min_apply_delay;
 extern char *PrimaryConnInfo;
 extern char *PrimarySlotName;
 extern bool wal_receiver_create_temp_slot;
+extern bool track_wal_io_timing;
 
 /* indirectly set via GUC system */
 extern TransactionId recoveryTargetXid;
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 506689d8ac..2cded25efd 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5545,9 +5545,9 @@
 { oid => '1136', descr => 'statistics: information about WAL activity',
   proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
   proparallel => 'r', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int8,int8,numeric,int8,timestamptz}',
-  proargmodes => '{o,o,o,o,o}',
-  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,stats_reset}',
+  proallargtypes => '{int8,int8,numeric,int8,int8,int8,float8,float8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o}',
+  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,wal_write,wal_sync,wal_write_time,wal_sync_time,stats_reset}',
   prosrc => 'pg_stat_get_wal' },
 
 { oid => '2306', descr => 'statistics: information about SLRU caches',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 724068cf87..5c6c8efc5a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -474,6 +474,12 @@ typedef struct PgStat_MsgWal
 	PgStat_Counter m_wal_fpi;
 	uint64		m_wal_bytes;
 	PgStat_Counter m_wal_buffers_full;
+	PgStat_Counter m_wal_write;
+	PgStat_Counter m_wal_sync;
+	PgStat_Counter m_wal_write_time;	/* time spent writing wal records in
+										 * microseconds */
+	PgStat_Counter m_wal_sync_time; /* time spent syncing wal records in
+									 * microseconds */
 } PgStat_MsgWal;
 
 /* ----------
@@ -839,6 +845,10 @@ typedef struct PgStat_WalStats
 	PgStat_Counter wal_fpi;
 	uint64		wal_bytes;
 	PgStat_Counter wal_buffers_full;
+	PgStat_Counter wal_write;
+	PgStat_Counter wal_sync;
+	PgStat_Counter wal_write_time;
+	PgStat_Counter wal_sync_time;
 	TimestampTz stat_reset_timestamp;
 } PgStat_WalStats;
 
@@ -1590,7 +1600,8 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
 
 extern void pgstat_send_archiver(const char *xlog, bool failed);
 extern void pgstat_send_bgwriter(void);
-extern void pgstat_send_wal(void);
+extern void pgstat_report_wal(void);
+extern bool pgstat_send_wal(bool force);
 
 /* ----------
  * Support functions for the SQL-callable functions to
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index b1c9b7bdfe..e03ef0555a 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2160,8 +2160,12 @@ pg_stat_wal| SELECT w.wal_records,
     w.wal_fpi,
     w.wal_bytes,
     w.wal_buffers_full,
+    w.wal_write,
+    w.wal_sync,
+    w.wal_write_time,
+    w.wal_sync_time,
     w.stats_reset
-   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, stats_reset);
+   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, wal_write, wal_sync, wal_write_time, wal_sync_time, stats_reset);
 pg_stat_wal_receiver| SELECT s.pid,
     s.status,
     s.receive_start_lsn,

#48

David G. Johnston

david.g.johnston@gmail.com

almost 5 years ago

In reply to: Fujii Masao (#47)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On Mon, Mar 8, 2021 at 8:48 AM Fujii Masao <masao.fujii@oss.nttdata.com>
wrote:

Thanks for updating the patch! I applied cosmetic changes to that.
Patch attached. Barring any objection, I will commit this version.

Read over the patch and it looks good.

One minor "the" omission (in a couple of places, copy-paste style):

+       See <xref linkend="wal-configuration"/> for more information about
+       internal WAL function <function>XLogWrite</function>.

"about *the* internal WAL function"

Also, I'm not sure why you find omitting documentation that the millisecond
field has a fractional part out to microseconds to be helpful.

David J.

#49

Fujii Masao

masao.fujii@oss.nttdata.com

almost 5 years ago

In reply to: David G. Johnston (#48)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021/03/09 4:47, David G. Johnston wrote:

On Mon, Mar 8, 2021 at 8:48 AM Fujii Masao <masao.fujii@oss.nttdata.com <mailto:masao.fujii@oss.nttdata.com>> wrote:

Thanks for updating the patch! I applied cosmetic changes to that.
Patch attached. Barring any objection, I will commit this version.

Read over the patch and it looks good.

Thanks for the review! I committed the patch.

One minor "the" omission (in a couple of places, copy-paste style):
+       See <xref linkend="wal-configuration"/> for more information about
+       internal WAL function <function>XLogWrite</function>.
"about *the* internal WAL function"

I added "the" in such two places. Thanks!

Also, I'm not sure why you find omitting documentation that the millisecond field has a fractional part out to microseconds to be helpful.

If this information should be documented, we should do that for
not only wal_write/sync_time but also other several columns,
for example, pg_stat_database.blk_write_time?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#50

Fujii Masao

masao.fujii@oss.nttdata.com

almost 5 years ago

In reply to: Masahiro Ikeda (#42)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021/03/05 8:38, Masahiro Ikeda wrote:

On 2021-03-05 01:02, Fujii Masao wrote:
On 2021/03/04 16:14, Masahiro Ikeda wrote:
On 2021-03-03 20:27, Masahiro Ikeda wrote:
On 2021-03-03 16:30, Fujii Masao wrote:
On 2021/03/03 14:33, Masahiro Ikeda wrote:

On 2021-02-24 16:14, Fujii Masao wrote:

On 2021/02/15 11:59, Masahiro Ikeda wrote:

On 2021-02-10 00:51, David G. Johnston wrote:

On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

I pgindented the patches.

... <function>XLogWrite</function>, which is invoked during an
<function>XLogFlush</function> request (see ...). This is also
incremented by the WAL receiver during replication.

("which normally called" should be "which is normally called" or
"which normally is called" if you want to keep true to the original)
You missed the adding the space before an opening parenthesis here and
elsewhere (probably copy-paste)

is ether -> is either
"This parameter is off by default as it will repeatedly query the
operating system..."
", because" -> "as"

Thanks, I fixed them.

wal_write_time and the sync items also need the note: "This is also
incremented by the WAL receiver during replication."

I skipped changing it since I separated the stats for the WAL receiver
in pg_stat_wal_receiver.

"The number of times it happened..." -> " (the tally of this event is
reported in wal_buffers_full in....) This is undesirable because ..."

Thanks, I fixed it.

I notice that the patch for WAL receiver doesn't require explicitly
computing the sync statistics but does require computing the write
statistics. This is because of the presence of issue_xlog_fsync but
absence of an equivalent pg_xlog_pwrite. Additionally, I observe that
the XLogWrite code path calls pgstat_report_wait_*() while the WAL
receiver path does not. It seems technically straight-forward to
refactor here to avoid the almost-duplicated logic in the two places,
though I suspect there may be a trade-off for not adding another
function call to the stack given the importance of WAL processing
(though that seems marginalized compared to the cost of actually
writing the WAL). Or, as Fujii noted, go the other way and don't have
any shared code between the two but instead implement the WAL receiver
one to use pg_stat_wal_receiver instead. In either case, this
half-and-half implementation seems undesirable.

OK, as Fujii-san mentioned, I separated the WAL receiver stats.
(v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)

Thanks for updating the patches!

I added the infrastructure code to communicate the WAL receiver stats messages between the WAL receiver and the stats collector, and
the stats for WAL receiver is counted in pg_stat_wal_receiver.
What do you think?

On second thought, this idea seems not good. Because those stats are
collected between multiple walreceivers, but other values in
pg_stat_wal_receiver is only related to the walreceiver process running
at that moment. IOW, it seems strange that some values show dynamic
stats and the others show collected stats, even though they are in
the same view pg_stat_wal_receiver. Thought?

OK, I fixed it.
The stats collected in the WAL receiver is exposed in pg_stat_wal view in v11 patch.

Thanks for updating the patches! I'm now reading 001 patch.
+    /* Check whether the WAL file was synced to disk right now */
+    if (enableFsync &&
+        (sync_method == SYNC_METHOD_FSYNC ||
+         sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
+         sync_method == SYNC_METHOD_FDATASYNC))
+    {
Isn't it better to make issue_xlog_fsync() return immediately
if enableFsync is off, sync_method is open_sync or open_data_sync,
to simplify the code more?
Thanks for the comments.
I added the above code in v12 patch.
+        /*
+         * Send WAL statistics only if WalWriterDelay has elapsed to minimize
+         * the overhead in WAL-writing.
+         */
+        if (rc & WL_TIMEOUT)
+            pgstat_send_wal();
On second thought, this change means that it always takes wal_writer_delay
before walwriter's WAL stats is sent after XLogBackgroundFlush() is called.
For example, if wal_writer_delay is set to several seconds, some values in
pg_stat_wal would be not up-to-date meaninglessly for those seconds.
So I'm thinking to withdraw my previous comment and it's ok to send
the stats every after XLogBackgroundFlush() is called. Thought?
Thanks, I didn't notice that.

Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's
default value is 200msec and it may be set shorter time.
Yeah, if wal_writer_delay is set to very small value, there is a risk
that the WAL stats are sent too frequently. I agree that's a problem.
Why don't to make another way to check the timestamp?
+               /*
+                * Don't send a message unless it's been at least
PGSTAT_STAT_INTERVAL
+                * msec since we last sent one
+                */
+               now = GetCurrentTimestamp();
+               if (TimestampDifferenceExceeds(last_report, now,
PGSTAT_STAT_INTERVAL))
+               {
+                       pgstat_send_wal();
+                       last_report = now;
+               }
+
Although I worried that it's better to add the check code in pgstat_send_wal(),
Agreed.

I didn't do so because to avoid to double check PGSTAT_STAT_INTERVAL.
pgstat_send_wal() is invoked pg_report_stat() and it already checks the
PGSTAT_STAT_INTERVAL.

I think that we can do that. What about the attached patch?
Thanks, I thought it's better.
I forgot to remove an unused variable.
The attached v13 patch is fixed.

Thanks for updating the patch!
+        w.wal_write,
+        w.wal_write_time,
+        w.wal_sync,
+        w.wal_sync_time,
It's more natural to put wal_write_time and wal_sync_time next to
each other? That is, what about the following order of columns?

wal_write
wal_sync
wal_write_time
wal_sync_time
Yes, I fixed it.
-        case SYNC_METHOD_OPEN:
-        case SYNC_METHOD_OPEN_DSYNC:
-            /* write synced it already */
-            break;

IMO it's better to add Assert(false) here to ensure that we never reach
here, as follows. Thought?
+        case SYNC_METHOD_OPEN:
+        case SYNC_METHOD_OPEN_DSYNC:
+            /* not reachable */
+            Assert(false);
I agree.

Even when a backend exits, it sends the stats via pgstat_beshutdown_hook().
On the other hand, walwriter doesn't do that. Walwriter also should send
the stats even at its exit? Otherwise some stats can fail to be collected.
But ISTM that this issue existed from before, for example checkpointer
doesn't call pgstat_send_bgwriter() at its exit, so it's overkill to fix
this issue in this patch?

Thanks, I thought it's better to do so.
I added the shutdown hook for the walwriter and the checkpointer in v14-0003 patch.

Thanks for 0003 patch!

Isn't it overkill to send the stats in the walwriter-exit-callback? IMO we can
just send the stats only when ShutdownRequestPending is true in the walwriter
main loop (maybe just before calling HandleMainLoopInterrupts()).
If we do this, we cannot send the stats when walwriter throws FATAL error.
But that's ok because FATAL error on walwriter causes the server to crash.
Thought?

Also ISTM that we don't need to use the callback for that purpose in
checkpointer because of the same reason. That is, we can send the stats
just after calling ShutdownXLOG(0, 0) in HandleCheckpointerInterrupts().
Thought?

I'm now not sure how much useful these changes are. As far as I read pgstat.c,
when shutdown is requested, the stats collector seems to exit even when
there are outstanding stats messages. So if checkpointer and walwriter send
the stats in their last cycles, those stats might not be collected.

On the other hand, I can think that sending the stats in the last cycles would
improve the situation a bit than now. So I'm inclined to apply those changes...

Of course, there is another direction; we can improve the stats collector so
that it guarantees to collect all the sent stats messages. But I'm afraid
this change might be big.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#51

Masahiro Ikeda

ikedamsh@oss.nttdata.com

almost 5 years ago

In reply to: Fujii Masao (#50)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021-03-09 17:51, Fujii Masao wrote:

On 2021/03/05 8:38, Masahiro Ikeda wrote:
On 2021-03-05 01:02, Fujii Masao wrote:
On 2021/03/04 16:14, Masahiro Ikeda wrote:
On 2021-03-03 20:27, Masahiro Ikeda wrote:
On 2021-03-03 16:30, Fujii Masao wrote:
On 2021/03/03 14:33, Masahiro Ikeda wrote:

On 2021-02-24 16:14, Fujii Masao wrote:

On 2021/02/15 11:59, Masahiro Ikeda wrote:

On 2021-02-10 00:51, David G. Johnston wrote:

On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

I pgindented the patches.

... <function>XLogWrite</function>, which is invoked during an
<function>XLogFlush</function> request (see ...). This is
also
incremented by the WAL receiver during replication.

("which normally called" should be "which is normally called"
or
"which normally is called" if you want to keep true to the
original)
You missed the adding the space before an opening parenthesis
here and
elsewhere (probably copy-paste)

is ether -> is either
"This parameter is off by default as it will repeatedly query
the
operating system..."
", because" -> "as"

Thanks, I fixed them.

wal_write_time and the sync items also need the note: "This is
also
incremented by the WAL receiver during replication."

I skipped changing it since I separated the stats for the WAL
receiver
in pg_stat_wal_receiver.

"The number of times it happened..." -> " (the tally of this
event is
reported in wal_buffers_full in....) This is undesirable
because ..."

Thanks, I fixed it.

I notice that the patch for WAL receiver doesn't require
explicitly
computing the sync statistics but does require computing the
write
statistics. This is because of the presence of
issue_xlog_fsync but
absence of an equivalent pg_xlog_pwrite. Additionally, I
observe that
the XLogWrite code path calls pgstat_report_wait_*() while the
WAL
receiver path does not. It seems technically straight-forward
to
refactor here to avoid the almost-duplicated logic in the two
places,
though I suspect there may be a trade-off for not adding
another
function call to the stack given the importance of WAL
processing
(though that seems marginalized compared to the cost of
actually
writing the WAL). Or, as Fujii noted, go the other way and
don't have
any shared code between the two but instead implement the WAL
receiver
one to use pg_stat_wal_receiver instead. In either case, this
half-and-half implementation seems undesirable.

OK, as Fujii-san mentioned, I separated the WAL receiver stats.
(v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)

Thanks for updating the patches!

I added the infrastructure code to communicate the WAL receiver
stats messages between the WAL receiver and the stats
collector, and
the stats for WAL receiver is counted in pg_stat_wal_receiver.
What do you think?

On second thought, this idea seems not good. Because those stats
are
collected between multiple walreceivers, but other values in
pg_stat_wal_receiver is only related to the walreceiver process
running
at that moment. IOW, it seems strange that some values show
dynamic
stats and the others show collected stats, even though they are
in
the same view pg_stat_wal_receiver. Thought?

OK, I fixed it.
The stats collected in the WAL receiver is exposed in pg_stat_wal
view in v11 patch.

Thanks for updating the patches! I'm now reading 001 patch.
+    /* Check whether the WAL file was synced to disk right now */
+    if (enableFsync &&
+        (sync_method == SYNC_METHOD_FSYNC ||
+         sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
+         sync_method == SYNC_METHOD_FDATASYNC))
+    {
Isn't it better to make issue_xlog_fsync() return immediately
if enableFsync is off, sync_method is open_sync or open_data_sync,
to simplify the code more?
Thanks for the comments.
I added the above code in v12 patch.
+        /*
+         * Send WAL statistics only if WalWriterDelay has elapsed 
to minimize
+         * the overhead in WAL-writing.
+         */
+        if (rc & WL_TIMEOUT)
+            pgstat_send_wal();
On second thought, this change means that it always takes
wal_writer_delay
before walwriter's WAL stats is sent after XLogBackgroundFlush()
is called.
For example, if wal_writer_delay is set to several seconds, some
values in
pg_stat_wal would be not up-to-date meaninglessly for those
seconds.
So I'm thinking to withdraw my previous comment and it's ok to
send
the stats every after XLogBackgroundFlush() is called. Thought?
Thanks, I didn't notice that.

Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's
default value is 200msec and it may be set shorter time.
Yeah, if wal_writer_delay is set to very small value, there is a risk
that the WAL stats are sent too frequently. I agree that's a problem.
Why don't to make another way to check the timestamp?
+               /*
+                * Don't send a message unless it's been at least
PGSTAT_STAT_INTERVAL
+                * msec since we last sent one
+                */
+               now = GetCurrentTimestamp();
+               if (TimestampDifferenceExceeds(last_report, now,
PGSTAT_STAT_INTERVAL))
+               {
+                       pgstat_send_wal();
+                       last_report = now;
+               }
+
Although I worried that it's better to add the check code in
pgstat_send_wal(),
Agreed.

I didn't do so because to avoid to double check
PGSTAT_STAT_INTERVAL.
pgstat_send_wal() is invoked pg_report_stat() and it already checks
the
PGSTAT_STAT_INTERVAL.

I think that we can do that. What about the attached patch?
Thanks, I thought it's better.
I forgot to remove an unused variable.
The attached v13 patch is fixed.

Thanks for updating the patch!
+        w.wal_write,
+        w.wal_write_time,
+        w.wal_sync,
+        w.wal_sync_time,
It's more natural to put wal_write_time and wal_sync_time next to
each other? That is, what about the following order of columns?

wal_write
wal_sync
wal_write_time
wal_sync_time
Yes, I fixed it.
-        case SYNC_METHOD_OPEN:
-        case SYNC_METHOD_OPEN_DSYNC:
-            /* write synced it already */
-            break;

IMO it's better to add Assert(false) here to ensure that we never
reach
here, as follows. Thought?
+        case SYNC_METHOD_OPEN:
+        case SYNC_METHOD_OPEN_DSYNC:
+            /* not reachable */
+            Assert(false);
I agree.

Even when a backend exits, it sends the stats via
pgstat_beshutdown_hook().
On the other hand, walwriter doesn't do that. Walwriter also should
send
the stats even at its exit? Otherwise some stats can fail to be
collected.
But ISTM that this issue existed from before, for example
checkpointer
doesn't call pgstat_send_bgwriter() at its exit, so it's overkill to
fix
this issue in this patch?

Thanks, I thought it's better to do so.
I added the shutdown hook for the walwriter and the checkpointer in
v14-0003 patch.
Thanks for 0003 patch!

Isn't it overkill to send the stats in the walwriter-exit-callback? IMO
we can
just send the stats only when ShutdownRequestPending is true in the
walwriter
main loop (maybe just before calling HandleMainLoopInterrupts()).
If we do this, we cannot send the stats when walwriter throws FATAL
error.
But that's ok because FATAL error on walwriter causes the server to
crash.
Thought?

Thanks for your comments!
Yes, I agree.

Also ISTM that we don't need to use the callback for that purpose in
checkpointer because of the same reason. That is, we can send the stats
just after calling ShutdownXLOG(0, 0) in
HandleCheckpointerInterrupts().
Thought?

Yes, I think so too.

Since ShutdownXLOG() may create restartpoint or checkpoint,
it might generate WAL records.

I'm now not sure how much useful these changes are. As far as I read
pgstat.c,
when shutdown is requested, the stats collector seems to exit even when
there are outstanding stats messages. So if checkpointer and walwriter
send
the stats in their last cycles, those stats might not be collected.

On the other hand, I can think that sending the stats in the last
cycles would
improve the situation a bit than now. So I'm inclined to apply those
changes...

I didn't notice that. I agree this is an important aspect.
I understood there is a case that the stats collector exits before the
checkpointer
or the walwriter exits and some stats might not be collected.

Of course, there is another direction; we can improve the stats
collector so
that it guarantees to collect all the sent stats messages. But I'm
afraid
this change might be big.

For example, implement to manage background process status in shared
memory and
the stats collector collects the stats until another background process
exits?

In my understanding, the statistics are not required high accuracy,
it's ok to ignore them if the impact is not big.

If we guarantee high accuracy, another background process like
autovacuum launcher
must send the WAL stats because it accesses the system catalog and might
generate
WAL records due to HOT update even though the possibility is low.

I thought the impact is small because the time uncollected stats are
generated is
short compared to the time from startup. So, it's ok to ignore the
remaining stats
when the process exists.

BTW, I found BgWriterStats.m_timed_checkpoints is not counted in
ShutdownLOG()
and we need to count it if to collect stats before it exits.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

#52

Fujii Masao

masao.fujii@oss.nttdata.com

almost 5 years ago

In reply to: Masahiro Ikeda (#51)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021/03/10 14:11, Masahiro Ikeda wrote:

On 2021-03-09 17:51, Fujii Masao wrote:
On 2021/03/05 8:38, Masahiro Ikeda wrote:
On 2021-03-05 01:02, Fujii Masao wrote:
On 2021/03/04 16:14, Masahiro Ikeda wrote:
On 2021-03-03 20:27, Masahiro Ikeda wrote:
On 2021-03-03 16:30, Fujii Masao wrote:
On 2021/03/03 14:33, Masahiro Ikeda wrote:

On 2021-02-24 16:14, Fujii Masao wrote:

On 2021/02/15 11:59, Masahiro Ikeda wrote:

On 2021-02-10 00:51, David G. Johnston wrote:

On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

I pgindented the patches.

... <function>XLogWrite</function>, which is invoked during an
<function>XLogFlush</function> request (see ...). This is also
incremented by the WAL receiver during replication.

("which normally called" should be "which is normally called" or
"which normally is called" if you want to keep true to the original)
You missed the adding the space before an opening parenthesis here and
elsewhere (probably copy-paste)

is ether -> is either
"This parameter is off by default as it will repeatedly query the
operating system..."
", because" -> "as"

Thanks, I fixed them.

wal_write_time and the sync items also need the note: "This is also
incremented by the WAL receiver during replication."

I skipped changing it since I separated the stats for the WAL receiver
in pg_stat_wal_receiver.

"The number of times it happened..." -> " (the tally of this event is
reported in wal_buffers_full in....) This is undesirable because ..."

Thanks, I fixed it.

I notice that the patch for WAL receiver doesn't require explicitly
computing the sync statistics but does require computing the write
statistics. This is because of the presence of issue_xlog_fsync but
absence of an equivalent pg_xlog_pwrite. Additionally, I observe that
the XLogWrite code path calls pgstat_report_wait_*() while the WAL
receiver path does not. It seems technically straight-forward to
refactor here to avoid the almost-duplicated logic in the two places,
though I suspect there may be a trade-off for not adding another
function call to the stack given the importance of WAL processing
(though that seems marginalized compared to the cost of actually
writing the WAL). Or, as Fujii noted, go the other way and don't have
any shared code between the two but instead implement the WAL receiver
one to use pg_stat_wal_receiver instead. In either case, this
half-and-half implementation seems undesirable.

OK, as Fujii-san mentioned, I separated the WAL receiver stats.
(v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)

Thanks for updating the patches!

I added the infrastructure code to communicate the WAL receiver stats messages between the WAL receiver and the stats collector, and
the stats for WAL receiver is counted in pg_stat_wal_receiver.
What do you think?

On second thought, this idea seems not good. Because those stats are
collected between multiple walreceivers, but other values in
pg_stat_wal_receiver is only related to the walreceiver process running
at that moment. IOW, it seems strange that some values show dynamic
stats and the others show collected stats, even though they are in
the same view pg_stat_wal_receiver. Thought?

OK, I fixed it.
The stats collected in the WAL receiver is exposed in pg_stat_wal view in v11 patch.

Thanks for updating the patches! I'm now reading 001 patch.
+    /* Check whether the WAL file was synced to disk right now */
+    if (enableFsync &&
+        (sync_method == SYNC_METHOD_FSYNC ||
+         sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
+         sync_method == SYNC_METHOD_FDATASYNC))
+    {
Isn't it better to make issue_xlog_fsync() return immediately
if enableFsync is off, sync_method is open_sync or open_data_sync,
to simplify the code more?
Thanks for the comments.
I added the above code in v12 patch.
+        /*
+         * Send WAL statistics only if WalWriterDelay has elapsed to minimize
+         * the overhead in WAL-writing.
+         */
+        if (rc & WL_TIMEOUT)
+            pgstat_send_wal();
On second thought, this change means that it always takes wal_writer_delay
before walwriter's WAL stats is sent after XLogBackgroundFlush() is called.
For example, if wal_writer_delay is set to several seconds, some values in
pg_stat_wal would be not up-to-date meaninglessly for those seconds.
So I'm thinking to withdraw my previous comment and it's ok to send
the stats every after XLogBackgroundFlush() is called. Thought?
Thanks, I didn't notice that.

Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's
default value is 200msec and it may be set shorter time.
Yeah, if wal_writer_delay is set to very small value, there is a risk
that the WAL stats are sent too frequently. I agree that's a problem.
Why don't to make another way to check the timestamp?
+               /*
+                * Don't send a message unless it's been at least
PGSTAT_STAT_INTERVAL
+                * msec since we last sent one
+                */
+               now = GetCurrentTimestamp();
+               if (TimestampDifferenceExceeds(last_report, now,
PGSTAT_STAT_INTERVAL))
+               {
+                       pgstat_send_wal();
+                       last_report = now;
+               }
+
Although I worried that it's better to add the check code in pgstat_send_wal(),
Agreed.

I didn't do so because to avoid to double check PGSTAT_STAT_INTERVAL.
pgstat_send_wal() is invoked pg_report_stat() and it already checks the
PGSTAT_STAT_INTERVAL.

I think that we can do that. What about the attached patch?
Thanks, I thought it's better.
I forgot to remove an unused variable.
The attached v13 patch is fixed.

Thanks for updating the patch!
+        w.wal_write,
+        w.wal_write_time,
+        w.wal_sync,
+        w.wal_sync_time,
It's more natural to put wal_write_time and wal_sync_time next to
each other? That is, what about the following order of columns?

wal_write
wal_sync
wal_write_time
wal_sync_time
Yes, I fixed it.
-        case SYNC_METHOD_OPEN:
-        case SYNC_METHOD_OPEN_DSYNC:
-            /* write synced it already */
-            break;

IMO it's better to add Assert(false) here to ensure that we never reach
here, as follows. Thought?
+        case SYNC_METHOD_OPEN:
+        case SYNC_METHOD_OPEN_DSYNC:
+            /* not reachable */
+            Assert(false);
I agree.

Even when a backend exits, it sends the stats via pgstat_beshutdown_hook().
On the other hand, walwriter doesn't do that. Walwriter also should send
the stats even at its exit? Otherwise some stats can fail to be collected.
But ISTM that this issue existed from before, for example checkpointer
doesn't call pgstat_send_bgwriter() at its exit, so it's overkill to fix
this issue in this patch?

Thanks, I thought it's better to do so.
I added the shutdown hook for the walwriter and the checkpointer in v14-0003 patch.
Thanks for 0003 patch!

Isn't it overkill to send the stats in the walwriter-exit-callback? IMO we can
just send the stats only when ShutdownRequestPending is true in the walwriter
main loop (maybe just before calling HandleMainLoopInterrupts()).
If we do this, we cannot send the stats when walwriter throws FATAL error.
But that's ok because FATAL error on walwriter causes the server to crash.
Thought?
Thanks for your comments!
Yes, I agree.

Also ISTM that we don't need to use the callback for that purpose in
checkpointer because of the same reason. That is, we can send the stats
just after calling ShutdownXLOG(0, 0) in HandleCheckpointerInterrupts().
Thought?

Yes, I think so too.

Since ShutdownXLOG() may create restartpoint or checkpoint,
it might generate WAL records.

I'm now not sure how much useful these changes are. As far as I read pgstat.c,
when shutdown is requested, the stats collector seems to exit even when
there are outstanding stats messages. So if checkpointer and walwriter send
the stats in their last cycles, those stats might not be collected.

On the other hand, I can think that sending the stats in the last cycles would
improve the situation a bit than now. So I'm inclined to apply those changes...

I didn't notice that. I agree this is an important aspect.
I understood there is a case that the stats collector exits before the checkpointer
or the walwriter exits and some stats might not be collected.

IIUC the stats collector basically exits after checkpointer and walwriter exit.
But there seems no guarantee that the stats collector processes
all the messages that other processes have sent during the shutdown of
the server.

Of course, there is another direction; we can improve the stats collector so
that it guarantees to collect all the sent stats messages. But I'm afraid
this change might be big.

For example, implement to manage background process status in shared memory and
the stats collector collects the stats until another background process exits?

In my understanding, the statistics are not required high accuracy,
it's ok to ignore them if the impact is not big.

If we guarantee high accuracy, another background process like autovacuum launcher
must send the WAL stats because it accesses the system catalog and might generate
WAL records due to HOT update even though the possibility is low.

I thought the impact is small because the time uncollected stats are generated is
short compared to the time from startup. So, it's ok to ignore the remaining stats
when the process exists.

I agree that it's not worth changing lots of code to collect such stats.
But if we can implement that very simply, isn't it more worth doing
that than current situation because we may be able to collect more
accurate stats.

BTW, I found BgWriterStats.m_timed_checkpoints is not counted in ShutdownLOG()
and we need to count it if to collect stats before it exits.

Maybe m_requested_checkpoints should be incremented in that case?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#53

Masahiro Ikeda

ikedamsh@oss.nttdata.com

almost 5 years ago

In reply to: Fujii Masao (#52)

5 attachment(s)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021-03-10 17:08, Fujii Masao wrote:

On 2021/03/10 14:11, Masahiro Ikeda wrote:
On 2021-03-09 17:51, Fujii Masao wrote:
On 2021/03/05 8:38, Masahiro Ikeda wrote:
On 2021-03-05 01:02, Fujii Masao wrote:
On 2021/03/04 16:14, Masahiro Ikeda wrote:
On 2021-03-03 20:27, Masahiro Ikeda wrote:
On 2021-03-03 16:30, Fujii Masao wrote:
On 2021/03/03 14:33, Masahiro Ikeda wrote:

On 2021-02-24 16:14, Fujii Masao wrote:

On 2021/02/15 11:59, Masahiro Ikeda wrote:

On 2021-02-10 00:51, David G. Johnston wrote:

On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

I pgindented the patches.

... <function>XLogWrite</function>, which is invoked during
an
<function>XLogFlush</function> request (see ...). This is
also
incremented by the WAL receiver during replication.

("which normally called" should be "which is normally
called" or
"which normally is called" if you want to keep true to the
original)
You missed the adding the space before an opening
parenthesis here and
elsewhere (probably copy-paste)

is ether -> is either
"This parameter is off by default as it will repeatedly
query the
operating system..."
", because" -> "as"

Thanks, I fixed them.

wal_write_time and the sync items also need the note: "This
is also
incremented by the WAL receiver during replication."

I skipped changing it since I separated the stats for the WAL
receiver
in pg_stat_wal_receiver.

"The number of times it happened..." -> " (the tally of this
event is
reported in wal_buffers_full in....) This is undesirable
because ..."

Thanks, I fixed it.

I notice that the patch for WAL receiver doesn't require
explicitly
computing the sync statistics but does require computing the
write
statistics. This is because of the presence of
issue_xlog_fsync but
absence of an equivalent pg_xlog_pwrite. Additionally, I
observe that
the XLogWrite code path calls pgstat_report_wait_*() while
the WAL
receiver path does not. It seems technically
straight-forward to
refactor here to avoid the almost-duplicated logic in the
two places,
though I suspect there may be a trade-off for not adding
another
function call to the stack given the importance of WAL
processing
(though that seems marginalized compared to the cost of
actually
writing the WAL). Or, as Fujii noted, go the other way and
don't have
any shared code between the two but instead implement the
WAL receiver
one to use pg_stat_wal_receiver instead. In either case,
this
half-and-half implementation seems undesirable.

OK, as Fujii-san mentioned, I separated the WAL receiver
stats.
(v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)

Thanks for updating the patches!

I added the infrastructure code to communicate the WAL
receiver stats messages between the WAL receiver and the
stats collector, and
the stats for WAL receiver is counted in
pg_stat_wal_receiver.
What do you think?

On second thought, this idea seems not good. Because those
stats are
collected between multiple walreceivers, but other values in
pg_stat_wal_receiver is only related to the walreceiver
process running
at that moment. IOW, it seems strange that some values show
dynamic
stats and the others show collected stats, even though they
are in
the same view pg_stat_wal_receiver. Thought?

OK, I fixed it.
The stats collected in the WAL receiver is exposed in
pg_stat_wal view in v11 patch.

Thanks for updating the patches! I'm now reading 001 patch.
+    /* Check whether the WAL file was synced to disk right now 
*/
+    if (enableFsync &&
+        (sync_method == SYNC_METHOD_FSYNC ||
+         sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
+         sync_method == SYNC_METHOD_FDATASYNC))
+    {
Isn't it better to make issue_xlog_fsync() return immediately
if enableFsync is off, sync_method is open_sync or
open_data_sync,
to simplify the code more?
Thanks for the comments.
I added the above code in v12 patch.
+        /*
+         * Send WAL statistics only if WalWriterDelay has 
elapsed to minimize
+         * the overhead in WAL-writing.
+         */
+        if (rc & WL_TIMEOUT)
+            pgstat_send_wal();
On second thought, this change means that it always takes
wal_writer_delay
before walwriter's WAL stats is sent after XLogBackgroundFlush()
is called.
For example, if wal_writer_delay is set to several seconds, some
values in
pg_stat_wal would be not up-to-date meaninglessly for those
seconds.
So I'm thinking to withdraw my previous comment and it's ok to
send
the stats every after XLogBackgroundFlush() is called. Thought?
Thanks, I didn't notice that.

Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's
default value is 200msec and it may be set shorter time.
Yeah, if wal_writer_delay is set to very small value, there is a
risk
that the WAL stats are sent too frequently. I agree that's a
problem.
Why don't to make another way to check the timestamp?
+               /*
+                * Don't send a message unless it's been at least
PGSTAT_STAT_INTERVAL
+                * msec since we last sent one
+                */
+               now = GetCurrentTimestamp();
+               if (TimestampDifferenceExceeds(last_report, now,
PGSTAT_STAT_INTERVAL))
+               {
+                       pgstat_send_wal();
+                       last_report = now;
+               }
+
Although I worried that it's better to add the check code in
pgstat_send_wal(),
Agreed.

I didn't do so because to avoid to double check
PGSTAT_STAT_INTERVAL.
pgstat_send_wal() is invoked pg_report_stat() and it already
checks the
PGSTAT_STAT_INTERVAL.

I think that we can do that. What about the attached patch?
Thanks, I thought it's better.
I forgot to remove an unused variable.
The attached v13 patch is fixed.

Thanks for updating the patch!
+        w.wal_write,
+        w.wal_write_time,
+        w.wal_sync,
+        w.wal_sync_time,
It's more natural to put wal_write_time and wal_sync_time next to
each other? That is, what about the following order of columns?

wal_write
wal_sync
wal_write_time
wal_sync_time
Yes, I fixed it.
-        case SYNC_METHOD_OPEN:
-        case SYNC_METHOD_OPEN_DSYNC:
-            /* write synced it already */
-            break;

IMO it's better to add Assert(false) here to ensure that we never
reach
here, as follows. Thought?
+        case SYNC_METHOD_OPEN:
+        case SYNC_METHOD_OPEN_DSYNC:
+            /* not reachable */
+            Assert(false);
I agree.

Even when a backend exits, it sends the stats via
pgstat_beshutdown_hook().
On the other hand, walwriter doesn't do that. Walwriter also should
send
the stats even at its exit? Otherwise some stats can fail to be
collected.
But ISTM that this issue existed from before, for example
checkpointer
doesn't call pgstat_send_bgwriter() at its exit, so it's overkill
to fix
this issue in this patch?

Thanks, I thought it's better to do so.
I added the shutdown hook for the walwriter and the checkpointer in
v14-0003 patch.
Thanks for 0003 patch!

Isn't it overkill to send the stats in the walwriter-exit-callback?
IMO we can
just send the stats only when ShutdownRequestPending is true in the
walwriter
main loop (maybe just before calling HandleMainLoopInterrupts()).
If we do this, we cannot send the stats when walwriter throws FATAL
error.
But that's ok because FATAL error on walwriter causes the server to
crash.
Thought?
Thanks for your comments!
Yes, I agree.

Also ISTM that we don't need to use the callback for that purpose in
checkpointer because of the same reason. That is, we can send the
stats
just after calling ShutdownXLOG(0, 0) in
HandleCheckpointerInterrupts().
Thought?

Yes, I think so too.

Since ShutdownXLOG() may create restartpoint or checkpoint,
it might generate WAL records.

I'm now not sure how much useful these changes are. As far as I read
pgstat.c,
when shutdown is requested, the stats collector seems to exit even
when
there are outstanding stats messages. So if checkpointer and
walwriter send
the stats in their last cycles, those stats might not be collected.

On the other hand, I can think that sending the stats in the last
cycles would
improve the situation a bit than now. So I'm inclined to apply those
changes...

I didn't notice that. I agree this is an important aspect.
I understood there is a case that the stats collector exits before the
checkpointer
or the walwriter exits and some stats might not be collected.
IIUC the stats collector basically exits after checkpointer and
walwriter exit.
But there seems no guarantee that the stats collector processes
all the messages that other processes have sent during the shutdown of
the server.

Thanks, I understood the above postmaster behaviors.

PMState manages the status and after checkpointer is exited, the
postmaster sends
SIGQUIT signal to the stats collector if the shutdown mode is smart or
fast.
(IIUC, although the postmaster kill the walsender, the archiver and
the stats collector at the same time, it's ok because the walsender
and the archiver doesn't send stats to the stats collector now.)

But, there might be a corner case to lose stats sent by background
workers like
the checkpointer before they exit (although this is not implemented
yet.)

For example,

1. checkpointer send the stats before it exit
2. stats collector receive the signal and break before processing
the stats message from checkpointer. In this case, 1's message is
lost.
3. stats collector writes the stats in the statsfiles and exit

Why don't you recheck the coming message is zero just before the 2th
procedure?
(v17-0004-guarantee-to-collect-last-stats-messages.patch)

I measured the timing of the above in my linux laptop using
v17-measure-timing.patch.
I don't have any strong opinion to handle this case since this result
shows to receive and processes
the messages takes too short time (less than 1ms) although the stats
collector receives the shutdown
signal in 5msec(099->104) after the checkpointer process exits.

```
1615421204.556 [checkpointer] DEBUG: received shutdown request signal
1615421208.099 [checkpointer] DEBUG: proc_exit(-1): 0 callbacks to make
# exit and send the messages
1615421208.099 [stats collector] DEBUG: process BGWRITER stats message
# receive and process the messages
1615421208.099 [stats collector] DEBUG: process WAL stats message
1615421208.104 [postmaster] DEBUG: reaping dead processes
1615421208.104 [stats collector] DEBUG: received shutdown request
signal # receive shutdown request from the postmaster
```

Of course, there is another direction; we can improve the stats
collector so
that it guarantees to collect all the sent stats messages. But I'm
afraid
this change might be big.

For example, implement to manage background process status in shared
memory and
the stats collector collects the stats until another background
process exits?

In my understanding, the statistics are not required high accuracy,
it's ok to ignore them if the impact is not big.

If we guarantee high accuracy, another background process like
autovacuum launcher
must send the WAL stats because it accesses the system catalog and
might generate
WAL records due to HOT update even though the possibility is low.

I thought the impact is small because the time uncollected stats are
generated is
short compared to the time from startup. So, it's ok to ignore the
remaining stats
when the process exists.

I agree that it's not worth changing lots of code to collect such
stats.
But if we can implement that very simply, isn't it more worth doing
that than current situation because we may be able to collect more
accurate stats.

Yes, I agree.
I attached the patch to send the stats before the wal writer and the
checkpointer exit.
(v17-0001-send-stats-for-walwriter-when-shutdown.patch,
v17-0002-send-stats-for-checkpointer-when-shutdown.patch)

BTW, I found BgWriterStats.m_timed_checkpoints is not counted in
ShutdownLOG()
and we need to count it if to collect stats before it exits.

Maybe m_requested_checkpoints should be incremented in that case?

I thought this should be incremented
because it invokes the methods with CHECKPOINT_IS_SHUTDOWN.

```ShutdownXLOG()
CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
```

I fixed in v17-0002-send-stats-for-checkpointer-when-shutdown.patch.

In addition, I rebased the patch for WAL receiver.
(v17-0003-Makes-the-wal-receiver-report-WAL-statistics.patch)

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachments:

v17-0001-send-stats-for-walwriter-when-shutdown.patchtext/x-diff; name=v17-0001-send-stats-for-walwriter-when-shutdown.patchDownload

diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index 132df29aba..45c8531ac8 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -78,6 +78,9 @@ int			WalWriterFlushAfter = 128;
 #define LOOPS_UNTIL_HIBERNATE		50
 #define HIBERNATE_FACTOR			25
 
+/* Prototypes for private functions */
+static void HandleWalWriterInterrupts(void);
+
 /*
  * Main entry point for walwriter process
  *
@@ -242,7 +245,7 @@ WalWriterMain(void)
 		/* Clear any already-pending wakeups */
 		ResetLatch(MyLatch);
 
-		HandleMainLoopInterrupts();
+		HandleWalWriterInterrupts();
 
 		/*
 		 * Do what we're here for; then, if XLogBackgroundFlush() found useful
@@ -272,3 +275,34 @@ WalWriterMain(void)
 						 WAIT_EVENT_WAL_WRITER_MAIN);
 	}
 }
+
+/*
+ * interrupt handler for main loops of WAL writer processes.
+ */
+static void
+HandleWalWriterInterrupts(void)
+{
+	if (ProcSignalBarrierPending)
+		ProcessProcSignalBarrier();
+
+	if (ConfigReloadPending)
+	{
+		ConfigReloadPending = false;
+		ProcessConfigFile(PGC_SIGHUP);
+	}
+
+	if (ShutdownRequestPending)
+	{
+		/*
+		 * Force to send remaining WAL statistics to the stats collector at
+		 * process exits.
+		 *
+		 * Since pgstat_send_wal is invoked with 'force' is false in main loop
+		 * to avoid overloading to the stats collector, there may exist unsent
+		 * stats counters for the WAL writer.
+		 */
+		pgstat_send_wal(true);
+
+		proc_exit(0);
+	}
+}

v17-0002-send-stats-for-checkpointer-when-shutdown.patchtext/x-diff; name=v17-0002-send-stats-for-checkpointer-when-shutdown.patchDownload

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 3894f4a270..20454d4040 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -168,6 +168,7 @@ static bool IsCheckpointOnSchedule(double progress);
 static bool ImmediateCheckpointRequested(void);
 static bool CompactCheckpointerRequestQueue(void);
 static void UpdateSharedMemoryConfig(void);
+static void pgstat_send_checkpointer(void);
 
 /* Signal handlers */
 static void ReqCheckpointHandler(SIGNAL_ARGS);
@@ -495,17 +496,8 @@ CheckpointerMain(void)
 		/* Check for archive_timeout and switch xlog files if necessary. */
 		CheckArchiveTimeout();
 
-		/*
-		 * Send off activity statistics to the stats collector.  (The reason
-		 * why we re-use bgwriter-related code for this is that the bgwriter
-		 * and checkpointer used to be just one process.  It's probably not
-		 * worth the trouble to split the stats support into two independent
-		 * stats message types.)
-		 */
-		pgstat_send_bgwriter();
-
-		/* Send WAL statistics to the stats collector. */
-		pgstat_report_wal();
+		/* Send the statistics for the checkpointer to the stats collector */
+		pgstat_send_checkpointer();
 
 		/*
 		 * If any checkpoint flags have been set, redo the loop to handle the
@@ -572,8 +564,18 @@ HandleCheckpointerInterrupts(void)
 		 * back to the sigsetjmp block above
 		 */
 		ExitOnAnyError = true;
-		/* Close down the database */
+
+		/*
+		 * Close down the database.
+		 *
+		 * Since ShutdownXLOG() creates restartpoint or checkpoint and updates
+		 * the statistics, increment the checkpoint request and send the
+		 * statistics to the stats collector.
+		 */
+		BgWriterStats.m_requested_checkpoints++;
 		ShutdownXLOG(0, 0);
+		pgstat_send_checkpointer();
+
 		/* Normal exit from the checkpointer is here */
 		proc_exit(0);			/* done */
 	}
@@ -1335,3 +1337,22 @@ FirstCallSinceLastCheckpoint(void)
 
 	return FirstCall;
 }
+
+/*
+ * Send the statistics for the checkpointer to the stats collector
+ */
+static void
+pgstat_send_checkpointer(void)
+{
+	/*
+	 * Send off activity statistics to the stats collector.  (The reason why
+	 * we re-use bgwriter-related code for this is that the bgwriter and
+	 * checkpointer used to be just one process.  It's probably not worth the
+	 * trouble to split the stats support into two independent stats message
+	 * types.)
+	 */
+	pgstat_send_bgwriter();
+
+	/* Send WAL statistics to the stats collector. */
+	pgstat_report_wal();
+}

v17-0003-Makes-the-wal-receiver-report-WAL-statistics.patchtext/x-diff; name=v17-0003-Makes-the-wal-receiver-report-WAL-statistics.patchDownload

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 24c3dd32f8..7bad027162 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2534,7 +2534,6 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			int			written;
-			instr_time	start;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
@@ -2544,28 +2543,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			{
 				errno = 0;
 
-				/* Measure I/O timing to write WAL data */
-				if (track_wal_io_timing)
-					INSTR_TIME_SET_CURRENT(start);
-
-				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
-				written = pg_pwrite(openLogFile, from, nleft, startoffset);
-				pgstat_report_wait_end();
-
-				/*
-				 * Increment the I/O timing and the number of times WAL data
-				 * were written out to disk.
-				 */
-				if (track_wal_io_timing)
-				{
-					instr_time	duration;
-
-					INSTR_TIME_SET_CURRENT(duration);
-					INSTR_TIME_SUBTRACT(duration, start);
-					WalStats.m_wal_write_time += INSTR_TIME_GET_MICROSEC(duration);
-				}
-
-				WalStats.m_wal_write++;
+				written = XLogWriteFile(openLogFile, from, nleft, startoffset);
 
 				if (written <= 0)
 				{
@@ -2705,6 +2683,46 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 	}
 }
 
+/*
+ * Issue pg_pwrite to write an XLOG file.
+ *
+ * 'fd' is a file descriptor for the XLOG file to write
+ * 'buf' is a buffer starting address to write.
+ * 'nbyte' is a number of max bytes to write up.
+ * 'offset' is a offset of XLOG file to be set.
+ */
+int
+XLogWriteFile(int fd, const void *buf, size_t nbyte, off_t offset)
+{
+	int written;
+	instr_time	start;
+
+	/* Measure I/O timing to write WAL data */
+	if (track_wal_io_timing)
+		INSTR_TIME_SET_CURRENT(start);
+
+	pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
+	written = pg_pwrite(fd, buf, nbyte, offset);
+	pgstat_report_wait_end();
+
+	/*
+	 * Increment the I/O timing and the number of times WAL data were
+	 * written out to disk.
+	 */
+	if (track_wal_io_timing)
+	{
+		instr_time	duration;
+
+		INSTR_TIME_SET_CURRENT(duration);
+		INSTR_TIME_SUBTRACT(duration, start);
+		WalStats.m_wal_write_time += INSTR_TIME_GET_MICROSEC(duration);
+	}
+
+	WalStats.m_wal_write++;
+
+	return written;
+}
+
 /*
  * Record the LSN for an asynchronous transaction commit/abort
  * and nudge the WALWriter if there is work for it to do.
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 7810ee916c..3abd8ac93b 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -770,6 +770,9 @@ WalRcvDie(int code, Datum arg)
 	/* Ensure that all WAL records received are flushed to disk */
 	XLogWalRcvFlush(true);
 
+	/* Send WAL statistics to the stats collector before terminating */
+	pgstat_send_wal(true);
+
 	/* Mark ourselves inactive in shared memory */
 	SpinLockAcquire(&walrcv->mutex);
 	Assert(walrcv->walRcvState == WALRCV_STREAMING ||
@@ -907,6 +910,12 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 					XLogArchiveForceDone(xlogfname);
 				else
 					XLogArchiveNotify(xlogfname);
+
+				/*
+				 * Send WAL statistics to the stats collector when finishing
+				 * the current WAL segment file to avoid overloading it.
+				 */
+				pgstat_send_wal(false);
 			}
 			recvFile = -1;
 
@@ -928,7 +937,8 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 		/* OK to write the logs */
 		errno = 0;
 
-		byteswritten = pg_pwrite(recvFile, buf, segbytes, (off_t) startoff);
+		byteswritten = XLogWriteFile(recvFile, buf, segbytes, (off_t) startoff);
+
 		if (byteswritten <= 0)
 		{
 			char		xlogfname[MAXFNAMELEN];
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 1e53d9d4ca..b345de8a28 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -290,6 +290,7 @@ extern bool XLogBackgroundFlush(void);
 extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
 extern int	XLogFileInit(XLogSegNo segno, bool *use_existent, bool use_lock);
 extern int	XLogFileOpen(XLogSegNo segno);
+extern int	XLogWriteFile(int fd, const void *buf, size_t nbyte, off_t offset);
 
 extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
 extern XLogSegNo XLogGetLastRemovedSegno(void);

v17-0004-guarantee-to-collect-last-stats-messages.patchtext/x-diff; name=v17-0004-guarantee-to-collect-last-stats-messages.patchDownload

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 68eefb9722..86b8449193 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -329,6 +329,7 @@ static void pgstat_beshutdown_hook(int code, Datum arg);
 static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
 static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
 												 Oid tableoid, bool create);
+static int	pgstat_process_message(void);
 static void pgstat_write_statsfiles(bool permanent, bool allDbs);
 static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
 static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
@@ -4810,8 +4811,6 @@ pgstat_send_slru(void)
 NON_EXEC_STATIC void
 PgstatCollectorMain(int argc, char *argv[])
 {
-	int			len;
-	PgStat_Msg	msg;
 	int			wr;
 	WaitEvent	event;
 	WaitEventSet *wes;
@@ -4896,158 +4895,10 @@ PgstatCollectorMain(int argc, char *argv[])
 				pgstat_write_statsfiles(false, false);
 
 			/*
-			 * Try to receive and process a message.  This will not block,
-			 * since the socket is set to non-blocking mode.
-			 *
-			 * XXX On Windows, we have to force pgwin32_recv to cooperate,
-			 * despite the previous use of pg_set_noblock() on the socket.
-			 * This is extremely broken and should be fixed someday.
+			 * Try to receive and process a message.
 			 */
-#ifdef WIN32
-			pgwin32_noblock = 1;
-#endif
-
-			len = recv(pgStatSock, (char *) &msg,
-					   sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-			pgwin32_noblock = 0;
-#endif
-
-			if (len < 0)
-			{
-				if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-					break;		/* out of inner loop */
-				ereport(ERROR,
-						(errcode_for_socket_access(),
-						 errmsg("could not read statistics message: %m")));
-			}
-
-			/*
-			 * We ignore messages that are smaller than our common header
-			 */
-			if (len < sizeof(PgStat_MsgHdr))
-				continue;
-
-			/*
-			 * The received length must match the length in the header
-			 */
-			if (msg.msg_hdr.m_size != len)
-				continue;
-
-			/*
-			 * O.K. - we accept this message.  Process it.
-			 */
-			switch (msg.msg_hdr.m_type)
-			{
-				case PGSTAT_MTYPE_DUMMY:
-					break;
-
-				case PGSTAT_MTYPE_INQUIRY:
-					pgstat_recv_inquiry(&msg.msg_inquiry, len);
-					break;
-
-				case PGSTAT_MTYPE_TABSTAT:
-					pgstat_recv_tabstat(&msg.msg_tabstat, len);
-					break;
-
-				case PGSTAT_MTYPE_TABPURGE:
-					pgstat_recv_tabpurge(&msg.msg_tabpurge, len);
-					break;
-
-				case PGSTAT_MTYPE_DROPDB:
-					pgstat_recv_dropdb(&msg.msg_dropdb, len);
-					break;
-
-				case PGSTAT_MTYPE_RESETCOUNTER:
-					pgstat_recv_resetcounter(&msg.msg_resetcounter, len);
-					break;
-
-				case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-					pgstat_recv_resetsharedcounter(&msg.msg_resetsharedcounter,
-												   len);
-					break;
-
-				case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-					pgstat_recv_resetsinglecounter(&msg.msg_resetsinglecounter,
-												   len);
-					break;
-
-				case PGSTAT_MTYPE_RESETSLRUCOUNTER:
-					pgstat_recv_resetslrucounter(&msg.msg_resetslrucounter,
-												 len);
-					break;
-
-				case PGSTAT_MTYPE_RESETREPLSLOTCOUNTER:
-					pgstat_recv_resetreplslotcounter(&msg.msg_resetreplslotcounter,
-													 len);
-					break;
-
-				case PGSTAT_MTYPE_AUTOVAC_START:
-					pgstat_recv_autovac(&msg.msg_autovacuum_start, len);
-					break;
-
-				case PGSTAT_MTYPE_VACUUM:
-					pgstat_recv_vacuum(&msg.msg_vacuum, len);
-					break;
-
-				case PGSTAT_MTYPE_ANALYZE:
-					pgstat_recv_analyze(&msg.msg_analyze, len);
-					break;
-
-				case PGSTAT_MTYPE_ARCHIVER:
-					pgstat_recv_archiver(&msg.msg_archiver, len);
-					break;
-
-				case PGSTAT_MTYPE_BGWRITER:
-					pgstat_recv_bgwriter(&msg.msg_bgwriter, len);
-					break;
-
-				case PGSTAT_MTYPE_WAL:
-					pgstat_recv_wal(&msg.msg_wal, len);
-					break;
-
-				case PGSTAT_MTYPE_SLRU:
-					pgstat_recv_slru(&msg.msg_slru, len);
-					break;
-
-				case PGSTAT_MTYPE_FUNCSTAT:
-					pgstat_recv_funcstat(&msg.msg_funcstat, len);
-					break;
-
-				case PGSTAT_MTYPE_FUNCPURGE:
-					pgstat_recv_funcpurge(&msg.msg_funcpurge, len);
-					break;
-
-				case PGSTAT_MTYPE_RECOVERYCONFLICT:
-					pgstat_recv_recoveryconflict(&msg.msg_recoveryconflict,
-												 len);
-					break;
-
-				case PGSTAT_MTYPE_DEADLOCK:
-					pgstat_recv_deadlock(&msg.msg_deadlock, len);
-					break;
-
-				case PGSTAT_MTYPE_TEMPFILE:
-					pgstat_recv_tempfile(&msg.msg_tempfile, len);
-					break;
-
-				case PGSTAT_MTYPE_CHECKSUMFAILURE:
-					pgstat_recv_checksum_failure(&msg.msg_checksumfailure,
-												 len);
-					break;
-
-				case PGSTAT_MTYPE_REPLSLOT:
-					pgstat_recv_replslot(&msg.msg_replslot, len);
-					break;
-
-				case PGSTAT_MTYPE_CONNECTION:
-					pgstat_recv_connstat(&msg.msg_conn, len);
-					break;
-
-				default:
-					break;
-			}
+			if (pgstat_process_message() < 0)
+				break;			/* If an error occurred, go out of inner loop */
 		}						/* end of inner message-processing loop */
 
 		/* Sleep until there's something to do */
@@ -5077,6 +4928,21 @@ PgstatCollectorMain(int argc, char *argv[])
 			break;
 	}							/* end of outer loop */
 
+	/*
+	 * Try to receive and process remaining messages before the process exits.
+	 *
+	 * The reason is that there is no guarantee all messages were processed in
+	 * the above loop even though the stats collector is sent SIGQUIT signal
+	 * by the postmaster after other backend and background processes, which
+	 * sent their stats to the stats collector, exit if shutdown mode is smart
+	 * or fast.
+	 *
+	 * For example, there might be a case that messages are lost when there
+	 * are unprocessed messages, the postmaster send SIGQUIT signal to the
+	 * stats collector.
+	 */
+	while (pgstat_process_message() > 0);
+
 	/*
 	 * Save the final stats to reuse at next startup.
 	 */
@@ -5225,6 +5091,169 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
 	return result;
 }
 
+/*
+ * Try to receive and process a message.  This will not block,
+ * since the socket is set to non-blocking mode.
+ *
+ * XXX On Windows, we have to force pgwin32_recv to cooperate,
+ * despite the previous use of pg_set_noblock() on the socket.
+ * This is extremely broken and should be fixed someday.
+ *
+ * Return the number of processed message. -1 if an error occurred.
+ */
+static int
+pgstat_process_message()
+{
+	int			len;
+	PgStat_Msg	msg;
+
+#ifdef WIN32
+	pgwin32_noblock = 1;
+#endif
+
+	len = recv(pgStatSock, (char *) &msg, sizeof(PgStat_Msg), 0);
+
+#ifdef WIN32
+	pgwin32_noblock = 0;
+#endif
+
+	if (len < 0)
+	{
+		if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
+			return -1;
+		ereport(ERROR,
+				(errcode_for_socket_access(),
+				 errmsg("could not read statistics message: %m")));
+	}
+
+	/*
+	 * We ignore messages that are smaller than our common header
+	 */
+	if (len < sizeof(PgStat_MsgHdr))
+		return 0;
+
+	/*
+	 * The received length must match the length in the header
+	 */
+	if (msg.msg_hdr.m_size != len)
+		return 0;
+
+	/*
+	 * O.K. - we accept this message.  Process it.
+	 */
+	switch (msg.msg_hdr.m_type)
+	{
+		case PGSTAT_MTYPE_DUMMY:
+			break;
+
+		case PGSTAT_MTYPE_INQUIRY:
+			pgstat_recv_inquiry(&msg.msg_inquiry, len);
+			break;
+
+		case PGSTAT_MTYPE_TABSTAT:
+			pgstat_recv_tabstat(&msg.msg_tabstat, len);
+			break;
+
+		case PGSTAT_MTYPE_TABPURGE:
+			pgstat_recv_tabpurge(&msg.msg_tabpurge, len);
+			break;
+
+		case PGSTAT_MTYPE_DROPDB:
+			pgstat_recv_dropdb(&msg.msg_dropdb, len);
+			break;
+
+		case PGSTAT_MTYPE_RESETCOUNTER:
+			pgstat_recv_resetcounter(&msg.msg_resetcounter, len);
+			break;
+
+		case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
+			pgstat_recv_resetsharedcounter(&msg.msg_resetsharedcounter,
+										   len);
+			break;
+
+		case PGSTAT_MTYPE_RESETSINGLECOUNTER:
+			pgstat_recv_resetsinglecounter(&msg.msg_resetsinglecounter,
+										   len);
+			break;
+
+		case PGSTAT_MTYPE_RESETSLRUCOUNTER:
+			pgstat_recv_resetslrucounter(&msg.msg_resetslrucounter,
+										 len);
+			break;
+
+		case PGSTAT_MTYPE_RESETREPLSLOTCOUNTER:
+			pgstat_recv_resetreplslotcounter(&msg.msg_resetreplslotcounter,
+											 len);
+			break;
+
+		case PGSTAT_MTYPE_AUTOVAC_START:
+			pgstat_recv_autovac(&msg.msg_autovacuum_start, len);
+			break;
+
+		case PGSTAT_MTYPE_VACUUM:
+			pgstat_recv_vacuum(&msg.msg_vacuum, len);
+			break;
+
+		case PGSTAT_MTYPE_ANALYZE:
+			pgstat_recv_analyze(&msg.msg_analyze, len);
+			break;
+
+		case PGSTAT_MTYPE_ARCHIVER:
+			pgstat_recv_archiver(&msg.msg_archiver, len);
+			break;
+
+		case PGSTAT_MTYPE_BGWRITER:
+			pgstat_recv_bgwriter(&msg.msg_bgwriter, len);
+			break;
+
+		case PGSTAT_MTYPE_WAL:
+			pgstat_recv_wal(&msg.msg_wal, len);
+			break;
+
+		case PGSTAT_MTYPE_SLRU:
+			pgstat_recv_slru(&msg.msg_slru, len);
+			break;
+
+		case PGSTAT_MTYPE_FUNCSTAT:
+			pgstat_recv_funcstat(&msg.msg_funcstat, len);
+			break;
+
+		case PGSTAT_MTYPE_FUNCPURGE:
+			pgstat_recv_funcpurge(&msg.msg_funcpurge, len);
+			break;
+
+		case PGSTAT_MTYPE_RECOVERYCONFLICT:
+			pgstat_recv_recoveryconflict(&msg.msg_recoveryconflict,
+										 len);
+			break;
+
+		case PGSTAT_MTYPE_DEADLOCK:
+			pgstat_recv_deadlock(&msg.msg_deadlock, len);
+			break;
+
+		case PGSTAT_MTYPE_TEMPFILE:
+			pgstat_recv_tempfile(&msg.msg_tempfile, len);
+			break;
+
+		case PGSTAT_MTYPE_CHECKSUMFAILURE:
+			pgstat_recv_checksum_failure(&msg.msg_checksumfailure,
+										 len);
+			break;
+
+		case PGSTAT_MTYPE_REPLSLOT:
+			pgstat_recv_replslot(&msg.msg_replslot, len);
+			break;
+
+		case PGSTAT_MTYPE_CONNECTION:
+			pgstat_recv_connstat(&msg.msg_conn, len);
+			break;
+
+		default:
+			break;
+	}
+
+	return 1;
+}
 
 /* ----------
  * pgstat_write_statsfiles() -

v17-measure-timing.patchtext/x-diff; name=v17-measure-timing.patchDownload

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 24c3dd32f8..eed778b779 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2534,7 +2534,6 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			int			written;
-			instr_time	start;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
@@ -2544,28 +2543,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			{
 				errno = 0;
 
-				/* Measure I/O timing to write WAL data */
-				if (track_wal_io_timing)
-					INSTR_TIME_SET_CURRENT(start);
-
-				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
-				written = pg_pwrite(openLogFile, from, nleft, startoffset);
-				pgstat_report_wait_end();
-
-				/*
-				 * Increment the I/O timing and the number of times WAL data
-				 * were written out to disk.
-				 */
-				if (track_wal_io_timing)
-				{
-					instr_time	duration;
-
-					INSTR_TIME_SET_CURRENT(duration);
-					INSTR_TIME_SUBTRACT(duration, start);
-					WalStats.m_wal_write_time += INSTR_TIME_GET_MICROSEC(duration);
-				}
-
-				WalStats.m_wal_write++;
+				written = XLogWriteFile(openLogFile, from, nleft, startoffset);
 
 				if (written <= 0)
 				{
@@ -2705,6 +2683,46 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 	}
 }
 
+/*
+ * Issue pg_pwrite to write an XLOG file.
+ *
+ * 'fd' is a file descriptor for the XLOG file to write
+ * 'buf' is a buffer starting address to write.
+ * 'nbyte' is a number of max bytes to write up.
+ * 'offset' is a offset of XLOG file to be set.
+ */
+int
+XLogWriteFile(int fd, const void *buf, size_t nbyte, off_t offset)
+{
+	int			written;
+	instr_time	start;
+
+	/* Measure I/O timing to write WAL data */
+	if (track_wal_io_timing)
+		INSTR_TIME_SET_CURRENT(start);
+
+	pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
+	written = pg_pwrite(fd, buf, nbyte, offset);
+	pgstat_report_wait_end();
+
+	/*
+	 * Increment the I/O timing and the number of times WAL data were written
+	 * out to disk.
+	 */
+	if (track_wal_io_timing)
+	{
+		instr_time	duration;
+
+		INSTR_TIME_SET_CURRENT(duration);
+		INSTR_TIME_SUBTRACT(duration, start);
+		WalStats.m_wal_write_time += INSTR_TIME_GET_MICROSEC(duration);
+	}
+
+	WalStats.m_wal_write++;
+
+	return written;
+}
+
 /*
  * Record the LSN for an asynchronous transaction commit/abort
  * and nudge the WALWriter if there is work for it to do.
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 3894f4a270..20454d4040 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -168,6 +168,7 @@ static bool IsCheckpointOnSchedule(double progress);
 static bool ImmediateCheckpointRequested(void);
 static bool CompactCheckpointerRequestQueue(void);
 static void UpdateSharedMemoryConfig(void);
+static void pgstat_send_checkpointer(void);
 
 /* Signal handlers */
 static void ReqCheckpointHandler(SIGNAL_ARGS);
@@ -495,17 +496,8 @@ CheckpointerMain(void)
 		/* Check for archive_timeout and switch xlog files if necessary. */
 		CheckArchiveTimeout();
 
-		/*
-		 * Send off activity statistics to the stats collector.  (The reason
-		 * why we re-use bgwriter-related code for this is that the bgwriter
-		 * and checkpointer used to be just one process.  It's probably not
-		 * worth the trouble to split the stats support into two independent
-		 * stats message types.)
-		 */
-		pgstat_send_bgwriter();
-
-		/* Send WAL statistics to the stats collector. */
-		pgstat_report_wal();
+		/* Send the statistics for the checkpointer to the stats collector */
+		pgstat_send_checkpointer();
 
 		/*
 		 * If any checkpoint flags have been set, redo the loop to handle the
@@ -572,8 +564,18 @@ HandleCheckpointerInterrupts(void)
 		 * back to the sigsetjmp block above
 		 */
 		ExitOnAnyError = true;
-		/* Close down the database */
+
+		/*
+		 * Close down the database.
+		 *
+		 * Since ShutdownXLOG() creates restartpoint or checkpoint and updates
+		 * the statistics, increment the checkpoint request and send the
+		 * statistics to the stats collector.
+		 */
+		BgWriterStats.m_requested_checkpoints++;
 		ShutdownXLOG(0, 0);
+		pgstat_send_checkpointer();
+
 		/* Normal exit from the checkpointer is here */
 		proc_exit(0);			/* done */
 	}
@@ -1335,3 +1337,22 @@ FirstCallSinceLastCheckpoint(void)
 
 	return FirstCall;
 }
+
+/*
+ * Send the statistics for the checkpointer to the stats collector
+ */
+static void
+pgstat_send_checkpointer(void)
+{
+	/*
+	 * Send off activity statistics to the stats collector.  (The reason why
+	 * we re-use bgwriter-related code for this is that the bgwriter and
+	 * checkpointer used to be just one process.  It's probably not worth the
+	 * trouble to split the stats support into two independent stats message
+	 * types.)
+	 */
+	pgstat_send_bgwriter();
+
+	/* Send WAL statistics to the stats collector. */
+	pgstat_report_wal();
+}
diff --git a/src/backend/postmaster/interrupt.c b/src/backend/postmaster/interrupt.c
index dd9136a942..a50e00f06b 100644
--- a/src/backend/postmaster/interrupt.c
+++ b/src/backend/postmaster/interrupt.c
@@ -108,5 +108,7 @@ SignalHandlerForShutdownRequest(SIGNAL_ARGS)
 	ShutdownRequestPending = true;
 	SetLatch(MyLatch);
 
+	elog(DEBUG3, "received shutdown request signal");
+
 	errno = save_errno;
 }
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 68eefb9722..28fc3a80f4 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -329,6 +329,7 @@ static void pgstat_beshutdown_hook(int code, Datum arg);
 static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
 static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
 												 Oid tableoid, bool create);
+static int	pgstat_process_message(void);
 static void pgstat_write_statsfiles(bool permanent, bool allDbs);
 static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
 static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
@@ -4810,8 +4811,6 @@ pgstat_send_slru(void)
 NON_EXEC_STATIC void
 PgstatCollectorMain(int argc, char *argv[])
 {
-	int			len;
-	PgStat_Msg	msg;
 	int			wr;
 	WaitEvent	event;
 	WaitEventSet *wes;
@@ -4896,158 +4895,10 @@ PgstatCollectorMain(int argc, char *argv[])
 				pgstat_write_statsfiles(false, false);
 
 			/*
-			 * Try to receive and process a message.  This will not block,
-			 * since the socket is set to non-blocking mode.
-			 *
-			 * XXX On Windows, we have to force pgwin32_recv to cooperate,
-			 * despite the previous use of pg_set_noblock() on the socket.
-			 * This is extremely broken and should be fixed someday.
+			 * Try to receive and process a message.
 			 */
-#ifdef WIN32
-			pgwin32_noblock = 1;
-#endif
-
-			len = recv(pgStatSock, (char *) &msg,
-					   sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-			pgwin32_noblock = 0;
-#endif
-
-			if (len < 0)
-			{
-				if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-					break;		/* out of inner loop */
-				ereport(ERROR,
-						(errcode_for_socket_access(),
-						 errmsg("could not read statistics message: %m")));
-			}
-
-			/*
-			 * We ignore messages that are smaller than our common header
-			 */
-			if (len < sizeof(PgStat_MsgHdr))
-				continue;
-
-			/*
-			 * The received length must match the length in the header
-			 */
-			if (msg.msg_hdr.m_size != len)
-				continue;
-
-			/*
-			 * O.K. - we accept this message.  Process it.
-			 */
-			switch (msg.msg_hdr.m_type)
-			{
-				case PGSTAT_MTYPE_DUMMY:
-					break;
-
-				case PGSTAT_MTYPE_INQUIRY:
-					pgstat_recv_inquiry(&msg.msg_inquiry, len);
-					break;
-
-				case PGSTAT_MTYPE_TABSTAT:
-					pgstat_recv_tabstat(&msg.msg_tabstat, len);
-					break;
-
-				case PGSTAT_MTYPE_TABPURGE:
-					pgstat_recv_tabpurge(&msg.msg_tabpurge, len);
-					break;
-
-				case PGSTAT_MTYPE_DROPDB:
-					pgstat_recv_dropdb(&msg.msg_dropdb, len);
-					break;
-
-				case PGSTAT_MTYPE_RESETCOUNTER:
-					pgstat_recv_resetcounter(&msg.msg_resetcounter, len);
-					break;
-
-				case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-					pgstat_recv_resetsharedcounter(&msg.msg_resetsharedcounter,
-												   len);
-					break;
-
-				case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-					pgstat_recv_resetsinglecounter(&msg.msg_resetsinglecounter,
-												   len);
-					break;
-
-				case PGSTAT_MTYPE_RESETSLRUCOUNTER:
-					pgstat_recv_resetslrucounter(&msg.msg_resetslrucounter,
-												 len);
-					break;
-
-				case PGSTAT_MTYPE_RESETREPLSLOTCOUNTER:
-					pgstat_recv_resetreplslotcounter(&msg.msg_resetreplslotcounter,
-													 len);
-					break;
-
-				case PGSTAT_MTYPE_AUTOVAC_START:
-					pgstat_recv_autovac(&msg.msg_autovacuum_start, len);
-					break;
-
-				case PGSTAT_MTYPE_VACUUM:
-					pgstat_recv_vacuum(&msg.msg_vacuum, len);
-					break;
-
-				case PGSTAT_MTYPE_ANALYZE:
-					pgstat_recv_analyze(&msg.msg_analyze, len);
-					break;
-
-				case PGSTAT_MTYPE_ARCHIVER:
-					pgstat_recv_archiver(&msg.msg_archiver, len);
-					break;
-
-				case PGSTAT_MTYPE_BGWRITER:
-					pgstat_recv_bgwriter(&msg.msg_bgwriter, len);
-					break;
-
-				case PGSTAT_MTYPE_WAL:
-					pgstat_recv_wal(&msg.msg_wal, len);
-					break;
-
-				case PGSTAT_MTYPE_SLRU:
-					pgstat_recv_slru(&msg.msg_slru, len);
-					break;
-
-				case PGSTAT_MTYPE_FUNCSTAT:
-					pgstat_recv_funcstat(&msg.msg_funcstat, len);
-					break;
-
-				case PGSTAT_MTYPE_FUNCPURGE:
-					pgstat_recv_funcpurge(&msg.msg_funcpurge, len);
-					break;
-
-				case PGSTAT_MTYPE_RECOVERYCONFLICT:
-					pgstat_recv_recoveryconflict(&msg.msg_recoveryconflict,
-												 len);
-					break;
-
-				case PGSTAT_MTYPE_DEADLOCK:
-					pgstat_recv_deadlock(&msg.msg_deadlock, len);
-					break;
-
-				case PGSTAT_MTYPE_TEMPFILE:
-					pgstat_recv_tempfile(&msg.msg_tempfile, len);
-					break;
-
-				case PGSTAT_MTYPE_CHECKSUMFAILURE:
-					pgstat_recv_checksum_failure(&msg.msg_checksumfailure,
-												 len);
-					break;
-
-				case PGSTAT_MTYPE_REPLSLOT:
-					pgstat_recv_replslot(&msg.msg_replslot, len);
-					break;
-
-				case PGSTAT_MTYPE_CONNECTION:
-					pgstat_recv_connstat(&msg.msg_conn, len);
-					break;
-
-				default:
-					break;
-			}
+			if (pgstat_process_message() < 0)
+				break;			/* If an error occurred, go out of inner loop */
 		}						/* end of inner message-processing loop */
 
 		/* Sleep until there's something to do */
@@ -5077,6 +4928,21 @@ PgstatCollectorMain(int argc, char *argv[])
 			break;
 	}							/* end of outer loop */
 
+	/*
+	 * Try to receive and process remaining messages before the process exits.
+	 *
+	 * The reason is that there is no guarantee all messages were processed in
+	 * the above loop even though the stats collector is sent SIGQUIT signal
+	 * by the postmaster after other backend and background processes, which
+	 * sent their stats to the stats collector, exit if shutdown mode is smart
+	 * or fast.
+	 *
+	 * For example, there might be a case that messages are lost when there
+	 * are unprocessed messages, the postmaster send SIGQUIT signal to the
+	 * stats collector.
+	 */
+	while (pgstat_process_message() > 0);
+
 	/*
 	 * Save the final stats to reuse at next startup.
 	 */
@@ -5225,6 +5091,171 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
 	return result;
 }
 
+/*
+ * Try to receive and process a message.  This will not block,
+ * since the socket is set to non-blocking mode.
+ *
+ * XXX On Windows, we have to force pgwin32_recv to cooperate,
+ * despite the previous use of pg_set_noblock() on the socket.
+ * This is extremely broken and should be fixed someday.
+ *
+ * Return the number of processed message. -1 if an error occurred.
+ */
+static int
+pgstat_process_message()
+{
+	int			len;
+	PgStat_Msg	msg;
+
+#ifdef WIN32
+	pgwin32_noblock = 1;
+#endif
+
+	len = recv(pgStatSock, (char *) &msg, sizeof(PgStat_Msg), 0);
+
+#ifdef WIN32
+	pgwin32_noblock = 0;
+#endif
+
+	if (len < 0)
+	{
+		if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
+			return -1;
+		ereport(ERROR,
+				(errcode_for_socket_access(),
+				 errmsg("could not read statistics message: %m")));
+	}
+
+	/*
+	 * We ignore messages that are smaller than our common header
+	 */
+	if (len < sizeof(PgStat_MsgHdr))
+		return 0;
+
+	/*
+	 * The received length must match the length in the header
+	 */
+	if (msg.msg_hdr.m_size != len)
+		return 0;
+
+	/*
+	 * O.K. - we accept this message.  Process it.
+	 */
+	switch (msg.msg_hdr.m_type)
+	{
+		case PGSTAT_MTYPE_DUMMY:
+			break;
+
+		case PGSTAT_MTYPE_INQUIRY:
+			pgstat_recv_inquiry(&msg.msg_inquiry, len);
+			break;
+
+		case PGSTAT_MTYPE_TABSTAT:
+			pgstat_recv_tabstat(&msg.msg_tabstat, len);
+			break;
+
+		case PGSTAT_MTYPE_TABPURGE:
+			pgstat_recv_tabpurge(&msg.msg_tabpurge, len);
+			break;
+
+		case PGSTAT_MTYPE_DROPDB:
+			pgstat_recv_dropdb(&msg.msg_dropdb, len);
+			break;
+
+		case PGSTAT_MTYPE_RESETCOUNTER:
+			pgstat_recv_resetcounter(&msg.msg_resetcounter, len);
+			break;
+
+		case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
+			pgstat_recv_resetsharedcounter(&msg.msg_resetsharedcounter,
+										   len);
+			break;
+
+		case PGSTAT_MTYPE_RESETSINGLECOUNTER:
+			pgstat_recv_resetsinglecounter(&msg.msg_resetsinglecounter,
+										   len);
+			break;
+
+		case PGSTAT_MTYPE_RESETSLRUCOUNTER:
+			pgstat_recv_resetslrucounter(&msg.msg_resetslrucounter,
+										 len);
+			break;
+
+		case PGSTAT_MTYPE_RESETREPLSLOTCOUNTER:
+			pgstat_recv_resetreplslotcounter(&msg.msg_resetreplslotcounter,
+											 len);
+			break;
+
+		case PGSTAT_MTYPE_AUTOVAC_START:
+			pgstat_recv_autovac(&msg.msg_autovacuum_start, len);
+			break;
+
+		case PGSTAT_MTYPE_VACUUM:
+			pgstat_recv_vacuum(&msg.msg_vacuum, len);
+			break;
+
+		case PGSTAT_MTYPE_ANALYZE:
+			pgstat_recv_analyze(&msg.msg_analyze, len);
+			break;
+
+		case PGSTAT_MTYPE_ARCHIVER:
+			pgstat_recv_archiver(&msg.msg_archiver, len);
+			break;
+
+		case PGSTAT_MTYPE_BGWRITER:
+			elog(DEBUG3, "process BGWRITER stats message");
+			pgstat_recv_bgwriter(&msg.msg_bgwriter, len);
+			break;
+
+		case PGSTAT_MTYPE_WAL:
+			elog(DEBUG3, "process WAL stats message");
+			pgstat_recv_wal(&msg.msg_wal, len);
+			break;
+
+		case PGSTAT_MTYPE_SLRU:
+			pgstat_recv_slru(&msg.msg_slru, len);
+			break;
+
+		case PGSTAT_MTYPE_FUNCSTAT:
+			pgstat_recv_funcstat(&msg.msg_funcstat, len);
+			break;
+
+		case PGSTAT_MTYPE_FUNCPURGE:
+			pgstat_recv_funcpurge(&msg.msg_funcpurge, len);
+			break;
+
+		case PGSTAT_MTYPE_RECOVERYCONFLICT:
+			pgstat_recv_recoveryconflict(&msg.msg_recoveryconflict,
+										 len);
+			break;
+
+		case PGSTAT_MTYPE_DEADLOCK:
+			pgstat_recv_deadlock(&msg.msg_deadlock, len);
+			break;
+
+		case PGSTAT_MTYPE_TEMPFILE:
+			pgstat_recv_tempfile(&msg.msg_tempfile, len);
+			break;
+
+		case PGSTAT_MTYPE_CHECKSUMFAILURE:
+			pgstat_recv_checksum_failure(&msg.msg_checksumfailure,
+										 len);
+			break;
+
+		case PGSTAT_MTYPE_REPLSLOT:
+			pgstat_recv_replslot(&msg.msg_replslot, len);
+			break;
+
+		case PGSTAT_MTYPE_CONNECTION:
+			pgstat_recv_connstat(&msg.msg_conn, len);
+			break;
+
+		default:
+			break;
+	}
+
+	return 1;
+}
 
 /* ----------
  * pgstat_write_statsfiles() -
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index 132df29aba..45c8531ac8 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -78,6 +78,9 @@ int			WalWriterFlushAfter = 128;
 #define LOOPS_UNTIL_HIBERNATE		50
 #define HIBERNATE_FACTOR			25
 
+/* Prototypes for private functions */
+static void HandleWalWriterInterrupts(void);
+
 /*
  * Main entry point for walwriter process
  *
@@ -242,7 +245,7 @@ WalWriterMain(void)
 		/* Clear any already-pending wakeups */
 		ResetLatch(MyLatch);
 
-		HandleMainLoopInterrupts();
+		HandleWalWriterInterrupts();
 
 		/*
 		 * Do what we're here for; then, if XLogBackgroundFlush() found useful
@@ -272,3 +275,34 @@ WalWriterMain(void)
 						 WAIT_EVENT_WAL_WRITER_MAIN);
 	}
 }
+
+/*
+ * interrupt handler for main loops of WAL writer processes.
+ */
+static void
+HandleWalWriterInterrupts(void)
+{
+	if (ProcSignalBarrierPending)
+		ProcessProcSignalBarrier();
+
+	if (ConfigReloadPending)
+	{
+		ConfigReloadPending = false;
+		ProcessConfigFile(PGC_SIGHUP);
+	}
+
+	if (ShutdownRequestPending)
+	{
+		/*
+		 * Force to send remaining WAL statistics to the stats collector at
+		 * process exits.
+		 *
+		 * Since pgstat_send_wal is invoked with 'force' is false in main loop
+		 * to avoid overloading to the stats collector, there may exist unsent
+		 * stats counters for the WAL writer.
+		 */
+		pgstat_send_wal(true);
+
+		proc_exit(0);
+	}
+}
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 7810ee916c..3abd8ac93b 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -770,6 +770,9 @@ WalRcvDie(int code, Datum arg)
 	/* Ensure that all WAL records received are flushed to disk */
 	XLogWalRcvFlush(true);
 
+	/* Send WAL statistics to the stats collector before terminating */
+	pgstat_send_wal(true);
+
 	/* Mark ourselves inactive in shared memory */
 	SpinLockAcquire(&walrcv->mutex);
 	Assert(walrcv->walRcvState == WALRCV_STREAMING ||
@@ -907,6 +910,12 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 					XLogArchiveForceDone(xlogfname);
 				else
 					XLogArchiveNotify(xlogfname);
+
+				/*
+				 * Send WAL statistics to the stats collector when finishing
+				 * the current WAL segment file to avoid overloading it.
+				 */
+				pgstat_send_wal(false);
 			}
 			recvFile = -1;
 
@@ -928,7 +937,8 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 		/* OK to write the logs */
 		errno = 0;
 
-		byteswritten = pg_pwrite(recvFile, buf, segbytes, (off_t) startoff);
+		byteswritten = XLogWriteFile(recvFile, buf, segbytes, (off_t) startoff);
+
 		if (byteswritten <= 0)
 		{
 			char		xlogfname[MAXFNAMELEN];
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 1e53d9d4ca..b345de8a28 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -290,6 +290,7 @@ extern bool XLogBackgroundFlush(void);
 extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
 extern int	XLogFileInit(XLogSegNo segno, bool *use_existent, bool use_lock);
 extern int	XLogFileOpen(XLogSegNo segno);
+extern int	XLogWriteFile(int fd, const void *buf, size_t nbyte, off_t offset);
 
 extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
 extern XLogSegNo XLogGetLastRemovedSegno(void);

#54

Fujii Masao

masao.fujii@oss.nttdata.com

almost 5 years ago

In reply to: Masahiro Ikeda (#53)

2 attachment(s)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021/03/11 9:38, Masahiro Ikeda wrote:

On 2021-03-10 17:08, Fujii Masao wrote:
On 2021/03/10 14:11, Masahiro Ikeda wrote:
On 2021-03-09 17:51, Fujii Masao wrote:
On 2021/03/05 8:38, Masahiro Ikeda wrote:
On 2021-03-05 01:02, Fujii Masao wrote:
On 2021/03/04 16:14, Masahiro Ikeda wrote:
On 2021-03-03 20:27, Masahiro Ikeda wrote:
On 2021-03-03 16:30, Fujii Masao wrote:
On 2021/03/03 14:33, Masahiro Ikeda wrote:

On 2021-02-24 16:14, Fujii Masao wrote:

On 2021/02/15 11:59, Masahiro Ikeda wrote:

On 2021-02-10 00:51, David G. Johnston wrote:

On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

I pgindented the patches.

... <function>XLogWrite</function>, which is invoked during an
<function>XLogFlush</function> request (see ...). This is also
incremented by the WAL receiver during replication.

("which normally called" should be "which is normally called" or
"which normally is called" if you want to keep true to the original)
You missed the adding the space before an opening parenthesis here and
elsewhere (probably copy-paste)

is ether -> is either
"This parameter is off by default as it will repeatedly query the
operating system..."
", because" -> "as"

Thanks, I fixed them.

wal_write_time and the sync items also need the note: "This is also
incremented by the WAL receiver during replication."

I skipped changing it since I separated the stats for the WAL receiver
in pg_stat_wal_receiver.

"The number of times it happened..." -> " (the tally of this event is
reported in wal_buffers_full in....) This is undesirable because ..."

Thanks, I fixed it.

I notice that the patch for WAL receiver doesn't require explicitly
computing the sync statistics but does require computing the write
statistics. This is because of the presence of issue_xlog_fsync but
absence of an equivalent pg_xlog_pwrite. Additionally, I observe that
the XLogWrite code path calls pgstat_report_wait_*() while the WAL
receiver path does not. It seems technically straight-forward to
refactor here to avoid the almost-duplicated logic in the two places,
though I suspect there may be a trade-off for not adding another
function call to the stack given the importance of WAL processing
(though that seems marginalized compared to the cost of actually
writing the WAL). Or, as Fujii noted, go the other way and don't have
any shared code between the two but instead implement the WAL receiver
one to use pg_stat_wal_receiver instead. In either case, this
half-and-half implementation seems undesirable.

OK, as Fujii-san mentioned, I separated the WAL receiver stats.
(v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)

Thanks for updating the patches!

I added the infrastructure code to communicate the WAL receiver stats messages between the WAL receiver and the stats collector, and
the stats for WAL receiver is counted in pg_stat_wal_receiver.
What do you think?

On second thought, this idea seems not good. Because those stats are
collected between multiple walreceivers, but other values in
pg_stat_wal_receiver is only related to the walreceiver process running
at that moment. IOW, it seems strange that some values show dynamic
stats and the others show collected stats, even though they are in
the same view pg_stat_wal_receiver. Thought?

OK, I fixed it.
The stats collected in the WAL receiver is exposed in pg_stat_wal view in v11 patch.

Thanks for updating the patches! I'm now reading 001 patch.
+    /* Check whether the WAL file was synced to disk right now */
+    if (enableFsync &&
+        (sync_method == SYNC_METHOD_FSYNC ||
+         sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
+         sync_method == SYNC_METHOD_FDATASYNC))
+    {
Isn't it better to make issue_xlog_fsync() return immediately
if enableFsync is off, sync_method is open_sync or open_data_sync,
to simplify the code more?
Thanks for the comments.
I added the above code in v12 patch.
+        /*
+         * Send WAL statistics only if WalWriterDelay has elapsed to minimize
+         * the overhead in WAL-writing.
+         */
+        if (rc & WL_TIMEOUT)
+            pgstat_send_wal();
On second thought, this change means that it always takes wal_writer_delay
before walwriter's WAL stats is sent after XLogBackgroundFlush() is called.
For example, if wal_writer_delay is set to several seconds, some values in
pg_stat_wal would be not up-to-date meaninglessly for those seconds.
So I'm thinking to withdraw my previous comment and it's ok to send
the stats every after XLogBackgroundFlush() is called. Thought?
Thanks, I didn't notice that.

Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's
default value is 200msec and it may be set shorter time.
Yeah, if wal_writer_delay is set to very small value, there is a risk
that the WAL stats are sent too frequently. I agree that's a problem.
Why don't to make another way to check the timestamp?
+               /*
+                * Don't send a message unless it's been at least
PGSTAT_STAT_INTERVAL
+                * msec since we last sent one
+                */
+               now = GetCurrentTimestamp();
+               if (TimestampDifferenceExceeds(last_report, now,
PGSTAT_STAT_INTERVAL))
+               {
+                       pgstat_send_wal();
+                       last_report = now;
+               }
+
Although I worried that it's better to add the check code in pgstat_send_wal(),
Agreed.

I didn't do so because to avoid to double check PGSTAT_STAT_INTERVAL.
pgstat_send_wal() is invoked pg_report_stat() and it already checks the
PGSTAT_STAT_INTERVAL.

I think that we can do that. What about the attached patch?
Thanks, I thought it's better.
I forgot to remove an unused variable.
The attached v13 patch is fixed.

Thanks for updating the patch!
+        w.wal_write,
+        w.wal_write_time,
+        w.wal_sync,
+        w.wal_sync_time,
It's more natural to put wal_write_time and wal_sync_time next to
each other? That is, what about the following order of columns?

wal_write
wal_sync
wal_write_time
wal_sync_time
Yes, I fixed it.
-        case SYNC_METHOD_OPEN:
-        case SYNC_METHOD_OPEN_DSYNC:
-            /* write synced it already */
-            break;

IMO it's better to add Assert(false) here to ensure that we never reach
here, as follows. Thought?
+        case SYNC_METHOD_OPEN:
+        case SYNC_METHOD_OPEN_DSYNC:
+            /* not reachable */
+            Assert(false);
I agree.

Even when a backend exits, it sends the stats via pgstat_beshutdown_hook().
On the other hand, walwriter doesn't do that. Walwriter also should send
the stats even at its exit? Otherwise some stats can fail to be collected.
But ISTM that this issue existed from before, for example checkpointer
doesn't call pgstat_send_bgwriter() at its exit, so it's overkill to fix
this issue in this patch?

Thanks, I thought it's better to do so.
I added the shutdown hook for the walwriter and the checkpointer in v14-0003 patch.
Thanks for 0003 patch!

Isn't it overkill to send the stats in the walwriter-exit-callback? IMO we can
just send the stats only when ShutdownRequestPending is true in the walwriter
main loop (maybe just before calling HandleMainLoopInterrupts()).
If we do this, we cannot send the stats when walwriter throws FATAL error.
But that's ok because FATAL error on walwriter causes the server to crash.
Thought?
Thanks for your comments!
Yes, I agree.

Also ISTM that we don't need to use the callback for that purpose in
checkpointer because of the same reason. That is, we can send the stats
just after calling ShutdownXLOG(0, 0) in HandleCheckpointerInterrupts().
Thought?

Yes, I think so too.

Since ShutdownXLOG() may create restartpoint or checkpoint,
it might generate WAL records.

I'm now not sure how much useful these changes are. As far as I read pgstat.c,
when shutdown is requested, the stats collector seems to exit even when
there are outstanding stats messages. So if checkpointer and walwriter send
the stats in their last cycles, those stats might not be collected.

On the other hand, I can think that sending the stats in the last cycles would
improve the situation a bit than now. So I'm inclined to apply those changes...

I didn't notice that. I agree this is an important aspect.
I understood there is a case that the stats collector exits before the checkpointer
or the walwriter exits and some stats might not be collected.
IIUC the stats collector basically exits after checkpointer and walwriter exit.
But there seems no guarantee that the stats collector processes
all the messages that other processes have sent during the shutdown of
the server.
Thanks, I understood the above postmaster behaviors.

PMState manages the status and after checkpointer is exited, the postmaster sends
SIGQUIT signal to the stats collector if the shutdown mode is smart or fast.
(IIUC, although the postmaster kill the walsender, the archiver and
the stats collector at the same time, it's ok because the walsender
and the archiver doesn't send stats to the stats collector now.)

But, there might be a corner case to lose stats sent by background workers like
the checkpointer before they exit (although this is not implemented yet.)

For example,

1. checkpointer send the stats before it exit
2. stats collector receive the signal and break before processing
the stats message from checkpointer. In this case, 1's message is lost.
3. stats collector writes the stats in the statsfiles and exit

Why don't you recheck the coming message is zero just before the 2th procedure?
(v17-0004-guarantee-to-collect-last-stats-messages.patch)

Yes, I was thinking the same. This is the straight-forward fix for this issue.
The stats collector should process all the outstanding messages when
normal shutdown is requested, as the patch does. On the other hand,
if immediate shutdown is requested or emergency bailout (by postmaster death)
is requested, maybe the stats collector should skip those processings
and exit immediately.

But if we implement that, we would need to teach the stats collector
the shutdown type (i.e., normal shutdown or immediate one). Because
currently SIGQUIT is sent to the collector whichever shutdown is requested,
and so the collector cannot distinguish the shutdown type. I'm afraid that
change is a bit overkill for now.

BTW, I found that the collector calls pgstat_write_statsfiles() even at
emergency bailout case, before exiting. It's not necessary to save
the stats to the file in that case because subsequent server startup does
crash recovery and clears that stats file. So it's better to make
the collector exit immediately without calling pgstat_write_statsfiles()
at emergency bailout case? Probably this should be discussed in other
thread because it's different topic from the feature we're discussing here,
though.

I measured the timing of the above in my linux laptop using v17-measure-timing.patch.
I don't have any strong opinion to handle this case since this result shows to receive and processes
the messages takes too short time (less than 1ms) although the stats collector receives the shutdown
signal in 5msec(099->104) after the checkpointer process exits.

Agreed.

```
1615421204.556 [checkpointer] DEBUG: received shutdown request signal
1615421208.099 [checkpointer] DEBUG: proc_exit(-1): 0 callbacks to make              # exit and send the messages
1615421208.099 [stats collector] DEBUG: process BGWRITER stats message              # receive and process the messages
1615421208.099 [stats collector] DEBUG: process WAL stats message
1615421208.104 [postmaster] DEBUG: reaping dead processes
1615421208.104 [stats collector] DEBUG: received shutdown request signal             # receive shutdown request from the postmaster
```

Of course, there is another direction; we can improve the stats collector so
that it guarantees to collect all the sent stats messages. But I'm afraid
this change might be big.

For example, implement to manage background process status in shared memory and
the stats collector collects the stats until another background process exits?

In my understanding, the statistics are not required high accuracy,
it's ok to ignore them if the impact is not big.

If we guarantee high accuracy, another background process like autovacuum launcher
must send the WAL stats because it accesses the system catalog and might generate
WAL records due to HOT update even though the possibility is low.

I thought the impact is small because the time uncollected stats are generated is
short compared to the time from startup. So, it's ok to ignore the remaining stats
when the process exists.

I agree that it's not worth changing lots of code to collect such stats.
But if we can implement that very simply, isn't it more worth doing
that than current situation because we may be able to collect more
accurate stats.

Yes, I agree.
I attached the patch to send the stats before the wal writer and the checkpointer exit.
(v17-0001-send-stats-for-walwriter-when-shutdown.patch, v17-0002-send-stats-for-checkpointer-when-shutdown.patch)

Thanks for making those patches! Firstly I'm reading 0001 and 0002 patches.

Here is the review comments for 0001 patch.

+/* Prototypes for private functions */
+static void HandleWalWriterInterrupts(void);

HandleWalWriterInterrupts() and HandleMainLoopInterrupts() are almost the same.
So I don't think that we need to introduce HandleWalWriterInterrupts(). Instead,
we can just call pgstat_send_wal(true) before HandleMainLoopInterrupts()
if ShutdownRequestPending is true in the main loop. Attached is the patch
I implemented that way. Thought?

Here is the review comments for 0002 patch.

+static void pgstat_send_checkpointer(void);

I'm inclined to avoid adding the function with the prefix "pgstat_" outside
pgstat.c. Instead, I'm ok to just call both pgstat_send_bgwriter() and
pgstat_report_wal() directly after ShutdownXLOG(). Thought? Patch attached.

BTW, I found BgWriterStats.m_timed_checkpoints is not counted in ShutdownLOG()
and we need to count it if to collect stats before it exits.

Maybe m_requested_checkpoints should be incremented in that case?

I thought this should be incremented
because it invokes the methods with CHECKPOINT_IS_SHUTDOWN.

Yes.

```ShutdownXLOG()
CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
```

I fixed in v17-0002-send-stats-for-checkpointer-when-shutdown.patch.

In addition, I rebased the patch for WAL receiver.
(v17-0003-Makes-the-wal-receiver-report-WAL-statistics.patch)

Thanks! Will review this later.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachments:

v17-0001-send-stats-for-walwriter-when-shutdown_fujii.patchtext/plain; charset=UTF-8; name=v17-0001-send-stats-for-walwriter-when-shutdown_fujii.patch; x-mac-creator=0; x-mac-type=0Download

diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index 132df29aba..1b30fde505 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -242,6 +242,17 @@ WalWriterMain(void)
 		/* Clear any already-pending wakeups */
 		ResetLatch(MyLatch);
 
+		/*
+		 * Force to send remaining WAL statistics to the stats collector at
+		 * process exit.
+		 *
+		 * Since pgstat_send_wal() is invoked with 'force' is false in main loop
+		 * to avoid overloading to the stats collector, there may exist unsent
+		 * stats counters for the WAL writer.
+		 */
+		if (ShutdownRequestPending)
+			pgstat_send_wal(true);
+
 		HandleMainLoopInterrupts();
 
 		/*

v17-0002-send-stats-for-checkpointer-when-shutdown_fujii.patchtext/plain; charset=UTF-8; name=v17-0002-send-stats-for-checkpointer-when-shutdown_fujii.patch; x-mac-creator=0; x-mac-type=0Download

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 57c4d5a5d9..88190e51a0 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -572,8 +572,19 @@ HandleCheckpointerInterrupts(void)
 		 * back to the sigsetjmp block above
 		 */
 		ExitOnAnyError = true;
-		/* Close down the database */
+
+		/*
+		 * Close down the database.
+		 *
+		 * Since ShutdownXLOG() creates restartpoint or checkpoint, and updates
+		 * the statistics, increment the checkpoint request and send the
+		 * statistics to the stats collector.
+		 */
+		BgWriterStats.m_requested_checkpoints++;
 		ShutdownXLOG(0, 0);
+		pgstat_send_bgwriter();
+		pgstat_report_wal();
+
 		/* Normal exit from the checkpointer is here */
 		proc_exit(0);			/* done */
 	}

#55

Masahiro Ikeda

ikedamsh@oss.nttdata.com

almost 5 years ago

In reply to: Fujii Masao (#54)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021-03-11 11:52, Fujii Masao wrote:

On 2021/03/11 9:38, Masahiro Ikeda wrote:
On 2021-03-10 17:08, Fujii Masao wrote:
On 2021/03/10 14:11, Masahiro Ikeda wrote:
On 2021-03-09 17:51, Fujii Masao wrote:
On 2021/03/05 8:38, Masahiro Ikeda wrote:
On 2021-03-05 01:02, Fujii Masao wrote:
On 2021/03/04 16:14, Masahiro Ikeda wrote:
On 2021-03-03 20:27, Masahiro Ikeda wrote:
On 2021-03-03 16:30, Fujii Masao wrote:
On 2021/03/03 14:33, Masahiro Ikeda wrote:

On 2021-02-24 16:14, Fujii Masao wrote:

On 2021/02/15 11:59, Masahiro Ikeda wrote:

On 2021-02-10 00:51, David G. Johnston wrote:

On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

I pgindented the patches.

... <function>XLogWrite</function>, which is invoked
during an
<function>XLogFlush</function> request (see ...). This is
also
incremented by the WAL receiver during replication.

("which normally called" should be "which is normally
called" or
"which normally is called" if you want to keep true to the
original)
You missed the adding the space before an opening
parenthesis here and
elsewhere (probably copy-paste)

is ether -> is either
"This parameter is off by default as it will repeatedly
query the
operating system..."
", because" -> "as"

Thanks, I fixed them.

wal_write_time and the sync items also need the note:
"This is also
incremented by the WAL receiver during replication."

I skipped changing it since I separated the stats for the
WAL receiver
in pg_stat_wal_receiver.

"The number of times it happened..." -> " (the tally of
this event is
reported in wal_buffers_full in....) This is undesirable
because ..."

Thanks, I fixed it.

I notice that the patch for WAL receiver doesn't require
explicitly
computing the sync statistics but does require computing
the write
statistics. This is because of the presence of
issue_xlog_fsync but
absence of an equivalent pg_xlog_pwrite. Additionally, I
observe that
the XLogWrite code path calls pgstat_report_wait_*() while
the WAL
receiver path does not. It seems technically
straight-forward to
refactor here to avoid the almost-duplicated logic in the
two places,
though I suspect there may be a trade-off for not adding
another
function call to the stack given the importance of WAL
processing
(though that seems marginalized compared to the cost of
actually
writing the WAL). Or, as Fujii noted, go the other way
and don't have
any shared code between the two but instead implement the
WAL receiver
one to use pg_stat_wal_receiver instead. In either case,
this
half-and-half implementation seems undesirable.

OK, as Fujii-san mentioned, I separated the WAL receiver
stats.
(v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)

Thanks for updating the patches!

I added the infrastructure code to communicate the WAL
receiver stats messages between the WAL receiver and the
stats collector, and
the stats for WAL receiver is counted in
pg_stat_wal_receiver.
What do you think?

On second thought, this idea seems not good. Because those
stats are
collected between multiple walreceivers, but other values in
pg_stat_wal_receiver is only related to the walreceiver
process running
at that moment. IOW, it seems strange that some values show
dynamic
stats and the others show collected stats, even though they
are in
the same view pg_stat_wal_receiver. Thought?

OK, I fixed it.
The stats collected in the WAL receiver is exposed in
pg_stat_wal view in v11 patch.

Thanks for updating the patches! I'm now reading 001 patch.
+    /* Check whether the WAL file was synced to disk right 
now */
+    if (enableFsync &&
+        (sync_method == SYNC_METHOD_FSYNC ||
+         sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
+         sync_method == SYNC_METHOD_FDATASYNC))
+    {
Isn't it better to make issue_xlog_fsync() return immediately
if enableFsync is off, sync_method is open_sync or
open_data_sync,
to simplify the code more?
Thanks for the comments.
I added the above code in v12 patch.
+        /*
+         * Send WAL statistics only if WalWriterDelay has 
elapsed to minimize
+         * the overhead in WAL-writing.
+         */
+        if (rc & WL_TIMEOUT)
+            pgstat_send_wal();
On second thought, this change means that it always takes
wal_writer_delay
before walwriter's WAL stats is sent after
XLogBackgroundFlush() is called.
For example, if wal_writer_delay is set to several seconds,
some values in
pg_stat_wal would be not up-to-date meaninglessly for those
seconds.
So I'm thinking to withdraw my previous comment and it's ok to
send
the stats every after XLogBackgroundFlush() is called.
Thought?
Thanks, I didn't notice that.

Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's
default value is 200msec and it may be set shorter time.
Yeah, if wal_writer_delay is set to very small value, there is a
risk
that the WAL stats are sent too frequently. I agree that's a
problem.
Why don't to make another way to check the timestamp?
+               /*
+                * Don't send a message unless it's been at 
least
PGSTAT_STAT_INTERVAL
+                * msec since we last sent one
+                */
+               now = GetCurrentTimestamp();
+               if (TimestampDifferenceExceeds(last_report, 
now,
PGSTAT_STAT_INTERVAL))
+               {
+                       pgstat_send_wal();
+                       last_report = now;
+               }
+
Although I worried that it's better to add the check code in
pgstat_send_wal(),
Agreed.

I didn't do so because to avoid to double check
PGSTAT_STAT_INTERVAL.
pgstat_send_wal() is invoked pg_report_stat() and it already
checks the
PGSTAT_STAT_INTERVAL.

I think that we can do that. What about the attached patch?
Thanks, I thought it's better.
I forgot to remove an unused variable.
The attached v13 patch is fixed.

Thanks for updating the patch!
+        w.wal_write,
+        w.wal_write_time,
+        w.wal_sync,
+        w.wal_sync_time,
It's more natural to put wal_write_time and wal_sync_time next to
each other? That is, what about the following order of columns?

wal_write
wal_sync
wal_write_time
wal_sync_time
Yes, I fixed it.
-        case SYNC_METHOD_OPEN:
-        case SYNC_METHOD_OPEN_DSYNC:
-            /* write synced it already */
-            break;

IMO it's better to add Assert(false) here to ensure that we never
reach
here, as follows. Thought?
+        case SYNC_METHOD_OPEN:
+        case SYNC_METHOD_OPEN_DSYNC:
+            /* not reachable */
+            Assert(false);
I agree.

Even when a backend exits, it sends the stats via
pgstat_beshutdown_hook().
On the other hand, walwriter doesn't do that. Walwriter also
should send
the stats even at its exit? Otherwise some stats can fail to be
collected.
But ISTM that this issue existed from before, for example
checkpointer
doesn't call pgstat_send_bgwriter() at its exit, so it's overkill
to fix
this issue in this patch?

Thanks, I thought it's better to do so.
I added the shutdown hook for the walwriter and the checkpointer
in v14-0003 patch.
Thanks for 0003 patch!

Isn't it overkill to send the stats in the walwriter-exit-callback?
IMO we can
just send the stats only when ShutdownRequestPending is true in the
walwriter
main loop (maybe just before calling HandleMainLoopInterrupts()).
If we do this, we cannot send the stats when walwriter throws FATAL
error.
But that's ok because FATAL error on walwriter causes the server to
crash.
Thought?
Thanks for your comments!
Yes, I agree.

Also ISTM that we don't need to use the callback for that purpose
in
checkpointer because of the same reason. That is, we can send the
stats
just after calling ShutdownXLOG(0, 0) in
HandleCheckpointerInterrupts().
Thought?

Yes, I think so too.

Since ShutdownXLOG() may create restartpoint or checkpoint,
it might generate WAL records.

I'm now not sure how much useful these changes are. As far as I
read pgstat.c,
when shutdown is requested, the stats collector seems to exit even
when
there are outstanding stats messages. So if checkpointer and
walwriter send
the stats in their last cycles, those stats might not be collected.

On the other hand, I can think that sending the stats in the last
cycles would
improve the situation a bit than now. So I'm inclined to apply
those changes...

I didn't notice that. I agree this is an important aspect.
I understood there is a case that the stats collector exits before
the checkpointer
or the walwriter exits and some stats might not be collected.
IIUC the stats collector basically exits after checkpointer and
walwriter exit.
But there seems no guarantee that the stats collector processes
all the messages that other processes have sent during the shutdown
of
the server.
Thanks, I understood the above postmaster behaviors.

PMState manages the status and after checkpointer is exited, the
postmaster sends
SIGQUIT signal to the stats collector if the shutdown mode is smart or
fast.
(IIUC, although the postmaster kill the walsender, the archiver and
the stats collector at the same time, it's ok because the walsender
and the archiver doesn't send stats to the stats collector now.)

But, there might be a corner case to lose stats sent by background
workers like
the checkpointer before they exit (although this is not implemented
yet.)

For example,

1. checkpointer send the stats before it exit
2. stats collector receive the signal and break before processing
the stats message from checkpointer. In this case, 1's message is
lost.
3. stats collector writes the stats in the statsfiles and exit

Why don't you recheck the coming message is zero just before the 2th
procedure?
(v17-0004-guarantee-to-collect-last-stats-messages.patch)
Yes, I was thinking the same. This is the straight-forward fix for this
issue.
The stats collector should process all the outstanding messages when
normal shutdown is requested, as the patch does. On the other hand,
if immediate shutdown is requested or emergency bailout (by postmaster
death)
is requested, maybe the stats collector should skip those processings
and exit immediately.

But if we implement that, we would need to teach the stats collector
the shutdown type (i.e., normal shutdown or immediate one). Because
currently SIGQUIT is sent to the collector whichever shutdown is
requested,
and so the collector cannot distinguish the shutdown type. I'm afraid
that
change is a bit overkill for now.

BTW, I found that the collector calls pgstat_write_statsfiles() even at
emergency bailout case, before exiting. It's not necessary to save
the stats to the file in that case because subsequent server startup
does
crash recovery and clears that stats file. So it's better to make
the collector exit immediately without calling
pgstat_write_statsfiles()
at emergency bailout case? Probably this should be discussed in other
thread because it's different topic from the feature we're discussing
here,
though.

IIUC, only the stats collector has another hander for SIGQUIT although
other background processes have a common hander for it and just call
_exit(2).
I thought to guarantee when TerminateChildren(SIGTERM) is invoked, don't
make stats
collector shutdown before other background processes are shutdown.

I will make another thread to discuss that the stats collector should
know the shutdown type or not.
If it should be, it's better to make the stats collector exit as soon as
possible if the shutdown type
is an immediate, and avoid losing the remaining stats if it's normal.

I measured the timing of the above in my linux laptop using
v17-measure-timing.patch.
I don't have any strong opinion to handle this case since this result
shows to receive and processes
the messages takes too short time (less than 1ms) although the stats
collector receives the shutdown
signal in 5msec(099->104) after the checkpointer process exits.

Agreed.

```
1615421204.556 [checkpointer] DEBUG: received shutdown request signal
1615421208.099 [checkpointer] DEBUG: proc_exit(-1): 0 callbacks to
make              # exit and send the messages
1615421208.099 [stats collector] DEBUG: process BGWRITER stats
message              # receive and process the messages
1615421208.099 [stats collector] DEBUG: process WAL stats message
1615421208.104 [postmaster] DEBUG: reaping dead processes
1615421208.104 [stats collector] DEBUG: received shutdown request
signal             # receive shutdown request from the postmaster
```

Of course, there is another direction; we can improve the stats
collector so
that it guarantees to collect all the sent stats messages. But I'm
afraid
this change might be big.

For example, implement to manage background process status in shared
memory and
the stats collector collects the stats until another background
process exits?

In my understanding, the statistics are not required high accuracy,
it's ok to ignore them if the impact is not big.

If we guarantee high accuracy, another background process like
autovacuum launcher
must send the WAL stats because it accesses the system catalog and
might generate
WAL records due to HOT update even though the possibility is low.

I thought the impact is small because the time uncollected stats are
generated is
short compared to the time from startup. So, it's ok to ignore the
remaining stats
when the process exists.

I agree that it's not worth changing lots of code to collect such
stats.
But if we can implement that very simply, isn't it more worth doing
that than current situation because we may be able to collect more
accurate stats.

Yes, I agree.
I attached the patch to send the stats before the wal writer and the
checkpointer exit.
(v17-0001-send-stats-for-walwriter-when-shutdown.patch,
v17-0002-send-stats-for-checkpointer-when-shutdown.patch)

Thanks for making those patches! Firstly I'm reading 0001 and 0002
patches.

Thanks for your comments and for making patches.

Here is the review comments for 0001 patch.
+/* Prototypes for private functions */
+static void HandleWalWriterInterrupts(void);
HandleWalWriterInterrupts() and HandleMainLoopInterrupts() are almost
the same.
So I don't think that we need to introduce HandleWalWriterInterrupts().
Instead,
we can just call pgstat_send_wal(true) before
HandleMainLoopInterrupts()
if ShutdownRequestPending is true in the main loop. Attached is the
patch
I implemented that way. Thought?

I thought there is a corner case that can't send the stats like

```
// First, ShutdownRequstPending = false

if (ShutdownRequestPending) // don't send the stats
pgstat_send_wal(true);

// receive signal and ShutdownRequestPending became true

HandleMainLoopInterrupts(); // proc exit without sending the stats

```

Is it ok because it almost never occurs?

Here is the review comments for 0002 patch.

+static void pgstat_send_checkpointer(void);

I'm inclined to avoid adding the function with the prefix "pgstat_"
outside
pgstat.c. Instead, I'm ok to just call both pgstat_send_bgwriter() and
pgstat_report_wal() directly after ShutdownXLOG(). Thought? Patch
attached.

Thanks. I agree.

BTW, I found BgWriterStats.m_timed_checkpoints is not counted in
ShutdownLOG()
and we need to count it if to collect stats before it exits.

Maybe m_requested_checkpoints should be incremented in that case?

I thought this should be incremented
because it invokes the methods with CHECKPOINT_IS_SHUTDOWN.

Yes.

OK, thanks.

```ShutdownXLOG()
CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
```

I fixed in v17-0002-send-stats-for-checkpointer-when-shutdown.patch.

In addition, I rebased the patch for WAL receiver.
(v17-0003-Makes-the-wal-receiver-report-WAL-statistics.patch)

Thanks! Will review this later.

Thanks a lot!

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

#56

Fujii Masao

masao.fujii@oss.nttdata.com

almost 5 years ago

In reply to: Masahiro Ikeda (#55)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021/03/11 21:29, Masahiro Ikeda wrote:

On 2021-03-11 11:52, Fujii Masao wrote:
On 2021/03/11 9:38, Masahiro Ikeda wrote:
On 2021-03-10 17:08, Fujii Masao wrote:
On 2021/03/10 14:11, Masahiro Ikeda wrote:
On 2021-03-09 17:51, Fujii Masao wrote:
On 2021/03/05 8:38, Masahiro Ikeda wrote:
On 2021-03-05 01:02, Fujii Masao wrote:
On 2021/03/04 16:14, Masahiro Ikeda wrote:
On 2021-03-03 20:27, Masahiro Ikeda wrote:
On 2021-03-03 16:30, Fujii Masao wrote:
On 2021/03/03 14:33, Masahiro Ikeda wrote:

On 2021-02-24 16:14, Fujii Masao wrote:

On 2021/02/15 11:59, Masahiro Ikeda wrote:

On 2021-02-10 00:51, David G. Johnston wrote:

On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

I pgindented the patches.

... <function>XLogWrite</function>, which is invoked during an
<function>XLogFlush</function> request (see ...). This is also
incremented by the WAL receiver during replication.

("which normally called" should be "which is normally called" or
"which normally is called" if you want to keep true to the original)
You missed the adding the space before an opening parenthesis here and
elsewhere (probably copy-paste)

is ether -> is either
"This parameter is off by default as it will repeatedly query the
operating system..."
", because" -> "as"

Thanks, I fixed them.

wal_write_time and the sync items also need the note: "This is also
incremented by the WAL receiver during replication."

I skipped changing it since I separated the stats for the WAL receiver
in pg_stat_wal_receiver.

"The number of times it happened..." -> " (the tally of this event is
reported in wal_buffers_full in....) This is undesirable because ..."

Thanks, I fixed it.

I notice that the patch for WAL receiver doesn't require explicitly
computing the sync statistics but does require computing the write
statistics. This is because of the presence of issue_xlog_fsync but
absence of an equivalent pg_xlog_pwrite. Additionally, I observe that
the XLogWrite code path calls pgstat_report_wait_*() while the WAL
receiver path does not. It seems technically straight-forward to
refactor here to avoid the almost-duplicated logic in the two places,
though I suspect there may be a trade-off for not adding another
function call to the stack given the importance of WAL processing
(though that seems marginalized compared to the cost of actually
writing the WAL). Or, as Fujii noted, go the other way and don't have
any shared code between the two but instead implement the WAL receiver
one to use pg_stat_wal_receiver instead. In either case, this
half-and-half implementation seems undesirable.

OK, as Fujii-san mentioned, I separated the WAL receiver stats.
(v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)

Thanks for updating the patches!

I added the infrastructure code to communicate the WAL receiver stats messages between the WAL receiver and the stats collector, and
the stats for WAL receiver is counted in pg_stat_wal_receiver.
What do you think?

On second thought, this idea seems not good. Because those stats are
collected between multiple walreceivers, but other values in
pg_stat_wal_receiver is only related to the walreceiver process running
at that moment. IOW, it seems strange that some values show dynamic
stats and the others show collected stats, even though they are in
the same view pg_stat_wal_receiver. Thought?

OK, I fixed it.
The stats collected in the WAL receiver is exposed in pg_stat_wal view in v11 patch.

Thanks for updating the patches! I'm now reading 001 patch.
+    /* Check whether the WAL file was synced to disk right now */
+    if (enableFsync &&
+        (sync_method == SYNC_METHOD_FSYNC ||
+         sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
+         sync_method == SYNC_METHOD_FDATASYNC))
+    {
Isn't it better to make issue_xlog_fsync() return immediately
if enableFsync is off, sync_method is open_sync or open_data_sync,
to simplify the code more?
Thanks for the comments.
I added the above code in v12 patch.
+        /*
+         * Send WAL statistics only if WalWriterDelay has elapsed to minimize
+         * the overhead in WAL-writing.
+         */
+        if (rc & WL_TIMEOUT)
+            pgstat_send_wal();
On second thought, this change means that it always takes wal_writer_delay
before walwriter's WAL stats is sent after XLogBackgroundFlush() is called.
For example, if wal_writer_delay is set to several seconds, some values in
pg_stat_wal would be not up-to-date meaninglessly for those seconds.
So I'm thinking to withdraw my previous comment and it's ok to send
the stats every after XLogBackgroundFlush() is called. Thought?
Thanks, I didn't notice that.

Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's
default value is 200msec and it may be set shorter time.
Yeah, if wal_writer_delay is set to very small value, there is a risk
that the WAL stats are sent too frequently. I agree that's a problem.
Why don't to make another way to check the timestamp?
+               /*
+                * Don't send a message unless it's been at least
PGSTAT_STAT_INTERVAL
+                * msec since we last sent one
+                */
+               now = GetCurrentTimestamp();
+               if (TimestampDifferenceExceeds(last_report, now,
PGSTAT_STAT_INTERVAL))
+               {
+                       pgstat_send_wal();
+                       last_report = now;
+               }
+
Although I worried that it's better to add the check code in pgstat_send_wal(),
Agreed.

I didn't do so because to avoid to double check PGSTAT_STAT_INTERVAL.
pgstat_send_wal() is invoked pg_report_stat() and it already checks the
PGSTAT_STAT_INTERVAL.

I think that we can do that. What about the attached patch?
Thanks, I thought it's better.
I forgot to remove an unused variable.
The attached v13 patch is fixed.

Thanks for updating the patch!
+        w.wal_write,
+        w.wal_write_time,
+        w.wal_sync,
+        w.wal_sync_time,
It's more natural to put wal_write_time and wal_sync_time next to
each other? That is, what about the following order of columns?

wal_write
wal_sync
wal_write_time
wal_sync_time
Yes, I fixed it.
-        case SYNC_METHOD_OPEN:
-        case SYNC_METHOD_OPEN_DSYNC:
-            /* write synced it already */
-            break;

IMO it's better to add Assert(false) here to ensure that we never reach
here, as follows. Thought?
+        case SYNC_METHOD_OPEN:
+        case SYNC_METHOD_OPEN_DSYNC:
+            /* not reachable */
+            Assert(false);
I agree.

Even when a backend exits, it sends the stats via pgstat_beshutdown_hook().
On the other hand, walwriter doesn't do that. Walwriter also should send
the stats even at its exit? Otherwise some stats can fail to be collected.
But ISTM that this issue existed from before, for example checkpointer
doesn't call pgstat_send_bgwriter() at its exit, so it's overkill to fix
this issue in this patch?

Thanks, I thought it's better to do so.
I added the shutdown hook for the walwriter and the checkpointer in v14-0003 patch.
Thanks for 0003 patch!

Isn't it overkill to send the stats in the walwriter-exit-callback? IMO we can
just send the stats only when ShutdownRequestPending is true in the walwriter
main loop (maybe just before calling HandleMainLoopInterrupts()).
If we do this, we cannot send the stats when walwriter throws FATAL error.
But that's ok because FATAL error on walwriter causes the server to crash.
Thought?
Thanks for your comments!
Yes, I agree.

Also ISTM that we don't need to use the callback for that purpose in
checkpointer because of the same reason. That is, we can send the stats
just after calling ShutdownXLOG(0, 0) in HandleCheckpointerInterrupts().
Thought?

Yes, I think so too.

Since ShutdownXLOG() may create restartpoint or checkpoint,
it might generate WAL records.

I'm now not sure how much useful these changes are. As far as I read pgstat.c,
when shutdown is requested, the stats collector seems to exit even when
there are outstanding stats messages. So if checkpointer and walwriter send
the stats in their last cycles, those stats might not be collected.

On the other hand, I can think that sending the stats in the last cycles would
improve the situation a bit than now. So I'm inclined to apply those changes...

I didn't notice that. I agree this is an important aspect.
I understood there is a case that the stats collector exits before the checkpointer
or the walwriter exits and some stats might not be collected.
IIUC the stats collector basically exits after checkpointer and walwriter exit.
But there seems no guarantee that the stats collector processes
all the messages that other processes have sent during the shutdown of
the server.
Thanks, I understood the above postmaster behaviors.

PMState manages the status and after checkpointer is exited, the postmaster sends
SIGQUIT signal to the stats collector if the shutdown mode is smart or fast.
(IIUC, although the postmaster kill the walsender, the archiver and
the stats collector at the same time, it's ok because the walsender
and the archiver doesn't send stats to the stats collector now.)

But, there might be a corner case to lose stats sent by background workers like
the checkpointer before they exit (although this is not implemented yet.)

For example,

1. checkpointer send the stats before it exit
2. stats collector receive the signal and break before processing
the stats message from checkpointer. In this case, 1's message is lost.
3. stats collector writes the stats in the statsfiles and exit

Why don't you recheck the coming message is zero just before the 2th procedure?
(v17-0004-guarantee-to-collect-last-stats-messages.patch)
Yes, I was thinking the same. This is the straight-forward fix for this issue.
The stats collector should process all the outstanding messages when
normal shutdown is requested, as the patch does. On the other hand,
if immediate shutdown is requested or emergency bailout (by postmaster death)
is requested, maybe the stats collector should skip those processings
and exit immediately.

But if we implement that, we would need to teach the stats collector
the shutdown type (i.e., normal shutdown or immediate one). Because
currently SIGQUIT is sent to the collector whichever shutdown is requested,
and so the collector cannot distinguish the shutdown type. I'm afraid that
change is a bit overkill for now.

BTW, I found that the collector calls pgstat_write_statsfiles() even at
emergency bailout case, before exiting. It's not necessary to save
the stats to the file in that case because subsequent server startup does
crash recovery and clears that stats file. So it's better to make
the collector exit immediately without calling pgstat_write_statsfiles()
at emergency bailout case? Probably this should be discussed in other
thread because it's different topic from the feature we're discussing here,
though.
IIUC, only the stats collector has another hander for SIGQUIT although
other background processes have a common hander for it and just call _exit(2).
I thought to guarantee when TerminateChildren(SIGTERM) is invoked, don't make stats
collector shutdown before other background processes are shutdown.

I will make another thread to discuss that the stats collector should know the shutdown type or not.
If it should be, it's better to make the stats collector exit as soon as possible if the shutdown type
is an immediate, and avoid losing the remaining stats if it's normal.

I measured the timing of the above in my linux laptop using v17-measure-timing.patch.
I don't have any strong opinion to handle this case since this result shows to receive and processes
the messages takes too short time (less than 1ms) although the stats collector receives the shutdown
signal in 5msec(099->104) after the checkpointer process exits.

Agreed.

```
1615421204.556 [checkpointer] DEBUG: received shutdown request signal
1615421208.099 [checkpointer] DEBUG: proc_exit(-1): 0 callbacks to make              # exit and send the messages
1615421208.099 [stats collector] DEBUG: process BGWRITER stats message              # receive and process the messages
1615421208.099 [stats collector] DEBUG: process WAL stats message
1615421208.104 [postmaster] DEBUG: reaping dead processes
1615421208.104 [stats collector] DEBUG: received shutdown request signal             # receive shutdown request from the postmaster
```

Of course, there is another direction; we can improve the stats collector so
that it guarantees to collect all the sent stats messages. But I'm afraid
this change might be big.

For example, implement to manage background process status in shared memory and
the stats collector collects the stats until another background process exits?

In my understanding, the statistics are not required high accuracy,
it's ok to ignore them if the impact is not big.

If we guarantee high accuracy, another background process like autovacuum launcher
must send the WAL stats because it accesses the system catalog and might generate
WAL records due to HOT update even though the possibility is low.

I thought the impact is small because the time uncollected stats are generated is
short compared to the time from startup. So, it's ok to ignore the remaining stats
when the process exists.

I agree that it's not worth changing lots of code to collect such stats.
But if we can implement that very simply, isn't it more worth doing
that than current situation because we may be able to collect more
accurate stats.

Yes, I agree.
I attached the patch to send the stats before the wal writer and the checkpointer exit.
(v17-0001-send-stats-for-walwriter-when-shutdown.patch, v17-0002-send-stats-for-checkpointer-when-shutdown.patch)

Thanks for making those patches! Firstly I'm reading 0001 and 0002 patches.

Thanks for your comments and for making patches.
Here is the review comments for 0001 patch.
+/* Prototypes for private functions */
+static void HandleWalWriterInterrupts(void);
HandleWalWriterInterrupts() and HandleMainLoopInterrupts() are almost the same.
So I don't think that we need to introduce HandleWalWriterInterrupts(). Instead,
we can just call pgstat_send_wal(true) before HandleMainLoopInterrupts()
if ShutdownRequestPending is true in the main loop. Attached is the patch
I implemented that way. Thought?
I thought there is a corner case that can't send the stats like

You're right! So IMO your patch (v17-0001-send-stats-for-walwriter-when-shutdown.patch) is better.

```
// First, ShutdownRequstPending = false

    if (ShutdownRequestPending)    // don't send the stats
        pgstat_send_wal(true);

// receive signal and ShutdownRequestPending became true

    HandleMainLoopInterrupts();   // proc exit without sending the stats

```

Is it ok because it almost never occurs?

Here is the review comments for 0002 patch.

+static void pgstat_send_checkpointer(void);

I'm inclined to avoid adding the function with the prefix "pgstat_" outside
pgstat.c. Instead, I'm ok to just call both pgstat_send_bgwriter() and
pgstat_report_wal() directly after ShutdownXLOG(). Thought? Patch attached.

Thanks. I agree.

Thanks for the review!

So, barring any objection, I will commit the changes for
walwriter and checkpointer. That is,

v17-0001-send-stats-for-walwriter-when-shutdown.patch
v17-0002-send-stats-for-checkpointer-when-shutdown_fujii.patch

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#57

Fujii Masao

masao.fujii@oss.nttdata.com

almost 5 years ago

In reply to: Masahiro Ikeda (#55)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021/03/11 21:29, Masahiro Ikeda wrote:

On 2021-03-11 11:52, Fujii Masao wrote:
On 2021/03/11 9:38, Masahiro Ikeda wrote:
On 2021-03-10 17:08, Fujii Masao wrote:
On 2021/03/10 14:11, Masahiro Ikeda wrote:
On 2021-03-09 17:51, Fujii Masao wrote:
On 2021/03/05 8:38, Masahiro Ikeda wrote:
On 2021-03-05 01:02, Fujii Masao wrote:
On 2021/03/04 16:14, Masahiro Ikeda wrote:
On 2021-03-03 20:27, Masahiro Ikeda wrote:
On 2021-03-03 16:30, Fujii Masao wrote:
On 2021/03/03 14:33, Masahiro Ikeda wrote:

On 2021-02-24 16:14, Fujii Masao wrote:

On 2021/02/15 11:59, Masahiro Ikeda wrote:

On 2021-02-10 00:51, David G. Johnston wrote:

On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

I pgindented the patches.

... <function>XLogWrite</function>, which is invoked during an
<function>XLogFlush</function> request (see ...). This is also
incremented by the WAL receiver during replication.

("which normally called" should be "which is normally called" or
"which normally is called" if you want to keep true to the original)
You missed the adding the space before an opening parenthesis here and
elsewhere (probably copy-paste)

is ether -> is either
"This parameter is off by default as it will repeatedly query the
operating system..."
", because" -> "as"

Thanks, I fixed them.

wal_write_time and the sync items also need the note: "This is also
incremented by the WAL receiver during replication."

I skipped changing it since I separated the stats for the WAL receiver
in pg_stat_wal_receiver.

"The number of times it happened..." -> " (the tally of this event is
reported in wal_buffers_full in....) This is undesirable because ..."

Thanks, I fixed it.

I notice that the patch for WAL receiver doesn't require explicitly
computing the sync statistics but does require computing the write
statistics. This is because of the presence of issue_xlog_fsync but
absence of an equivalent pg_xlog_pwrite. Additionally, I observe that
the XLogWrite code path calls pgstat_report_wait_*() while the WAL
receiver path does not. It seems technically straight-forward to
refactor here to avoid the almost-duplicated logic in the two places,
though I suspect there may be a trade-off for not adding another
function call to the stack given the importance of WAL processing
(though that seems marginalized compared to the cost of actually
writing the WAL). Or, as Fujii noted, go the other way and don't have
any shared code between the two but instead implement the WAL receiver
one to use pg_stat_wal_receiver instead. In either case, this
half-and-half implementation seems undesirable.

OK, as Fujii-san mentioned, I separated the WAL receiver stats.
(v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)

Thanks for updating the patches!

I added the infrastructure code to communicate the WAL receiver stats messages between the WAL receiver and the stats collector, and
the stats for WAL receiver is counted in pg_stat_wal_receiver.
What do you think?

On second thought, this idea seems not good. Because those stats are
collected between multiple walreceivers, but other values in
pg_stat_wal_receiver is only related to the walreceiver process running
at that moment. IOW, it seems strange that some values show dynamic
stats and the others show collected stats, even though they are in
the same view pg_stat_wal_receiver. Thought?

OK, I fixed it.
The stats collected in the WAL receiver is exposed in pg_stat_wal view in v11 patch.

Thanks for updating the patches! I'm now reading 001 patch.
+    /* Check whether the WAL file was synced to disk right now */
+    if (enableFsync &&
+        (sync_method == SYNC_METHOD_FSYNC ||
+         sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
+         sync_method == SYNC_METHOD_FDATASYNC))
+    {
Isn't it better to make issue_xlog_fsync() return immediately
if enableFsync is off, sync_method is open_sync or open_data_sync,
to simplify the code more?
Thanks for the comments.
I added the above code in v12 patch.
+        /*
+         * Send WAL statistics only if WalWriterDelay has elapsed to minimize
+         * the overhead in WAL-writing.
+         */
+        if (rc & WL_TIMEOUT)
+            pgstat_send_wal();
On second thought, this change means that it always takes wal_writer_delay
before walwriter's WAL stats is sent after XLogBackgroundFlush() is called.
For example, if wal_writer_delay is set to several seconds, some values in
pg_stat_wal would be not up-to-date meaninglessly for those seconds.
So I'm thinking to withdraw my previous comment and it's ok to send
the stats every after XLogBackgroundFlush() is called. Thought?
Thanks, I didn't notice that.

Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's
default value is 200msec and it may be set shorter time.
Yeah, if wal_writer_delay is set to very small value, there is a risk
that the WAL stats are sent too frequently. I agree that's a problem.
Why don't to make another way to check the timestamp?
+               /*
+                * Don't send a message unless it's been at least
PGSTAT_STAT_INTERVAL
+                * msec since we last sent one
+                */
+               now = GetCurrentTimestamp();
+               if (TimestampDifferenceExceeds(last_report, now,
PGSTAT_STAT_INTERVAL))
+               {
+                       pgstat_send_wal();
+                       last_report = now;
+               }
+
Although I worried that it's better to add the check code in pgstat_send_wal(),
Agreed.

I didn't do so because to avoid to double check PGSTAT_STAT_INTERVAL.
pgstat_send_wal() is invoked pg_report_stat() and it already checks the
PGSTAT_STAT_INTERVAL.

I think that we can do that. What about the attached patch?
Thanks, I thought it's better.
I forgot to remove an unused variable.
The attached v13 patch is fixed.

Thanks for updating the patch!
+        w.wal_write,
+        w.wal_write_time,
+        w.wal_sync,
+        w.wal_sync_time,
It's more natural to put wal_write_time and wal_sync_time next to
each other? That is, what about the following order of columns?

wal_write
wal_sync
wal_write_time
wal_sync_time
Yes, I fixed it.
-        case SYNC_METHOD_OPEN:
-        case SYNC_METHOD_OPEN_DSYNC:
-            /* write synced it already */
-            break;

IMO it's better to add Assert(false) here to ensure that we never reach
here, as follows. Thought?
+        case SYNC_METHOD_OPEN:
+        case SYNC_METHOD_OPEN_DSYNC:
+            /* not reachable */
+            Assert(false);
I agree.

Even when a backend exits, it sends the stats via pgstat_beshutdown_hook().
On the other hand, walwriter doesn't do that. Walwriter also should send
the stats even at its exit? Otherwise some stats can fail to be collected.
But ISTM that this issue existed from before, for example checkpointer
doesn't call pgstat_send_bgwriter() at its exit, so it's overkill to fix
this issue in this patch?

Thanks, I thought it's better to do so.
I added the shutdown hook for the walwriter and the checkpointer in v14-0003 patch.
Thanks for 0003 patch!

Isn't it overkill to send the stats in the walwriter-exit-callback? IMO we can
just send the stats only when ShutdownRequestPending is true in the walwriter
main loop (maybe just before calling HandleMainLoopInterrupts()).
If we do this, we cannot send the stats when walwriter throws FATAL error.
But that's ok because FATAL error on walwriter causes the server to crash.
Thought?
Thanks for your comments!
Yes, I agree.

Also ISTM that we don't need to use the callback for that purpose in
checkpointer because of the same reason. That is, we can send the stats
just after calling ShutdownXLOG(0, 0) in HandleCheckpointerInterrupts().
Thought?

Yes, I think so too.

Since ShutdownXLOG() may create restartpoint or checkpoint,
it might generate WAL records.

I'm now not sure how much useful these changes are. As far as I read pgstat.c,
when shutdown is requested, the stats collector seems to exit even when
there are outstanding stats messages. So if checkpointer and walwriter send
the stats in their last cycles, those stats might not be collected.

On the other hand, I can think that sending the stats in the last cycles would
improve the situation a bit than now. So I'm inclined to apply those changes...

I didn't notice that. I agree this is an important aspect.
I understood there is a case that the stats collector exits before the checkpointer
or the walwriter exits and some stats might not be collected.
IIUC the stats collector basically exits after checkpointer and walwriter exit.
But there seems no guarantee that the stats collector processes
all the messages that other processes have sent during the shutdown of
the server.
Thanks, I understood the above postmaster behaviors.

PMState manages the status and after checkpointer is exited, the postmaster sends
SIGQUIT signal to the stats collector if the shutdown mode is smart or fast.
(IIUC, although the postmaster kill the walsender, the archiver and
the stats collector at the same time, it's ok because the walsender
and the archiver doesn't send stats to the stats collector now.)

But, there might be a corner case to lose stats sent by background workers like
the checkpointer before they exit (although this is not implemented yet.)

For example,

1. checkpointer send the stats before it exit
2. stats collector receive the signal and break before processing
the stats message from checkpointer. In this case, 1's message is lost.
3. stats collector writes the stats in the statsfiles and exit

Why don't you recheck the coming message is zero just before the 2th procedure?
(v17-0004-guarantee-to-collect-last-stats-messages.patch)
Yes, I was thinking the same. This is the straight-forward fix for this issue.
The stats collector should process all the outstanding messages when
normal shutdown is requested, as the patch does. On the other hand,
if immediate shutdown is requested or emergency bailout (by postmaster death)
is requested, maybe the stats collector should skip those processings
and exit immediately.

But if we implement that, we would need to teach the stats collector
the shutdown type (i.e., normal shutdown or immediate one). Because
currently SIGQUIT is sent to the collector whichever shutdown is requested,
and so the collector cannot distinguish the shutdown type. I'm afraid that
change is a bit overkill for now.

BTW, I found that the collector calls pgstat_write_statsfiles() even at
emergency bailout case, before exiting. It's not necessary to save
the stats to the file in that case because subsequent server startup does
crash recovery and clears that stats file. So it's better to make
the collector exit immediately without calling pgstat_write_statsfiles()
at emergency bailout case? Probably this should be discussed in other
thread because it's different topic from the feature we're discussing here,
though.
IIUC, only the stats collector has another hander for SIGQUIT although
other background processes have a common hander for it and just call _exit(2).
I thought to guarantee when TerminateChildren(SIGTERM) is invoked, don't make stats
collector shutdown before other background processes are shutdown.

I will make another thread to discuss that the stats collector should know the shutdown type or not.
If it should be, it's better to make the stats collector exit as soon as possible if the shutdown type
is an immediate, and avoid losing the remaining stats if it's normal.

I measured the timing of the above in my linux laptop using v17-measure-timing.patch.
I don't have any strong opinion to handle this case since this result shows to receive and processes
the messages takes too short time (less than 1ms) although the stats collector receives the shutdown
signal in 5msec(099->104) after the checkpointer process exits.

Agreed.

```
1615421204.556 [checkpointer] DEBUG: received shutdown request signal
1615421208.099 [checkpointer] DEBUG: proc_exit(-1): 0 callbacks to make              # exit and send the messages
1615421208.099 [stats collector] DEBUG: process BGWRITER stats message              # receive and process the messages
1615421208.099 [stats collector] DEBUG: process WAL stats message
1615421208.104 [postmaster] DEBUG: reaping dead processes
1615421208.104 [stats collector] DEBUG: received shutdown request signal             # receive shutdown request from the postmaster
```

Of course, there is another direction; we can improve the stats collector so
that it guarantees to collect all the sent stats messages. But I'm afraid
this change might be big.

For example, implement to manage background process status in shared memory and
the stats collector collects the stats until another background process exits?

In my understanding, the statistics are not required high accuracy,
it's ok to ignore them if the impact is not big.

If we guarantee high accuracy, another background process like autovacuum launcher
must send the WAL stats because it accesses the system catalog and might generate
WAL records due to HOT update even though the possibility is low.

I thought the impact is small because the time uncollected stats are generated is
short compared to the time from startup. So, it's ok to ignore the remaining stats
when the process exists.

I agree that it's not worth changing lots of code to collect such stats.
But if we can implement that very simply, isn't it more worth doing
that than current situation because we may be able to collect more
accurate stats.

Yes, I agree.
I attached the patch to send the stats before the wal writer and the checkpointer exit.
(v17-0001-send-stats-for-walwriter-when-shutdown.patch, v17-0002-send-stats-for-checkpointer-when-shutdown.patch)

Thanks for making those patches! Firstly I'm reading 0001 and 0002 patches.

Thanks for your comments and for making patches.
Here is the review comments for 0001 patch.
+/* Prototypes for private functions */
+static void HandleWalWriterInterrupts(void);
HandleWalWriterInterrupts() and HandleMainLoopInterrupts() are almost the same.
So I don't think that we need to introduce HandleWalWriterInterrupts(). Instead,
we can just call pgstat_send_wal(true) before HandleMainLoopInterrupts()
if ShutdownRequestPending is true in the main loop. Attached is the patch
I implemented that way. Thought?
I thought there is a corner case that can't send the stats like

```
// First, ShutdownRequstPending = false

    if (ShutdownRequestPending)    // don't send the stats
        pgstat_send_wal(true);

// receive signal and ShutdownRequestPending became true

    HandleMainLoopInterrupts();   // proc exit without sending the stats

```

Is it ok because it almost never occurs?

Here is the review comments for 0002 patch.

+static void pgstat_send_checkpointer(void);

I'm inclined to avoid adding the function with the prefix "pgstat_" outside
pgstat.c. Instead, I'm ok to just call both pgstat_send_bgwriter() and
pgstat_report_wal() directly after ShutdownXLOG(). Thought? Patch attached.

Thanks. I agree.

BTW, I found BgWriterStats.m_timed_checkpoints is not counted in ShutdownLOG()
and we need to count it if to collect stats before it exits.

Maybe m_requested_checkpoints should be incremented in that case?

I thought this should be incremented
because it invokes the methods with CHECKPOINT_IS_SHUTDOWN.

Yes.

OK, thanks.

```ShutdownXLOG()
   CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
   CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
```

I fixed in v17-0002-send-stats-for-checkpointer-when-shutdown.patch.

In addition, I rebased the patch for WAL receiver.
(v17-0003-Makes-the-wal-receiver-report-WAL-statistics.patch)

Thanks! Will review this later.

Thanks a lot!

I read through the 0003 patch. Here are some comments for that.

With the patch, walreceiver's stats are counted as wal_write, wal_sync, wal_write_time and wal_sync_time in pg_stat_wal. But they should be counted as different columns because WAL IO is different between walreceiver and other processes like a backend? For example, open_sync or open_datasync is chosen as wal_sync_method, those other processes use O_DIRECT flag to open WAL files, but walreceiver does not. For example, those other procesess write WAL data in block units, but walreceiver does not. So I'm concerned that mixing different WAL IO stats in the same columns would confuse the users. Thought? I'd like to hear more opinions about how to expose walreceiver's stats to users.

+int
+XLogWriteFile(int fd, const void *buf, size_t nbyte, off_t offset)

This common function writes WAL data and measures IO timing. IMO we can refactor the code furthermore by making this function handle the case where pg_write() reports an error. In other words, I think that the function should do what do-while loop block in XLogWrite() does. Thought?

BTW, currently XLogWrite() increments IO timing even when pg_pwrite() reports an error. But this is useless. Probably IO timing should be incremented after the return code of pg_pwrite() is checked, instead?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#58

Fujii Masao

masao.fujii@oss.nttdata.com

almost 5 years ago

In reply to: Fujii Masao (#56)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021/03/11 23:33, Fujii Masao wrote:

On 2021/03/11 21:29, Masahiro Ikeda wrote:
On 2021-03-11 11:52, Fujii Masao wrote:
On 2021/03/11 9:38, Masahiro Ikeda wrote:
On 2021-03-10 17:08, Fujii Masao wrote:
On 2021/03/10 14:11, Masahiro Ikeda wrote:
On 2021-03-09 17:51, Fujii Masao wrote:
On 2021/03/05 8:38, Masahiro Ikeda wrote:
On 2021-03-05 01:02, Fujii Masao wrote:
On 2021/03/04 16:14, Masahiro Ikeda wrote:
On 2021-03-03 20:27, Masahiro Ikeda wrote:
On 2021-03-03 16:30, Fujii Masao wrote:
On 2021/03/03 14:33, Masahiro Ikeda wrote:

On 2021-02-24 16:14, Fujii Masao wrote:

On 2021/02/15 11:59, Masahiro Ikeda wrote:

On 2021-02-10 00:51, David G. Johnston wrote:

On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

I pgindented the patches.

... <function>XLogWrite</function>, which is invoked during an
<function>XLogFlush</function> request (see ...). This is also
incremented by the WAL receiver during replication.

("which normally called" should be "which is normally called" or
"which normally is called" if you want to keep true to the original)
You missed the adding the space before an opening parenthesis here and
elsewhere (probably copy-paste)

is ether -> is either
"This parameter is off by default as it will repeatedly query the
operating system..."
", because" -> "as"

Thanks, I fixed them.

wal_write_time and the sync items also need the note: "This is also
incremented by the WAL receiver during replication."

I skipped changing it since I separated the stats for the WAL receiver
in pg_stat_wal_receiver.

"The number of times it happened..." -> " (the tally of this event is
reported in wal_buffers_full in....) This is undesirable because ..."

Thanks, I fixed it.

I notice that the patch for WAL receiver doesn't require explicitly
computing the sync statistics but does require computing the write
statistics. This is because of the presence of issue_xlog_fsync but
absence of an equivalent pg_xlog_pwrite. Additionally, I observe that
the XLogWrite code path calls pgstat_report_wait_*() while the WAL
receiver path does not. It seems technically straight-forward to
refactor here to avoid the almost-duplicated logic in the two places,
though I suspect there may be a trade-off for not adding another
function call to the stack given the importance of WAL processing
(though that seems marginalized compared to the cost of actually
writing the WAL). Or, as Fujii noted, go the other way and don't have
any shared code between the two but instead implement the WAL receiver
one to use pg_stat_wal_receiver instead. In either case, this
half-and-half implementation seems undesirable.

OK, as Fujii-san mentioned, I separated the WAL receiver stats.
(v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)

Thanks for updating the patches!

I added the infrastructure code to communicate the WAL receiver stats messages between the WAL receiver and the stats collector, and
the stats for WAL receiver is counted in pg_stat_wal_receiver.
What do you think?

On second thought, this idea seems not good. Because those stats are
collected between multiple walreceivers, but other values in
pg_stat_wal_receiver is only related to the walreceiver process running
at that moment. IOW, it seems strange that some values show dynamic
stats and the others show collected stats, even though they are in
the same view pg_stat_wal_receiver. Thought?

OK, I fixed it.
The stats collected in the WAL receiver is exposed in pg_stat_wal view in v11 patch.

Thanks for updating the patches! I'm now reading 001 patch.
+    /* Check whether the WAL file was synced to disk right now */
+    if (enableFsync &&
+        (sync_method == SYNC_METHOD_FSYNC ||
+         sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
+         sync_method == SYNC_METHOD_FDATASYNC))
+    {
Isn't it better to make issue_xlog_fsync() return immediately
if enableFsync is off, sync_method is open_sync or open_data_sync,
to simplify the code more?
Thanks for the comments.
I added the above code in v12 patch.
+        /*
+         * Send WAL statistics only if WalWriterDelay has elapsed to minimize
+         * the overhead in WAL-writing.
+         */
+        if (rc & WL_TIMEOUT)
+            pgstat_send_wal();
On second thought, this change means that it always takes wal_writer_delay
before walwriter's WAL stats is sent after XLogBackgroundFlush() is called.
For example, if wal_writer_delay is set to several seconds, some values in
pg_stat_wal would be not up-to-date meaninglessly for those seconds.
So I'm thinking to withdraw my previous comment and it's ok to send
the stats every after XLogBackgroundFlush() is called. Thought?
Thanks, I didn't notice that.

Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's
default value is 200msec and it may be set shorter time.
Yeah, if wal_writer_delay is set to very small value, there is a risk
that the WAL stats are sent too frequently. I agree that's a problem.
Why don't to make another way to check the timestamp?
+               /*
+                * Don't send a message unless it's been at least
PGSTAT_STAT_INTERVAL
+                * msec since we last sent one
+                */
+               now = GetCurrentTimestamp();
+               if (TimestampDifferenceExceeds(last_report, now,
PGSTAT_STAT_INTERVAL))
+               {
+                       pgstat_send_wal();
+                       last_report = now;
+               }
+
Although I worried that it's better to add the check code in pgstat_send_wal(),
Agreed.

I didn't do so because to avoid to double check PGSTAT_STAT_INTERVAL.
pgstat_send_wal() is invoked pg_report_stat() and it already checks the
PGSTAT_STAT_INTERVAL.

I think that we can do that. What about the attached patch?
Thanks, I thought it's better.
I forgot to remove an unused variable.
The attached v13 patch is fixed.

Thanks for updating the patch!
+        w.wal_write,
+        w.wal_write_time,
+        w.wal_sync,
+        w.wal_sync_time,
It's more natural to put wal_write_time and wal_sync_time next to
each other? That is, what about the following order of columns?

wal_write
wal_sync
wal_write_time
wal_sync_time
Yes, I fixed it.
-        case SYNC_METHOD_OPEN:
-        case SYNC_METHOD_OPEN_DSYNC:
-            /* write synced it already */
-            break;

IMO it's better to add Assert(false) here to ensure that we never reach
here, as follows. Thought?
+        case SYNC_METHOD_OPEN:
+        case SYNC_METHOD_OPEN_DSYNC:
+            /* not reachable */
+            Assert(false);
I agree.

Even when a backend exits, it sends the stats via pgstat_beshutdown_hook().
On the other hand, walwriter doesn't do that. Walwriter also should send
the stats even at its exit? Otherwise some stats can fail to be collected.
But ISTM that this issue existed from before, for example checkpointer
doesn't call pgstat_send_bgwriter() at its exit, so it's overkill to fix
this issue in this patch?

Thanks, I thought it's better to do so.
I added the shutdown hook for the walwriter and the checkpointer in v14-0003 patch.
Thanks for 0003 patch!

Isn't it overkill to send the stats in the walwriter-exit-callback? IMO we can
just send the stats only when ShutdownRequestPending is true in the walwriter
main loop (maybe just before calling HandleMainLoopInterrupts()).
If we do this, we cannot send the stats when walwriter throws FATAL error.
But that's ok because FATAL error on walwriter causes the server to crash.
Thought?
Thanks for your comments!
Yes, I agree.

Also ISTM that we don't need to use the callback for that purpose in
checkpointer because of the same reason. That is, we can send the stats
just after calling ShutdownXLOG(0, 0) in HandleCheckpointerInterrupts().
Thought?

Yes, I think so too.

Since ShutdownXLOG() may create restartpoint or checkpoint,
it might generate WAL records.

I'm now not sure how much useful these changes are. As far as I read pgstat.c,
when shutdown is requested, the stats collector seems to exit even when
there are outstanding stats messages. So if checkpointer and walwriter send
the stats in their last cycles, those stats might not be collected.

On the other hand, I can think that sending the stats in the last cycles would
improve the situation a bit than now. So I'm inclined to apply those changes...

I didn't notice that. I agree this is an important aspect.
I understood there is a case that the stats collector exits before the checkpointer
or the walwriter exits and some stats might not be collected.
IIUC the stats collector basically exits after checkpointer and walwriter exit.
But there seems no guarantee that the stats collector processes
all the messages that other processes have sent during the shutdown of
the server.
Thanks, I understood the above postmaster behaviors.

PMState manages the status and after checkpointer is exited, the postmaster sends
SIGQUIT signal to the stats collector if the shutdown mode is smart or fast.
(IIUC, although the postmaster kill the walsender, the archiver and
the stats collector at the same time, it's ok because the walsender
and the archiver doesn't send stats to the stats collector now.)

But, there might be a corner case to lose stats sent by background workers like
the checkpointer before they exit (although this is not implemented yet.)

For example,

1. checkpointer send the stats before it exit
2. stats collector receive the signal and break before processing
the stats message from checkpointer. In this case, 1's message is lost.
3. stats collector writes the stats in the statsfiles and exit

Why don't you recheck the coming message is zero just before the 2th procedure?
(v17-0004-guarantee-to-collect-last-stats-messages.patch)
Yes, I was thinking the same. This is the straight-forward fix for this issue.
The stats collector should process all the outstanding messages when
normal shutdown is requested, as the patch does. On the other hand,
if immediate shutdown is requested or emergency bailout (by postmaster death)
is requested, maybe the stats collector should skip those processings
and exit immediately.

But if we implement that, we would need to teach the stats collector
the shutdown type (i.e., normal shutdown or immediate one). Because
currently SIGQUIT is sent to the collector whichever shutdown is requested,
and so the collector cannot distinguish the shutdown type. I'm afraid that
change is a bit overkill for now.

BTW, I found that the collector calls pgstat_write_statsfiles() even at
emergency bailout case, before exiting. It's not necessary to save
the stats to the file in that case because subsequent server startup does
crash recovery and clears that stats file. So it's better to make
the collector exit immediately without calling pgstat_write_statsfiles()
at emergency bailout case? Probably this should be discussed in other
thread because it's different topic from the feature we're discussing here,
though.
IIUC, only the stats collector has another hander for SIGQUIT although
other background processes have a common hander for it and just call _exit(2).
I thought to guarantee when TerminateChildren(SIGTERM) is invoked, don't make stats
collector shutdown before other background processes are shutdown.

I will make another thread to discuss that the stats collector should know the shutdown type or not.
If it should be, it's better to make the stats collector exit as soon as possible if the shutdown type
is an immediate, and avoid losing the remaining stats if it's normal.
+1
I measured the timing of the above in my linux laptop using v17-measure-timing.patch.
I don't have any strong opinion to handle this case since this result shows to receive and processes
the messages takes too short time (less than 1ms) although the stats collector receives the shutdown
signal in 5msec(099->104) after the checkpointer process exits.

Agreed.

```
1615421204.556 [checkpointer] DEBUG: received shutdown request signal
1615421208.099 [checkpointer] DEBUG: proc_exit(-1): 0 callbacks to make              # exit and send the messages
1615421208.099 [stats collector] DEBUG: process BGWRITER stats message              # receive and process the messages
1615421208.099 [stats collector] DEBUG: process WAL stats message
1615421208.104 [postmaster] DEBUG: reaping dead processes
1615421208.104 [stats collector] DEBUG: received shutdown request signal             # receive shutdown request from the postmaster
```

Of course, there is another direction; we can improve the stats collector so
that it guarantees to collect all the sent stats messages. But I'm afraid
this change might be big.

For example, implement to manage background process status in shared memory and
the stats collector collects the stats until another background process exits?

In my understanding, the statistics are not required high accuracy,
it's ok to ignore them if the impact is not big.

If we guarantee high accuracy, another background process like autovacuum launcher
must send the WAL stats because it accesses the system catalog and might generate
WAL records due to HOT update even though the possibility is low.

I thought the impact is small because the time uncollected stats are generated is
short compared to the time from startup. So, it's ok to ignore the remaining stats
when the process exists.

I agree that it's not worth changing lots of code to collect such stats.
But if we can implement that very simply, isn't it more worth doing
that than current situation because we may be able to collect more
accurate stats.

Yes, I agree.
I attached the patch to send the stats before the wal writer and the checkpointer exit.
(v17-0001-send-stats-for-walwriter-when-shutdown.patch, v17-0002-send-stats-for-checkpointer-when-shutdown.patch)

Thanks for making those patches! Firstly I'm reading 0001 and 0002 patches.

Thanks for your comments and for making patches.
Here is the review comments for 0001 patch.
+/* Prototypes for private functions */
+static void HandleWalWriterInterrupts(void);
HandleWalWriterInterrupts() and HandleMainLoopInterrupts() are almost the same.
So I don't think that we need to introduce HandleWalWriterInterrupts(). Instead,
we can just call pgstat_send_wal(true) before HandleMainLoopInterrupts()
if ShutdownRequestPending is true in the main loop. Attached is the patch
I implemented that way. Thought?
I thought there is a corner case that can't send the stats like
You're right! So IMO your patch (v17-0001-send-stats-for-walwriter-when-shutdown.patch) is better.

```
// First, ShutdownRequstPending = false

     if (ShutdownRequestPending)    // don't send the stats
         pgstat_send_wal(true);

// receive signal and ShutdownRequestPending became true

     HandleMainLoopInterrupts();   // proc exit without sending the stats

```

Is it ok because it almost never occurs?

Here is the review comments for 0002 patch.

+static void pgstat_send_checkpointer(void);

I'm inclined to avoid adding the function with the prefix "pgstat_" outside
pgstat.c. Instead, I'm ok to just call both pgstat_send_bgwriter() and
pgstat_report_wal() directly after ShutdownXLOG(). Thought? Patch attached.

Thanks. I agree.

Thanks for the review!

So, barring any objection, I will commit the changes for
walwriter and checkpointer. That is,

v17-0001-send-stats-for-walwriter-when-shutdown.patch
v17-0002-send-stats-for-checkpointer-when-shutdown_fujii.patch

I pushed these two patches.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#59

Masahiro Ikeda

ikedamsh@oss.nttdata.com

almost 5 years ago

In reply to: Fujii Masao (#57)

2 attachment(s)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021-03-12 12:39, Fujii Masao wrote:

On 2021/03/11 21:29, Masahiro Ikeda wrote:

In addition, I rebased the patch for WAL receiver.
(v17-0003-Makes-the-wal-receiver-report-WAL-statistics.patch)

Thanks! Will review this later.

Thanks a lot!

I read through the 0003 patch. Here are some comments for that.

With the patch, walreceiver's stats are counted as wal_write,
wal_sync, wal_write_time and wal_sync_time in pg_stat_wal. But they
should be counted as different columns because WAL IO is different
between walreceiver and other processes like a backend? For example,
open_sync or open_datasync is chosen as wal_sync_method, those other
processes use O_DIRECT flag to open WAL files, but walreceiver does
not. For example, those other procesess write WAL data in block units,
but walreceiver does not. So I'm concerned that mixing different WAL
IO stats in the same columns would confuse the users. Thought? I'd
like to hear more opinions about how to expose walreceiver's stats to
users.

Thanks, I understood get_sync_bit() checks the sync flags and
the write unit of generated wal data and replicated wal data is
different.
(It's interesting optimization whether to use kernel cache or not.)

OK. Although I agree to separate the stats for the walrecever,
I want to hear opinions from other people too. I didn't change the
patch.

Please feel to your comments.

+int
+XLogWriteFile(int fd, const void *buf, size_t nbyte, off_t offset)
This common function writes WAL data and measures IO timing. IMO we
can refactor the code furthermore by making this function handle the
case where pg_write() reports an error. In other words, I think that
the function should do what do-while loop block in XLogWrite() does.
Thought?

OK. I agree.

I wonder to change the error check ways depending on who calls this
function?
Now, only the walreceiver checks (1)errno==0 and doesn't check
(2)errno==ENITR.
Other processes are the opposite.

IIUC, it's appropriate that every process checks (1)(2).
Please let me know my understanding is wrong.

BTW, currently XLogWrite() increments IO timing even when pg_pwrite()
reports an error. But this is useless. Probably IO timing should be
incremented after the return code of pg_pwrite() is checked, instead?

Yes, I agree. I fixed it.
(v18-0003-Makes-the-wal-receiver-report-WAL-statistics.patch)

BTW, thanks for your comments in person that the bgwriter process will
generate wal data.
I checked that it generates the WAL to take a snapshot via
LogStandySnapshot().
I attached the patch for bgwriter to send the wal stats.
(v18-0005-send-stats-for-bgwriter-when-shutdown.patch)

This patch includes the following changes.

(1) introduce pgstat_send_bgwriter() the mechanism to send the stats
if PGSTAT_STAT_INTERVAL msec has passed like pgstat_send_wal()
to avoid overloading to stats collector because "bgwriter_delay"
can be set for 10msec or more.

(2) remove pgstat_report_wal() and integrate with pgstat_send_wal()
because bgwriter sends WalStats.m_wal_records and to avoid
overloading (see (1)).
IIUC, although the pros to separate them is to reduce the
calculation cost of
WalUsageAccumDiff(), the impact is limited.

(3) make a new signal handler for bgwriter to force sending remaining
stats during shutdown
because of (1) and remove HandleMainLoopInterrupts() because there
are no processes to use it.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachments:

v18-0003-Makes-the-wal-receiver-report-WAL-statistics.patchtext/x-diff; name=v18-0003-Makes-the-wal-receiver-report-WAL-statistics.patchDownload

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 9b2eb0d10b..c7bda9127f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2536,7 +2536,6 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			int			written;
-			instr_time	start;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
@@ -2544,49 +2543,9 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			nleft = nbytes;
 			do
 			{
-				errno = 0;
+				written = XLogWriteFile(openLogFile, from, nleft, (off_t) startoffset,
+										ThisTimeLineID, openLogSegNo, wal_segment_size);
 
-				/* Measure I/O timing to write WAL data */
-				if (track_wal_io_timing)
-					INSTR_TIME_SET_CURRENT(start);
-
-				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
-				written = pg_pwrite(openLogFile, from, nleft, startoffset);
-				pgstat_report_wait_end();
-
-				/*
-				 * Increment the I/O timing and the number of times WAL data
-				 * were written out to disk.
-				 */
-				if (track_wal_io_timing)
-				{
-					instr_time	duration;
-
-					INSTR_TIME_SET_CURRENT(duration);
-					INSTR_TIME_SUBTRACT(duration, start);
-					WalStats.m_wal_write_time += INSTR_TIME_GET_MICROSEC(duration);
-				}
-
-				WalStats.m_wal_write++;
-
-				if (written <= 0)
-				{
-					char		xlogfname[MAXFNAMELEN];
-					int			save_errno;
-
-					if (errno == EINTR)
-						continue;
-
-					save_errno = errno;
-					XLogFileName(xlogfname, ThisTimeLineID, openLogSegNo,
-								 wal_segment_size);
-					errno = save_errno;
-					ereport(PANIC,
-							(errcode_for_file_access(),
-							 errmsg("could not write to log file %s "
-									"at offset %u, length %zu: %m",
-									xlogfname, startoffset, nleft)));
-				}
 				nleft -= written;
 				from += written;
 				startoffset += written;
@@ -2707,6 +2666,82 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 	}
 }
 
+/*
+ * Issue pg_pwrite to write an WAL segment file.
+ *
+ * 'fd' is a file descriptor for the XLOG file to write
+ * 'buf' is a buffer starting address to write.
+ * 'nbyte' is a number of max bytes to write up.
+ * 'offset' is a offset of XLOG file to be set.
+ * 'timelineid' is a timeline ID of WAL segment file.
+ * 'segno' is a segment number of WAL segment file.
+ * 'segsize' is a segment size of WAL segment file.
+ *
+ * Return the number of bytes written.
+ */
+int
+XLogWriteFile(int fd, const void *buf, size_t nbyte, off_t offset,
+			  TimeLineID timelineid, XLogSegNo segno, int segsize)
+{
+	/*
+	 * Loop until to write the buffer data or an error occurred.
+	 */
+	for (;;)
+	{
+		int			written;
+		instr_time	start;
+
+		errno = 0;
+
+		/* Measure I/O timing to write WAL data */
+		if (track_wal_io_timing)
+			INSTR_TIME_SET_CURRENT(start);
+
+		pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
+		written = pg_pwrite(fd, buf, nbyte, offset);
+		pgstat_report_wait_end();
+
+		if (written <= 0)
+		{
+			char		xlogfname[MAXFNAMELEN];
+			int			save_errno;
+
+			if (errno == EINTR)
+				continue;
+
+			/* if write didn't set errno, assume no disk space */
+			if (errno == 0)
+				errno = ENOSPC;
+
+			save_errno = errno;
+			XLogFileName(xlogfname, timelineid, segno, segsize);
+			errno = save_errno;
+			ereport(PANIC,
+					(errcode_for_file_access(),
+					 errmsg("could not write to log file %s "
+							"at offset %u, length %zu: %m",
+							xlogfname, (unsigned int) offset, (unsigned long) nbyte)));
+		}
+
+		/*
+		 * Increment the I/O timing and the number of times WAL data were
+		 * written out to disk.
+		 */
+		if (track_wal_io_timing)
+		{
+			instr_time	duration;
+
+			INSTR_TIME_SET_CURRENT(duration);
+			INSTR_TIME_SUBTRACT(duration, start);
+			WalStats.m_wal_write_time += INSTR_TIME_GET_MICROSEC(duration);
+		}
+
+		WalStats.m_wal_write++;
+
+		return written;
+	}
+}
+
 /*
  * Record the LSN for an asynchronous transaction commit/abort
  * and nudge the WALWriter if there is work for it to do.
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index a7a94d2a83..e9de78ffaa 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -771,6 +771,9 @@ WalRcvDie(int code, Datum arg)
 	/* Ensure that all WAL records received are flushed to disk */
 	XLogWalRcvFlush(true);
 
+	/* Send WAL statistics to the stats collector before terminating */
+	pgstat_send_wal(true);
+
 	/* Mark ourselves inactive in shared memory */
 	SpinLockAcquire(&walrcv->mutex);
 	Assert(walrcv->walRcvState == WALRCV_STREAMING ||
@@ -868,7 +871,7 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 static void
 XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 {
-	int			startoff;
+	uint32		startoff;
 	int			byteswritten;
 
 	while (nbytes > 0)
@@ -910,6 +913,12 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 					XLogArchiveForceDone(xlogfname);
 				else
 					XLogArchiveNotify(xlogfname);
+
+				/*
+				 * Send WAL statistics to the stats collector when finishing
+				 * the current WAL segment file to avoid overloading it.
+				 */
+				pgstat_send_wal(false);
 			}
 			recvFile = -1;
 
@@ -929,27 +938,8 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 			segbytes = nbytes;
 
 		/* OK to write the logs */
-		errno = 0;
-
-		byteswritten = pg_pwrite(recvFile, buf, segbytes, (off_t) startoff);
-		if (byteswritten <= 0)
-		{
-			char		xlogfname[MAXFNAMELEN];
-			int			save_errno;
-
-			/* if write didn't set errno, assume no disk space */
-			if (errno == 0)
-				errno = ENOSPC;
-
-			save_errno = errno;
-			XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size);
-			errno = save_errno;
-			ereport(PANIC,
-					(errcode_for_file_access(),
-					 errmsg("could not write to log segment %s "
-							"at offset %u, length %lu: %m",
-							xlogfname, startoff, (unsigned long) segbytes)));
-		}
+		byteswritten = XLogWriteFile(recvFile, buf, segbytes, (off_t) startoff,
+									 recvFileTLI, recvSegNo, wal_segment_size);
 
 		/* Update state for write */
 		recptr += byteswritten;
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 6d384d3ce6..fd478b3478 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -298,6 +298,10 @@ extern bool XLogBackgroundFlush(void);
 extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
 extern int	XLogFileInit(XLogSegNo segno, bool *use_existent, bool use_lock);
 extern int	XLogFileOpen(XLogSegNo segno);
+extern int	XLogWriteFile(int fd, const void *buf,
+						  size_t nbyte, off_t offset,
+						  TimeLineID timelineid, XLogSegNo segno,
+						  int segsize);
 
 extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
 extern XLogSegNo XLogGetLastRemovedSegno(void);

v18-0005-send-stats-for-bgwriter-when-shutdown.patchtext/x-diff; name=v18-0005-send-stats-for-bgwriter-when-shutdown.patchDownload

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 715d5195bb..7cd01b38e9 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -83,6 +83,8 @@ int			BgWriterDelay = 200;
 static TimestampTz last_snapshot_ts;
 static XLogRecPtr last_snapshot_lsn = InvalidXLogRecPtr;
 
+/* Prototypes for private functions */
+static void HandleBackgroundWriterInterrupts(void);
 
 /*
  * Main entry point for bgwriter process
@@ -236,7 +238,7 @@ BackgroundWriterMain(void)
 		/* Clear any already-pending wakeups */
 		ResetLatch(MyLatch);
 
-		HandleMainLoopInterrupts();
+		HandleBackgroundWriterInterrupts();
 
 		/*
 		 * Do one cycle of dirty-buffer writing.
@@ -244,9 +246,11 @@ BackgroundWriterMain(void)
 		can_hibernate = BgBufferSync(&wb_context);
 
 		/*
-		 * Send off activity statistics to the stats collector
+		 * Send off activity statistics to the stats collector. Since
+		 * LogStandbySnapshot() will generate the WAL, send the WAL stats too.
 		 */
-		pgstat_send_bgwriter();
+		pgstat_send_bgwriter(false);
+		pgstat_send_wal(false);
 
 		if (FirstCallSinceLastCheckpoint())
 		{
@@ -349,3 +353,36 @@ BackgroundWriterMain(void)
 		prev_hibernate = can_hibernate;
 	}
 }
+
+/*
+ * Interrupt handler for main loops of Background Writer process.
+ */
+static void
+HandleBackgroundWriterInterrupts(void)
+{
+	if (ProcSignalBarrierPending)
+		ProcessProcSignalBarrier();
+
+	if (ConfigReloadPending)
+	{
+		ConfigReloadPending = false;
+		ProcessConfigFile(PGC_SIGHUP);
+	}
+
+	if (ShutdownRequestPending)
+	{
+		/*
+		 * Force to send remaining statistics to the stats collector at
+		 * process exit.
+		 *
+		 * Since pgstat_send_bgwriter and pgstat_send_wal are invoked with
+		 * 'force' is false in main loop to avoid overloading to the stats
+		 * collector, there may exist unsent stats counters for the background
+		 * writer.
+		 */
+		pgstat_send_bgwriter(true);
+		pgstat_send_wal(true);
+
+		proc_exit(0);
+	}
+}
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 94b55a17b3..c5f197434c 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -502,10 +502,10 @@ CheckpointerMain(void)
 		 * worth the trouble to split the stats support into two independent
 		 * stats message types.)
 		 */
-		pgstat_send_bgwriter();
+		pgstat_send_bgwriter(true);
 
 		/* Send WAL statistics to the stats collector. */
-		pgstat_report_wal();
+		pgstat_send_wal(true);
 
 		/*
 		 * If any checkpoint flags have been set, redo the loop to handle the
@@ -582,8 +582,8 @@ HandleCheckpointerInterrupts(void)
 		 */
 		BgWriterStats.m_requested_checkpoints++;
 		ShutdownXLOG(0, 0);
-		pgstat_send_bgwriter();
-		pgstat_report_wal();
+		pgstat_send_bgwriter(true);
+		pgstat_send_wal(true);
 
 		/* Normal exit from the checkpointer is here */
 		proc_exit(0);			/* done */
@@ -724,7 +724,7 @@ CheckpointWriteDelay(int flags, double progress)
 		/*
 		 * Report interim activity statistics to the stats collector.
 		 */
-		pgstat_send_bgwriter();
+		pgstat_send_bgwriter(true);
 
 		/*
 		 * This sleep used to be connected to bgwriter_delay, typically 200ms.
diff --git a/src/backend/postmaster/interrupt.c b/src/backend/postmaster/interrupt.c
index dd9136a942..6522cb311f 100644
--- a/src/backend/postmaster/interrupt.c
+++ b/src/backend/postmaster/interrupt.c
@@ -26,31 +26,12 @@
 volatile sig_atomic_t ConfigReloadPending = false;
 volatile sig_atomic_t ShutdownRequestPending = false;
 
-/*
- * Simple interrupt handler for main loops of background processes.
- */
-void
-HandleMainLoopInterrupts(void)
-{
-	if (ProcSignalBarrierPending)
-		ProcessProcSignalBarrier();
-
-	if (ConfigReloadPending)
-	{
-		ConfigReloadPending = false;
-		ProcessConfigFile(PGC_SIGHUP);
-	}
-
-	if (ShutdownRequestPending)
-		proc_exit(0);
-}
-
 /*
  * Simple signal handler for triggering a configuration reload.
  *
  * Normally, this handler would be used for SIGHUP. The idea is that code
  * which uses it would arrange to check the ConfigReloadPending flag at
- * convenient places inside main loops, or else call HandleMainLoopInterrupts.
+ * convenient places inside main loops.
  */
 void
 SignalHandlerForConfigReload(SIGNAL_ARGS)
@@ -98,7 +79,7 @@ SignalHandlerForCrashExit(SIGNAL_ARGS)
  * or SIGTERM.
  *
  * ShutdownRequestPending should be checked at a convenient place within the
- * main loop, or else the main loop should call HandleMainLoopInterrupts.
+ * main loop.
  */
 void
 SignalHandlerForShutdownRequest(SIGNAL_ARGS)
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index b1e2d94951..de16696dc6 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -146,8 +146,8 @@ PgStat_MsgWal WalStats;
 
 /*
  * WAL usage counters saved from pgWALUsage at the previous call to
- * pgstat_report_wal(). This is used to calculate how much WAL usage
- * happens between pgstat_report_wal() calls, by substracting
+ * pgstat_send_wal(). This is used to calculate how much WAL usage
+ * happens between pgstat_send_wal() calls, by substracting
  * the previous counters from the current ones.
  */
 static WalUsage prevWalUsage;
@@ -975,7 +975,7 @@ pgstat_report_stat(bool disconnect)
 	pgstat_send_funcstats();
 
 	/* Send WAL statistics */
-	pgstat_report_wal();
+	pgstat_send_wal(true);
 
 	/* Finally send SLRU statistics */
 	pgstat_send_slru();
@@ -3118,7 +3118,7 @@ pgstat_initialize(void)
 	}
 
 	/*
-	 * Initialize prevWalUsage with pgWalUsage so that pgstat_report_wal() can
+	 * Initialize prevWalUsage with pgWalUsage so that pgstat_send_wal() can
 	 * calculate how much pgWalUsage counters are increased by substracting
 	 * prevWalUsage from pgWalUsage.
 	 */
@@ -4643,13 +4643,19 @@ pgstat_send_archiver(const char *xlog, bool failed)
  * pgstat_send_bgwriter() -
  *
  *		Send bgwriter statistics to the collector
+ *
+ * If 'force' is not set, bgwriter stats message is only sent if enough time has
+ * passed since last one was sent to reach PGSTAT_STAT_INTERVAL.
+ *
+ * Return true if the message is sent, and false otherwise.
  * ----------
  */
 void
-pgstat_send_bgwriter(void)
+pgstat_send_bgwriter(bool force)
 {
 	/* We assume this initializes to zeroes */
 	static const PgStat_MsgBgWriter all_zeroes;
+	static TimestampTz sendTime = 0;
 
 	/*
 	 * This function can be called even if nothing at all has happened. In
@@ -4659,6 +4665,19 @@ pgstat_send_bgwriter(void)
 	if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
 		return;
 
+	if (!force)
+	{
+		TimestampTz now = GetCurrentTimestamp();
+
+		/*
+		 * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
+		 * msec since we last sent one.
+		 */
+		if (!TimestampDifferenceExceeds(sendTime, now, PGSTAT_STAT_INTERVAL))
+			return;
+		sendTime = now;
+	}
+
 	/*
 	 * Prepare and send the message
 	 */
@@ -4672,19 +4691,25 @@ pgstat_send_bgwriter(void)
 }
 
 /* ----------
- * pgstat_report_wal() -
+ * pgstat_send_wal() -
  *
- * Calculate how much WAL usage counters are increased and send
- * WAL statistics to the collector.
+ *	Send WAL statistics to the collector.
  *
- * Must be called by processes that generate WAL.
+ * If 'force' is not set, WAL stats message is only sent if enough time has
+ * passed since last one was sent to reach PGSTAT_STAT_INTERVAL.
+ *
+ * Return true if the message is sent, and false otherwise.
  * ----------
  */
 void
-pgstat_report_wal(void)
+pgstat_send_wal(bool force)
 {
 	WalUsage	walusage;
 
+	/* We assume this initializes to zeroes */
+	static const PgStat_MsgWal all_zeroes;
+	static TimestampTz sendTime = 0;
+
 	/*
 	 * Calculate how much WAL usage counters are increased by substracting the
 	 * previous counters from the current ones. Fill the results in WAL stats
@@ -4697,43 +4722,13 @@ pgstat_report_wal(void)
 	WalStats.m_wal_fpi = walusage.wal_fpi;
 	WalStats.m_wal_bytes = walusage.wal_bytes;
 
-	/*
-	 * Send WAL stats message to the collector.
-	 */
-	if (!pgstat_send_wal(true))
-		return;
-
-	/*
-	 * Save the current counters for the subsequent calculation of WAL usage.
-	 */
-	prevWalUsage = pgWalUsage;
-}
-
-/* ----------
- * pgstat_send_wal() -
- *
- *	Send WAL statistics to the collector.
- *
- * If 'force' is not set, WAL stats message is only sent if enough time has
- * passed since last one was sent to reach PGSTAT_STAT_INTERVAL.
- *
- * Return true if the message is sent, and false otherwise.
- * ----------
- */
-bool
-pgstat_send_wal(bool force)
-{
-	/* We assume this initializes to zeroes */
-	static const PgStat_MsgWal all_zeroes;
-	static TimestampTz sendTime = 0;
-
 	/*
 	 * This function can be called even if nothing at all has happened. In
 	 * this case, avoid sending a completely empty message to the stats
 	 * collector.
 	 */
 	if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
-		return false;
+		return;
 
 	if (!force)
 	{
@@ -4744,7 +4739,7 @@ pgstat_send_wal(bool force)
 		 * msec since we last sent one.
 		 */
 		if (!TimestampDifferenceExceeds(sendTime, now, PGSTAT_STAT_INTERVAL))
-			return false;
+			return;
 		sendTime = now;
 	}
 
@@ -4759,7 +4754,10 @@ pgstat_send_wal(bool force)
 	 */
 	MemSet(&WalStats, 0, sizeof(WalStats));
 
-	return true;
+	/*
+	 * Save the current counters for the subsequent calculation of WAL usage.
+	 */
+	prevWalUsage = pgWalUsage;
 }
 
 /* ----------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index be43c04802..993b774bb9 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1600,9 +1600,8 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
 									  void *recdata, uint32 len);
 
 extern void pgstat_send_archiver(const char *xlog, bool failed);
-extern void pgstat_send_bgwriter(void);
-extern void pgstat_report_wal(void);
-extern bool pgstat_send_wal(bool force);
+extern void pgstat_send_bgwriter(bool force);
+extern void pgstat_send_wal(bool force);
 
 /* ----------
  * Support functions for the SQL-callable functions to
diff --git a/src/include/postmaster/interrupt.h b/src/include/postmaster/interrupt.h
index 85a1293ef1..0d333b819a 100644
--- a/src/include/postmaster/interrupt.h
+++ b/src/include/postmaster/interrupt.h
@@ -24,7 +24,6 @@
 extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 extern PGDLLIMPORT volatile sig_atomic_t ShutdownRequestPending;
 
-extern void HandleMainLoopInterrupts(void);
 extern void SignalHandlerForConfigReload(SIGNAL_ARGS);
 extern void SignalHandlerForCrashExit(SIGNAL_ARGS);
 extern void SignalHandlerForShutdownRequest(SIGNAL_ARGS);

#60

Masahiro Ikeda

ikedamsh@oss.nttdata.com

almost 5 years ago

In reply to: Fujii Masao (#58)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021/03/11 21:29, Masahiro Ikeda wrote:

On 2021-03-11 11:52, Fujii Masao wrote:

On 2021/03/11 9:38, Masahiro Ikeda wrote:

On 2021-03-10 17:08, Fujii Masao wrote:

IIUC the stats collector basically exits after checkpointer and
walwriter exit.
But there seems no guarantee that the stats collector processes
all the messages that other processes have sent during the
shutdown of
the server.

Thanks, I understood the above postmaster behaviors.

PMState manages the status and after checkpointer is exited, the
postmaster sends
SIGQUIT signal to the stats collector if the shutdown mode is smart
or fast.
(IIUC, although the postmaster kill the walsender, the archiver and
the stats collector at the same time, it's ok because the walsender
and the archiver doesn't send stats to the stats collector now.)

But, there might be a corner case to lose stats sent by background
workers like
the checkpointer before they exit (although this is not implemented
yet.)

For example,

1. checkpointer send the stats before it exit
2. stats collector receive the signal and break before processing
the stats message from checkpointer. In this case, 1's message
is lost.
3. stats collector writes the stats in the statsfiles and exit

Why don't you recheck the coming message is zero just before the
2th procedure?
(v17-0004-guarantee-to-collect-last-stats-messages.patch)

Yes, I was thinking the same. This is the straight-forward fix for
this issue.
The stats collector should process all the outstanding messages when
normal shutdown is requested, as the patch does. On the other hand,
if immediate shutdown is requested or emergency bailout (by
postmaster death)
is requested, maybe the stats collector should skip those
processings
and exit immediately.

But if we implement that, we would need to teach the stats collector
the shutdown type (i.e., normal shutdown or immediate one). Because
currently SIGQUIT is sent to the collector whichever shutdown is
requested,
and so the collector cannot distinguish the shutdown type. I'm
afraid that
change is a bit overkill for now.

BTW, I found that the collector calls pgstat_write_statsfiles() even
at
emergency bailout case, before exiting. It's not necessary to save
the stats to the file in that case because subsequent server startup
does
crash recovery and clears that stats file. So it's better to make
the collector exit immediately without calling
pgstat_write_statsfiles()
at emergency bailout case? Probably this should be discussed in
other
thread because it's different topic from the feature we're
discussing here,
though.

IIUC, only the stats collector has another hander for SIGQUIT
although
other background processes have a common hander for it and just call
_exit(2).
I thought to guarantee when TerminateChildren(SIGTERM) is invoked,
don't make stats
collector shutdown before other background processes are shutdown.

I will make another thread to discuss that the stats collector should
know the shutdown type or not.
If it should be, it's better to make the stats collector exit as soon
as possible if the shutdown type
is an immediate, and avoid losing the remaining stats if it's normal.

+1

I researched the past discussion related to writing the stats files when
the immediate
shutdown is requested. And I found the following thread([1]/messages/by-id/0A3221C70F24FB45833433255569204D1F5EF25A@G01JPEXMBYT05) although
the discussion is
stopped on 12/1/2016.

IIUC, the thread's consensus are

(1) To kill the stats collector soon before writing the stats file is
needed in some case
because there is a possibility that it takes a long time until the
failover happens.
The possible reasons are that disk write speed is slow, stats files
are big, and so on.

(2) It needs to change the behavior from removing all stats files when
the startup does
crash recovery because autovacuum uses the stats.

(3) It's ok that the stats collector exit without calling
pgstat_write_statsfiles() if
the stats file is written every X minutes (using wal or another
mechanism) and startup
process can restore the stats with slightly low freshness.

(4) It needs to find the way how to handle the (2)'s stats file when
deleting on PITR
rewind or stats collector crash happens.

So, I need to ping the threads. But I don't have any idea to handle (4)
yet...

[1]: /messages/by-id/0A3221C70F24FB45833433255569204D1F5EF25A@G01JPEXMBYT05
/messages/by-id/0A3221C70F24FB45833433255569204D1F5EF25A@G01JPEXMBYT05

I measured the timing of the above in my linux laptop using
v17-measure-timing.patch.
I don't have any strong opinion to handle this case since this
result shows to receive and processes
the messages takes too short time (less than 1ms) although the
stats collector receives the shutdown
signal in 5msec(099->104) after the checkpointer process exits.

Agreed.

```
1615421204.556 [checkpointer] DEBUG: received shutdown request
signal
1615421208.099 [checkpointer] DEBUG: proc_exit(-1): 0 callbacks to
make              # exit and send the messages
1615421208.099 [stats collector] DEBUG: process BGWRITER stats
message              # receive and process the messages
1615421208.099 [stats collector] DEBUG: process WAL stats message
1615421208.104 [postmaster] DEBUG: reaping dead processes
1615421208.104 [stats collector] DEBUG: received shutdown request
signal             # receive shutdown request from the postmaster
```

Of course, there is another direction; we can improve the stats
collector so
that it guarantees to collect all the sent stats messages. But
I'm afraid
this change might be big.

For example, implement to manage background process status in
shared memory and
the stats collector collects the stats until another background
process exits?

In my understanding, the statistics are not required high
accuracy,
it's ok to ignore them if the impact is not big.

If we guarantee high accuracy, another background process like
autovacuum launcher
must send the WAL stats because it accesses the system catalog
and might generate
WAL records due to HOT update even though the possibility is low.

I thought the impact is small because the time uncollected stats
are generated is
short compared to the time from startup. So, it's ok to ignore
the remaining stats
when the process exists.

I agree that it's not worth changing lots of code to collect such
stats.
But if we can implement that very simply, isn't it more worth
doing
that than current situation because we may be able to collect more
accurate stats.

Yes, I agree.
I attached the patch to send the stats before the wal writer and
the checkpointer exit.
(v17-0001-send-stats-for-walwriter-when-shutdown.patch,
v17-0002-send-stats-for-checkpointer-when-shutdown.patch)

Thanks for making those patches! Firstly I'm reading 0001 and 0002
patches.

Thanks for your comments and for making patches.
Here is the review comments for 0001 patch.
+/* Prototypes for private functions */
+static void HandleWalWriterInterrupts(void);
HandleWalWriterInterrupts() and HandleMainLoopInterrupts() are
almost the same.
So I don't think that we need to introduce
HandleWalWriterInterrupts(). Instead,
we can just call pgstat_send_wal(true) before
HandleMainLoopInterrupts()
if ShutdownRequestPending is true in the main loop. Attached is the
patch
I implemented that way. Thought?
I thought there is a corner case that can't send the stats like
You're right! So IMO your patch
(v17-0001-send-stats-for-walwriter-when-shutdown.patch) is better.

```
// First, ShutdownRequstPending = false

     if (ShutdownRequestPending)    // don't send the stats
         pgstat_send_wal(true);

// receive signal and ShutdownRequestPending became true

     HandleMainLoopInterrupts();   // proc exit without sending the
stats

```

Is it ok because it almost never occurs?

Here is the review comments for 0002 patch.

+static void pgstat_send_checkpointer(void);

I'm inclined to avoid adding the function with the prefix "pgstat_"
outside
pgstat.c. Instead, I'm ok to just call both pgstat_send_bgwriter()
and
pgstat_report_wal() directly after ShutdownXLOG(). Thought? Patch
attached.

Thanks. I agree.

Thanks for the review!

So, barring any objection, I will commit the changes for
walwriter and checkpointer. That is,

v17-0001-send-stats-for-walwriter-when-shutdown.patch
v17-0002-send-stats-for-checkpointer-when-shutdown_fujii.patch
I pushed these two patches.

Thanks a lot!

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

#61

Fujii Masao

masao.fujii@oss.nttdata.com

almost 5 years ago

In reply to: Masahiro Ikeda (#59)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021/03/15 10:39, Masahiro Ikeda wrote:

Thanks, I understood get_sync_bit() checks the sync flags and
the write unit of generated wal data and replicated wal data is different.
(It's interesting optimization whether to use kernel cache or not.)

OK. Although I agree to separate the stats for the walrecever,
I want to hear opinions from other people too. I didn't change the patch.

Please feel to your comments.

What about applying the patch for common WAL write function like
XLogWriteFile(), separately from the patch for walreceiver's stats?
Seems the former reaches the consensus, so we can commit it firstly.
Also even only the former change is useful because which allows
walreceiver to report WALWrite wait event.

OK. I agree.

I wonder to change the error check ways depending on who calls this function?
Now, only the walreceiver checks (1)errno==0 and doesn't check (2)errno==ENITR.
Other processes are the opposite.

IIUC, it's appropriate that every process checks (1)(2).
Please let me know my understanding is wrong.

I'm thinking the same. Regarding (2), commit 79ce29c734 introduced
that code. According to the following commit log, it seems harmless
to retry on EINTR even walreceiver.

Also retry on EINTR. All signals used in the backend are flagged SA_RESTART
nowadays, so it shouldn't happen, but better to be defensive.

BTW, currently XLogWrite() increments IO timing even when pg_pwrite()
reports an error. But this is useless. Probably IO timing should be
incremented after the return code of pg_pwrite() is checked, instead?

Yes, I agree. I fixed it.
(v18-0003-Makes-the-wal-receiver-report-WAL-statistics.patch)

Thanks for the patch!

  			nleft = nbytes;
  			do
  			{
-				errno = 0;
+				written = XLogWriteFile(openLogFile, from, nleft, (off_t) startoffset,
+										ThisTimeLineID, openLogSegNo, wal_segment_size);

Can we merge this do-while loop in XLogWrite() into the loop
in XLogWriteFile()?

If we do that, ISTM that the following codes are not necessary in XLogWrite().

nleft -= written;
from += written;

+ * 'segsize' is a segment size of WAL segment file.

Since segsize is always wal_segment_size, segsize argument seems
not necessary in XLogWriteFile().

+XLogWriteFile(int fd, const void *buf, size_t nbyte, off_t offset,
+			  TimeLineID timelineid, XLogSegNo segno, int segsize)

Why did you use "const void *" instead of "char *" for *buf?

Regarding 0005 patch, I will review it later.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#62

ikedamsh

ikedamsh@oss.nttdata.com

almost 5 years ago

In reply to: Fujii Masao (#61)

2 attachment(s)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021-03-19 16:30, Fujii Masao wrote:

On 2021/03/15 10:39, Masahiro Ikeda wrote:

Thanks, I understood get_sync_bit() checks the sync flags and
the write unit of generated wal data and replicated wal data is
different.
(It's interesting optimization whether to use kernel cache or not.)

OK. Although I agree to separate the stats for the walrecever,
I want to hear opinions from other people too. I didn't change the
patch.

Please feel to your comments.

What about applying the patch for common WAL write function like
XLogWriteFile(), separately from the patch for walreceiver's stats?
Seems the former reaches the consensus, so we can commit it firstly.
Also even only the former change is useful because which allows
walreceiver to report WALWrite wait event.

Agreed. I separated the patches.

If only the former is committed, my trivial concern is that there may be
a disadvantage, but no advantage for the standby server. It may lead to
performance degradation to the wal receiver by calling
INSTR_TIME_SET_CURRENT(), but the stats can't visible for users until the
latter patch is committed.

I think it's ok because this not happening in the case to disable the
"track_wal_io_timing" in the standby server. Although some users may start the
standby server using the backup which "track_wal_io_timing" is enabled in the
primary server, they will say it's ok since the users already accept the
performance degradation in the primary server.

OK. I agree.

I wonder to change the error check ways depending on who calls this
function?
Now, only the walreceiver checks (1)errno==0 and doesn't check
(2)errno==ENITR.
Other processes are the opposite.

IIUC, it's appropriate that every process checks (1)(2).
Please let me know my understanding is wrong.

I'm thinking the same. Regarding (2), commit 79ce29c734 introduced
that code. According to the following commit log, it seems harmless
to retry on EINTR even walreceiver.

Also retry on EINTR. All signals used in the backend are flagged
SA_RESTART
nowadays, so it shouldn't happen, but better to be defensive.

Thanks, I understood.

BTW, currently XLogWrite() increments IO timing even when pg_pwrite()
reports an error. But this is useless. Probably IO timing should be
incremented after the return code of pg_pwrite() is checked, instead?

Yes, I agree. I fixed it.
(v18-0003-Makes-the-wal-receiver-report-WAL-statistics.patch)

Thanks for the patch!
nleft = nbytes;
do
{
-				errno = 0;
+				written = XLogWriteFile(openLogFile, from, nleft, (off_t) 
startoffset,
+										ThisTimeLineID, openLogSegNo, wal_segment_size);
Can we merge this do-while loop in XLogWrite() into the loop
in XLogWriteFile()?
If we do that, ISTM that the following codes are not necessary in
XLogWrite().

nleft -= written;
from += written;

OK, I fixed it.

+ * 'segsize' is a segment size of WAL segment file.

Since segsize is always wal_segment_size, segsize argument seems
not necessary in XLogWriteFile().

Right. I fixed it.

+XLogWriteFile(int fd, const void *buf, size_t nbyte, off_t offset,
+			  TimeLineID timelineid, XLogSegNo segno, int segsize)
Why did you use "const void *" instead of "char *" for *buf?

I followed the argument of pg_pwrite().
But, I think "char *" is better, so fixed it.

Regarding 0005 patch, I will review it later.

Thanks.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachments:

v19-0003-Makes-the-wal-receiver-report-WAL-statistics.patchtext/x-diff; charset=UTF-8; name=v19-0003-Makes-the-wal-receiver-report-WAL-statistics.patchDownload

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index a7a94d2a83..df028c5039 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -771,6 +771,9 @@ WalRcvDie(int code, Datum arg)
 	/* Ensure that all WAL records received are flushed to disk */
 	XLogWalRcvFlush(true);
 
+	/* Send WAL statistics to the stats collector before terminating */
+	pgstat_send_wal(true);
+
 	/* Mark ourselves inactive in shared memory */
 	SpinLockAcquire(&walrcv->mutex);
 	Assert(walrcv->walRcvState == WALRCV_STREAMING ||
@@ -910,6 +913,12 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 					XLogArchiveForceDone(xlogfname);
 				else
 					XLogArchiveNotify(xlogfname);
+
+				/*
+				 * Send WAL statistics to the stats collector when finishing
+				 * the current WAL segment file to avoid overloading it.
+				 */
+				pgstat_send_wal(false);
 			}
 			recvFile = -1;

v19-0006-merge-wal-write-function.patchtext/x-diff; charset=UTF-8; name=v19-0006-merge-wal-write-function.patchDownload

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index bd5e787e55..4c7d90f1b9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2536,61 +2536,14 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			int			written;
-			instr_time	start;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
 			nbytes = npages * (Size) XLOG_BLCKSZ;
 			nleft = nbytes;
-			do
-			{
-				errno = 0;
-
-				/* Measure I/O timing to write WAL data */
-				if (track_wal_io_timing)
-					INSTR_TIME_SET_CURRENT(start);
-
-				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
-				written = pg_pwrite(openLogFile, from, nleft, startoffset);
-				pgstat_report_wait_end();
-
-				/*
-				 * Increment the I/O timing and the number of times WAL data
-				 * were written out to disk.
-				 */
-				if (track_wal_io_timing)
-				{
-					instr_time	duration;
-
-					INSTR_TIME_SET_CURRENT(duration);
-					INSTR_TIME_SUBTRACT(duration, start);
-					WalStats.m_wal_write_time += INSTR_TIME_GET_MICROSEC(duration);
-				}
-
-				WalStats.m_wal_write++;
-
-				if (written <= 0)
-				{
-					char		xlogfname[MAXFNAMELEN];
-					int			save_errno;
-
-					if (errno == EINTR)
-						continue;
-
-					save_errno = errno;
-					XLogFileName(xlogfname, ThisTimeLineID, openLogSegNo,
-								 wal_segment_size);
-					errno = save_errno;
-					ereport(PANIC,
-							(errcode_for_file_access(),
-							 errmsg("could not write to log file %s "
-									"at offset %u, length %zu: %m",
-									xlogfname, startoffset, nleft)));
-				}
-				nleft -= written;
-				from += written;
-				startoffset += written;
-			} while (nleft > 0);
+			written = XLogWriteFile(openLogFile, from, nleft, (off_t) startoffset,
+									ThisTimeLineID, openLogSegNo, true);
+			startoffset += written;
 
 			npages = 0;
 
@@ -2707,6 +2660,94 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 	}
 }
 
+/*
+ * Issue pg_pwrite to write an WAL segment file.
+ *
+ * 'fd' is a file descriptor for the XLOG file to write
+ * 'buf' is a buffer starting address to write.
+ * 'nbyte' is a number of max bytes to write up.
+ * 'offset' is a offset of XLOG file to be set.
+ * 'timelineid' is a timeline ID of WAL segment file.
+ * 'segno' is a segment number of WAL segment file.
+ * 'write_all' is whether to write 'nbyte' exactly.
+ *
+ * Return the number of bytes written.
+ */
+int
+XLogWriteFile(int fd, char *buf, size_t nbyte, off_t offset,
+			  TimeLineID timelineid, XLogSegNo segno, bool write_all)
+{
+	int			written = 0;
+
+	/*
+	 * Loop until to write the buffer data or an error occurred.
+	 */
+	for (;;)
+	{
+		int			written_tmp;
+		instr_time	start;
+
+		errno = 0;
+
+		/* Measure I/O timing to write WAL data */
+		if (track_wal_io_timing)
+			INSTR_TIME_SET_CURRENT(start);
+
+		pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
+		written_tmp = pg_pwrite(fd, buf, nbyte, offset);
+		pgstat_report_wait_end();
+
+		if (written_tmp <= 0)
+		{
+			char		xlogfname[MAXFNAMELEN];
+			int			save_errno;
+
+			/*
+			 * Retry on EINTR. All signals used in the backend and background
+			 * processes are flagged SA_RESTART, so it shouldn't happen, but
+			 * better to be defensive.
+			 */
+			if (errno == EINTR)
+				continue;
+
+			/* if write didn't set errno, assume no disk space */
+			if (errno == 0)
+				errno = ENOSPC;
+
+			save_errno = errno;
+			XLogFileName(xlogfname, timelineid, segno, wal_segment_size);
+			errno = save_errno;
+			ereport(PANIC,
+					(errcode_for_file_access(),
+					 errmsg("could not write to log file %s "
+							"at offset %u, length %zu: %m",
+							xlogfname, (unsigned int) offset, nbyte)));
+		}
+
+		/*
+		 * Increment the I/O timing and the number of times WAL data were
+		 * written out to disk.
+		 */
+		if (track_wal_io_timing)
+		{
+			instr_time	duration;
+
+			INSTR_TIME_SET_CURRENT(duration);
+			INSTR_TIME_SUBTRACT(duration, start);
+			WalStats.m_wal_write_time += INSTR_TIME_GET_MICROSEC(duration);
+		}
+
+		WalStats.m_wal_write++;
+
+		nbyte -= written_tmp;
+		buf += written_tmp;
+		written += written_tmp;
+
+		if (!write_all || nbyte <= 0)
+			return written;
+	}
+}
+
 /*
  * Record the LSN for an asynchronous transaction commit/abort
  * and nudge the WALWriter if there is work for it to do.
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index a7a94d2a83..daf764446f 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -868,7 +868,7 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 static void
 XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 {
-	int			startoff;
+	uint32		startoff;
 	int			byteswritten;
 
 	while (nbytes > 0)
@@ -929,27 +929,8 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 			segbytes = nbytes;
 
 		/* OK to write the logs */
-		errno = 0;
-
-		byteswritten = pg_pwrite(recvFile, buf, segbytes, (off_t) startoff);
-		if (byteswritten <= 0)
-		{
-			char		xlogfname[MAXFNAMELEN];
-			int			save_errno;
-
-			/* if write didn't set errno, assume no disk space */
-			if (errno == 0)
-				errno = ENOSPC;
-
-			save_errno = errno;
-			XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size);
-			errno = save_errno;
-			ereport(PANIC,
-					(errcode_for_file_access(),
-					 errmsg("could not write to log segment %s "
-							"at offset %u, length %lu: %m",
-							xlogfname, startoff, (unsigned long) segbytes)));
-		}
+		byteswritten = XLogWriteFile(recvFile, buf, segbytes, (off_t) startoff,
+									 recvFileTLI, recvSegNo, false);
 
 		/* Update state for write */
 		recptr += byteswritten;
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 77187c12be..b562cfa4c1 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -298,6 +298,10 @@ extern bool XLogBackgroundFlush(void);
 extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
 extern int	XLogFileInit(XLogSegNo segno, bool *use_existent, bool use_lock);
 extern int	XLogFileOpen(XLogSegNo segno);
+extern int	XLogWriteFile(int fd, char *buf,
+						  size_t nbyte, off_t offset,
+						  TimeLineID timelineid, XLogSegNo segno,
+						  bool write_all);
 
 extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
 extern XLogSegNo XLogGetLastRemovedSegno(void);

#63

Fujii Masao

masao.fujii@oss.nttdata.com

almost 5 years ago

In reply to: ikedamsh (#62)

1 attachment(s)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021/03/22 9:50, ikedamsh wrote:

Agreed. I separated the patches.

If only the former is committed, my trivial concern is that there may be
a disadvantage, but no advantage for the standby server. It may lead to
performance degradation to the wal receiver by calling
INSTR_TIME_SET_CURRENT(), but the stats can't visible for users until the
latter patch is committed.

Your concern is valid, so let's polish and commit also the 0003 patch to v14.
I'm still thinking that it's better to separate wal_xxx columns into
walreceiver's and the others. But if we count even walreceiver activity on
the existing columns, regarding 0003 patch, we need to update the document?
For example, "Number of times WAL buffers were written out to disk via
XLogWrite request." should be "Number of times WAL buffers were written
out to disk via XLogWrite request and by WAL receiver process."? Maybe
we need to append some descriptions about this into "WAL configuration"
section?

I followed the argument of pg_pwrite().
But, I think "char *" is better, so fixed it.

Thanks for updating the patch!

+extern int	XLogWriteFile(int fd, char *buf,
+						  size_t nbyte, off_t offset,
+						  TimeLineID timelineid, XLogSegNo segno,
+						  bool write_all);

write_all seems not to be necessary. You added this flag for walreceiver,
I guess. But even without the argument, walreceiver seems to work expectedly.
So, what about the attached patch? I applied some cosmetic changes to the patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachments:

v20-0006-merge-wal-write-function.patchtext/plain; charset=UTF-8; name=v20-0006-merge-wal-write-function.patch; x-mac-creator=0; x-mac-type=0Download

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6f8810e149..9d8ea7edca 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2534,63 +2534,15 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 		{
 			char	   *from;
 			Size		nbytes;
-			Size		nleft;
 			int			written;
-			instr_time	start;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
 			nbytes = npages * (Size) XLOG_BLCKSZ;
-			nleft = nbytes;
-			do
-			{
-				errno = 0;
-
-				/* Measure I/O timing to write WAL data */
-				if (track_wal_io_timing)
-					INSTR_TIME_SET_CURRENT(start);
-
-				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
-				written = pg_pwrite(openLogFile, from, nleft, startoffset);
-				pgstat_report_wait_end();
-
-				/*
-				 * Increment the I/O timing and the number of times WAL data
-				 * were written out to disk.
-				 */
-				if (track_wal_io_timing)
-				{
-					instr_time	duration;
-
-					INSTR_TIME_SET_CURRENT(duration);
-					INSTR_TIME_SUBTRACT(duration, start);
-					WalStats.m_wal_write_time += INSTR_TIME_GET_MICROSEC(duration);
-				}
-
-				WalStats.m_wal_write++;
-
-				if (written <= 0)
-				{
-					char		xlogfname[MAXFNAMELEN];
-					int			save_errno;
-
-					if (errno == EINTR)
-						continue;
-
-					save_errno = errno;
-					XLogFileName(xlogfname, ThisTimeLineID, openLogSegNo,
-								 wal_segment_size);
-					errno = save_errno;
-					ereport(PANIC,
-							(errcode_for_file_access(),
-							 errmsg("could not write to log file %s "
-									"at offset %u, length %zu: %m",
-									xlogfname, startoffset, nleft)));
-				}
-				nleft -= written;
-				from += written;
-				startoffset += written;
-			} while (nleft > 0);
+			written = XLogWriteFile(openLogFile, from, nbytes, (off_t) startoffset,
+									ThisTimeLineID, openLogSegNo);
+			Assert(written == nbytes);
+			startoffset += written;
 
 			npages = 0;
 
@@ -2707,6 +2659,94 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 	}
 }
 
+/*
+ * Write the WAL data to the WAL file.
+ *
+ * 'fd' is the file descriptor for the WAL file to write WAL to. 'buf' is
+ * the starting address of the buffer storing WAL data to write.
+ * 'nbytes' is the number of bytes to write WAL data up to. 'offset'
+ * is the offset of WAL file to be set. 'tli' and 'segno' are the
+ * timeline ID and segment number of WAL file.
+ *
+ * Return the total number of bytes written. This must be the same as
+ * 'nbytes'. PANIC error is thrown if WAL data fails to be written.
+ */
+int
+XLogWriteFile(int fd, char *buf, Size nbytes, off_t offset,
+			  TimeLineID tli, XLogSegNo segno)
+{
+	Size		total_written = 0;
+
+	/*
+	 * Loop until 'nbytes' bytes of the buffer data have been written or an
+	 * error occurs.
+	 */
+	do
+	{
+		int			written;
+		instr_time	start;
+
+		errno = 0;
+
+		/* Measure I/O timing to write WAL data */
+		if (track_wal_io_timing)
+			INSTR_TIME_SET_CURRENT(start);
+
+		pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
+		written = pg_pwrite(fd, buf, nbytes, offset);
+		pgstat_report_wait_end();
+
+		if (written <= 0)
+		{
+			char		xlogfname[MAXFNAMELEN];
+			int			save_errno;
+
+			/*
+			 * Retry on EINTR. All signals used in the backend and background
+			 * processes are flagged SA_RESTART, so it shouldn't happen, but
+			 * better to be defensive.
+			 */
+			if (errno == EINTR)
+				continue;
+
+			/* if write didn't set errno, assume no disk space */
+			if (errno == 0)
+				errno = ENOSPC;
+
+			save_errno = errno;
+			XLogFileName(xlogfname, tli, segno, wal_segment_size);
+			errno = save_errno;
+			ereport(PANIC,
+					(errcode_for_file_access(),
+					 errmsg("could not write to log file %s "
+							"at offset %u, length %zu: %m",
+							xlogfname, (unsigned int) offset, nbytes)));
+		}
+
+		/*
+		 * Increment the I/O timing and the number of times WAL data were
+		 * written out to disk.
+		 */
+		if (track_wal_io_timing)
+		{
+			instr_time	duration;
+
+			INSTR_TIME_SET_CURRENT(duration);
+			INSTR_TIME_SUBTRACT(duration, start);
+			WalStats.m_wal_write_time += INSTR_TIME_GET_MICROSEC(duration);
+		}
+
+		WalStats.m_wal_write++;
+
+		nbytes -= written;
+		buf += written;
+		offset += written;
+		total_written += written;
+	} while (nbytes > 0);
+
+	return total_written;
+}
+
 /*
  * Record the LSN for an asynchronous transaction commit/abort
  * and nudge the WALWriter if there is work for it to do.
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index f719ab4f6d..e33565d859 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -871,7 +871,7 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 static void
 XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 {
-	int			startoff;
+	uint32		startoff;
 	int			byteswritten;
 
 	while (nbytes > 0)
@@ -938,27 +938,9 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 			segbytes = nbytes;
 
 		/* OK to write the logs */
-		errno = 0;
-
-		byteswritten = pg_pwrite(recvFile, buf, segbytes, (off_t) startoff);
-		if (byteswritten <= 0)
-		{
-			char		xlogfname[MAXFNAMELEN];
-			int			save_errno;
-
-			/* if write didn't set errno, assume no disk space */
-			if (errno == 0)
-				errno = ENOSPC;
-
-			save_errno = errno;
-			XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size);
-			errno = save_errno;
-			ereport(PANIC,
-					(errcode_for_file_access(),
-					 errmsg("could not write to log segment %s "
-							"at offset %u, length %lu: %m",
-							xlogfname, startoff, (unsigned long) segbytes)));
-		}
+		byteswritten = XLogWriteFile(recvFile, buf, segbytes, (off_t) startoff,
+									 recvFileTLI, recvSegNo);
+		Assert(byteswritten == segbytes);
 
 		/* Update state for write */
 		recptr += byteswritten;
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 77187c12be..c1f3ddab89 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -298,6 +298,8 @@ extern bool XLogBackgroundFlush(void);
 extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
 extern int	XLogFileInit(XLogSegNo segno, bool *use_existent, bool use_lock);
 extern int	XLogFileOpen(XLogSegNo segno);
+extern int	XLogWriteFile(int fd, char *buf, Size nbyte, off_t offset,
+						  TimeLineID tli, XLogSegNo segno);
 
 extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
 extern XLogSegNo XLogGetLastRemovedSegno(void);

#64

ikedamsh

ikedamsh@oss.nttdata.com

almost 5 years ago

In reply to: Fujii Masao (#63)

1 attachment(s)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021/03/22 16:50, Fujii Masao wrote:

On 2021/03/22 9:50, ikedamsh wrote:

Agreed. I separated the patches.

If only the former is committed, my trivial concern is that there may be
a disadvantage, but no advantage for the standby server. It may lead to
performance degradation to the wal receiver by calling
INSTR_TIME_SET_CURRENT(), but the stats can't visible for users until the
latter patch is committed.

Your concern is valid, so let's polish and commit also the 0003 patch to v14.
I'm still thinking that it's better to separate wal_xxx columns into
walreceiver's and the others. But if we count even walreceiver activity on
the existing columns, regarding 0003 patch, we need to update the document?
For example, "Number of times WAL buffers were written out to disk via
XLogWrite request." should be "Number of times WAL buffers were written
out to disk via XLogWrite request and by WAL receiver process."? Maybe
we need to append some descriptions about this into "WAL configuration"
section?

Agreed. Users can know whether the stats is for walreceiver or not. The
pg_stat_wal view in standby server shows for the walreceiver, and in primary
server it shows for the others. So, I updated the document.
(v20-0003-Makes-the-wal-receiver-report-WAL-statistics.patch)

I followed the argument of pg_pwrite().
But, I think "char *" is better, so fixed it.

Thanks for updating the patch!
+extern intï¿½ï¿½ï¿½ XLogWriteFile(int fd, char *buf,
+ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ size_t nbyte, off_t offset,
+ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ TimeLineID timelineid, XLogSegNo segno,
+ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ bool write_all);
write_all seems not to be necessary. You added this flag for walreceiver,
I guess. But even without the argument, walreceiver seems to work expectedly.
So, what about the attached patch? I applied some cosmetic changes to the patch.

Thanks a lot. Yes, "write_all" is unnecessary.
Your patch is looks good to me.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachments:

v20-0003-Makes-the-wal-receiver-report-WAL-statistics.patchtext/x-patch; charset=UTF-8; name=v20-0003-Makes-the-wal-receiver-report-WAL-statistics.patchDownload

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index db4b4e460c..281b13b9fa 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3493,7 +3493,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </para>
       <para>
        Number of times WAL buffers were written out to disk via
-       <function>XLogWrite</function> request.
+       <function>XLogWrite</function> request and WAL data were written
+       out to disk by the WAL receiver process.
        See <xref linkend="wal-configuration"/> for more information about
        the internal WAL function <function>XLogWrite</function>.
       </para></entry>
@@ -3521,7 +3522,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </para>
       <para>
        Total amount of time spent writing WAL buffers to disk via
-       <function>XLogWrite</function> request, in milliseconds
+       <function>XLogWrite</function> request and WAL data to disk
+       by the WAL receiver process, in milliseconds
        (if <xref linkend="guc-track-wal-io-timing"/> is enabled,
        otherwise zero).  This includes the sync time when
        <varname>wal_sync_method</varname> is either
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index ae4a3c1293..39e7028c96 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -769,7 +769,7 @@
   </para>
 
   <para>
-   There are two internal functions to write WAL data to disk:
+   There are two internal functions to write generated WAL data to disk:
    <function>XLogWrite</function> and <function>issue_xlog_fsync</function>.
    When <xref linkend="guc-track-wal-io-timing"/> is enabled, the total
    amounts of time <function>XLogWrite</function> writes and
@@ -795,7 +795,19 @@
    <function>issue_xlog_fsync</function> syncs WAL data to disk are also
    counted as <literal>wal_write</literal> and <literal>wal_sync</literal>
    in <structname>pg_stat_wal</structname>, respectively.
-  </para>
+   To write replicated WAL data to disk by the WAL receiver is almost the same
+   as above except for some points. First, there is a dedicated code path for the
+   WAL receiver to write data although <function>issue_xlog_fsync</function> is
+   the same for syncing data.
+   Second, the WAL receiver writes replicated WAL data per bytes from the local
+   memory although the generated WAL data is written per WAL buffer pages.
+   The counters of <literal>wal_write</literal>, <literal>wal_sync</literal>,
+   <literal>wal_write_time</literal>, and <literal>wal_sync_time</literal> are
+   common statistics for writing/syncing both generated and replicated WAL data.
+   But, you can distinguish them because the generated WAL data is written/synced
+   in the primary server and the replicated WAL data is written/synced in
+   the standby server.
+   </para>
  </sect1>
 
  <sect1 id="wal-internals">
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index a7a94d2a83..df028c5039 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -771,6 +771,9 @@ WalRcvDie(int code, Datum arg)
 	/* Ensure that all WAL records received are flushed to disk */
 	XLogWalRcvFlush(true);
 
+	/* Send WAL statistics to the stats collector before terminating */
+	pgstat_send_wal(true);
+
 	/* Mark ourselves inactive in shared memory */
 	SpinLockAcquire(&walrcv->mutex);
 	Assert(walrcv->walRcvState == WALRCV_STREAMING ||
@@ -910,6 +913,12 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 					XLogArchiveForceDone(xlogfname);
 				else
 					XLogArchiveNotify(xlogfname);
+
+				/*
+				 * Send WAL statistics to the stats collector when finishing
+				 * the current WAL segment file to avoid overloading it.
+				 */
+				pgstat_send_wal(false);
 			}
 			recvFile = -1;

#65

Fujii Masao

masao.fujii@oss.nttdata.com

almost 5 years ago

In reply to: ikedamsh (#64)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021/03/22 20:25, ikedamsh wrote:

Agreed. Users can know whether the stats is for walreceiver or not. The
pg_stat_wal view in standby server shows for the walreceiver, and in primary
server it shows for the others. So, I updated the document.
(v20-0003-Makes-the-wal-receiver-report-WAL-statistics.patch)

Thanks for updating the docs!

There was the discussion about when the stats collector is invoked, at [1]/messages/by-id/e5a982a5-8bb4-5a10-cf9a-40dd1921bdb5@oss.nttdata.com.
Currently during archive recovery or standby, the stats collector is
invoked when the startup process reaches the consistent state, sends
PMSIGNAL_BEGIN_HOT_STANDBY, and then the system is starting accepting
read-only connections. But walreceiver can be invoked at earlier stage.
This can cause walreceiver to generate and send the statistics about WAL
writing even though the stats collector has not been running yet. This might
be problematic? If so, maybe we need to ensure that the stats collector is
invoked before walreceiver?

During recovery, the stats collector is not invoked if hot standby mode is
disabled. But walreceiver can be running in this case. So probably we should
change walreceiver so that it's invoked even when hot standby is disabled?
Otherwise we cannnot collect the statistics about WAL writing by walreceiver
in that case.

[1]: /messages/by-id/e5a982a5-8bb4-5a10-cf9a-40dd1921bdb5@oss.nttdata.com
/messages/by-id/e5a982a5-8bb4-5a10-cf9a-40dd1921bdb5@oss.nttdata.com

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#66

Masahiro Ikeda

ikedamsh@oss.nttdata.com

almost 5 years ago

In reply to: Fujii Masao (#65)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021/03/23 16:10, Fujii Masao wrote:

On 2021/03/22 20:25, ikedamsh wrote:

Agreed. Users can know whether the stats is for walreceiver or not. The
pg_stat_wal view in standby server shows for the walreceiver, and in primary
server it shows for the others. So, I updated the document.
(v20-0003-Makes-the-wal-receiver-report-WAL-statistics.patch)

Thanks for updating the docs!

There was the discussion about when the stats collector is invoked, at [1].
Currently during archive recovery or standby, the stats collector is
invoked when the startup process reaches the consistent state, sends
PMSIGNAL_BEGIN_HOT_STANDBY, and then the system is starting accepting
read-only connections. But walreceiver can be invoked at earlier stage.
This can cause walreceiver to generate and send the statistics about WAL
writing even though the stats collector has not been running yet. This might
be problematic? If so, maybe we need to ensure that the stats collector is
invoked before walreceiver?

During recovery, the stats collector is not invoked if hot standby mode is
disabled. But walreceiver can be running in this case. So probably we should
change walreceiver so that it's invoked even when hot standby is disabled?
Otherwise we cannnot collect the statistics about WAL writing by walreceiver
in that case.

[1]
/messages/by-id/e5a982a5-8bb4-5a10-cf9a-40dd1921bdb5@oss.nttdata.com

Thanks for comments! I didn't notice that.
As I mentioned[1]/messages/by-id/9f4e19ad-518d-b91a-e500-25a666471c42@oss.nttdata.com, if my understanding is right, this issue seem to be not for
only the wal receiver.

Since the shared memory thread already handles these issues, does this patch,
which to collect the stats for the wal receiver and make a common function for
writing wal files, have to be committed after the patches for share memory
stats are committed? Or to handle them in this thread because we don't know
when the shared memory stats patches will be committed.

I think the former is better because to collect stats in shared memory is very
useful feature for users and it make a big change in design. So, I think it's
beneficial to make an effort to move the shared memory stats thread forward
(by reviewing or testing) instead of handling the issues in this thread.

[1]: /messages/by-id/9f4e19ad-518d-b91a-e500-25a666471c42@oss.nttdata.com
/messages/by-id/9f4e19ad-518d-b91a-e500-25a666471c42@oss.nttdata.com

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

#67

Fujii Masao

masao.fujii@oss.nttdata.com

almost 5 years ago

In reply to: Masahiro Ikeda (#66)

Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021/03/25 11:50, Masahiro Ikeda wrote:

On 2021/03/23 16:10, Fujii Masao wrote:

On 2021/03/22 20:25, ikedamsh wrote:

Agreed. Users can know whether the stats is for walreceiver or not. The
pg_stat_wal view in standby server shows for the walreceiver, and in primary
server it shows for the others. So, I updated the document.
(v20-0003-Makes-the-wal-receiver-report-WAL-statistics.patch)

Thanks for updating the docs!

There was the discussion about when the stats collector is invoked, at [1].
Currently during archive recovery or standby, the stats collector is
invoked when the startup process reaches the consistent state, sends
PMSIGNAL_BEGIN_HOT_STANDBY, and then the system is starting accepting
read-only connections. But walreceiver can be invoked at earlier stage.
This can cause walreceiver to generate and send the statistics about WAL
writing even though the stats collector has not been running yet. This might
be problematic? If so, maybe we need to ensure that the stats collector is
invoked before walreceiver?

During recovery, the stats collector is not invoked if hot standby mode is
disabled. But walreceiver can be running in this case. So probably we should
change walreceiver so that it's invoked even when hot standby is disabled?
Otherwise we cannnot collect the statistics about WAL writing by walreceiver
in that case.

[1]
/messages/by-id/e5a982a5-8bb4-5a10-cf9a-40dd1921bdb5@oss.nttdata.com

Thanks for comments! I didn't notice that.
As I mentioned[1], if my understanding is right, this issue seem to be not for
only the wal receiver.

Since the shared memory thread already handles these issues, does this patch,
which to collect the stats for the wal receiver and make a common function for
writing wal files, have to be committed after the patches for share memory
stats are committed? Or to handle them in this thread because we don't know
when the shared memory stats patches will be committed.

I think the former is better because to collect stats in shared memory is very
useful feature for users and it make a big change in design. So, I think it's
beneficial to make an effort to move the shared memory stats thread forward
(by reviewing or testing) instead of handling the issues in this thread.

Sounds reasonable. Agreed.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION