Track IO times in pg_stat_io

Started by Melanie Plageman almost 3 years ago, 23 messages
#1 Melanie Plageman
melanieplageman@gmail.com
1 attachment(s)

Hi,

As suggested in [1], the attached patch adds IO times to pg_stat_io.

I added docs but haven't added any tests. The timings will only be
non-zero when track_io_timing is on, and the only tests I see that turn
track_io_timing on are in explain.sql, where the IO timings I added to
pg_stat_io would not be visible.

I didn't split it up into two patches (one with the changes to track IO
timing and one with the view additions and docs), because I figured the
overall diff is pretty small.

There is one minor question (in the code as a TODO) which is whether or
not it is worth cross-checking that IO counts and times are either both
zero or neither zero in the validation function
pgstat_bktype_io_stats_valid().
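
A minimal sketch of what such a cross-check could look like. It uses
hypothetical stand-in names (IoCell, io_cell_consistent) rather than the
PostgreSQL structs, and it assumes the caller knows whether timing was
enabled for the whole accumulation period. Time without a count is treated
as always inconsistent; a count without time is only flagged in the strict
mode, since with timing off the time stays zero while counts grow.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for one (object, context, op) cell. */
typedef struct IoCell
{
    int64_t count;      /* number of operations counted */
    int64_t time_ns;    /* accumulated IO time, in nanoseconds */
} IoCell;

/*
 * Return true if the count/time pair looks consistent.
 *
 * timing_always_on is an assumption supplied by the caller: it should be
 * true only if timing was enabled for every operation that was counted.
 */
bool
io_cell_consistent(const IoCell *cell, bool timing_always_on)
{
    /* Time was recorded, but no operation was ever counted. */
    if (cell->time_ns != 0 && cell->count == 0)
        return false;

    /* Strict mode only: operations were counted but no time accrued. */
    if (timing_always_on && cell->count != 0 && cell->time_ns == 0)
        return false;

    return true;
}
```

The strict direction is the fragile one: besides the setting being toggled,
a coarse clock could in principle measure zero elapsed time for a very fast
operation, so it may be safer to keep only the first check.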

- Melanie

[1]: /messages/by-id/20230209050319.chyyup4vtq4jzobq@awork3.anarazel.de

Attachments:

v1-0001-Track-IO-times-in-pg_stat_io.patch (text/x-patch; charset=US-ASCII)
From f0c96e638e33f7404b44b936d5dfa6d4945b99d0 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Sat, 25 Feb 2023 18:09:10 -0500
Subject: [PATCH v1] Track IO times in pg_stat_io

Add IO timing for reads, writes, extends, and fsyncs to pg_stat_io.
---
 doc/src/sgml/monitoring.sgml           | 48 +++++++++++++++
 src/backend/catalog/system_views.sql   |  4 ++
 src/backend/storage/buffer/bufmgr.c    | 34 +++++++++++
 src/backend/storage/buffer/localbuf.c  | 14 +++++
 src/backend/storage/smgr/md.c          | 30 ++++++++++
 src/backend/utils/activity/pgstat_io.c | 83 +++++++++++++++++++++-----
 src/backend/utils/adt/pgstatfuncs.c    | 40 +++++++++++--
 src/include/catalog/pg_proc.dat        |  6 +-
 src/include/pgstat.h                   |  5 +-
 src/test/regress/expected/rules.out    |  6 +-
 10 files changed, 246 insertions(+), 24 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index b0b997f092..e74d9c1cf1 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3814,6 +3814,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>read_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in read operations in milliseconds (if <xref
+        linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
@@ -3826,6 +3838,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>write_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in write operations in milliseconds (if <xref
+        linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
@@ -3838,6 +3862,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>extend_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in extend operations in milliseconds (if <xref
+        linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
@@ -3902,6 +3938,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>fsync_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in fsync operations in milliseconds (if <xref
+        linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 34ca0e739f..39391bc2fc 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1123,12 +1123,16 @@ SELECT
        b.io_object,
        b.io_context,
        b.reads,
+       b.read_time,
        b.writes,
+       b.write_time,
        b.extends,
+       b.extend_time,
        b.op_bytes,
        b.evictions,
        b.reuses,
        b.fsyncs,
+       b.fsync_time,
        b.stats_reset
 FROM pg_stat_get_io() b;
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 98904a7c05..52302b317e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1000,11 +1000,28 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	if (isExtend)
 	{
+		instr_time	io_start,
+					io_time;
+
 		/* new buffers are zero-filled */
 		MemSet((char *) bufBlock, 0, BLCKSZ);
+
+		if (track_io_timing)
+			INSTR_TIME_SET_CURRENT(io_start);
+		else
+			INSTR_TIME_SET_ZERO(io_start);
+
 		/* don't set checksum for all-zero page */
 		smgrextend(smgr, forkNum, blockNum, (char *) bufBlock, false);
 
+		if (track_io_timing)
+		{
+			INSTR_TIME_SET_CURRENT(io_time);
+			INSTR_TIME_SUBTRACT(io_time, io_start);
+			pgstat_count_io_time(io_object, io_context, IOOP_EXTEND, io_time);
+		}
+
+
 		pgstat_count_io_op(io_object, io_context, IOOP_EXTEND);
 
 		/*
@@ -1042,6 +1059,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 				INSTR_TIME_SUBTRACT(io_time, io_start);
 				pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
 				INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
+				pgstat_count_io_time(io_object, io_context, IOOP_READ, io_time);
 			}
 
 			/* check for garbage data */
@@ -2989,6 +3007,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 		INSTR_TIME_SUBTRACT(io_time, io_start);
 		pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
 		INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
+		pgstat_count_io_time(IOOBJECT_RELATION, io_context, IOOP_WRITE, io_time);
 	}
 
 	pgBufferUsage.shared_blks_written++;
@@ -3594,6 +3613,9 @@ FlushRelationBuffers(Relation rel)
 
 	if (RelationUsesLocalBuffers(rel))
 	{
+		instr_time	io_start,
+					io_time;
+
 		for (i = 0; i < NLocBuffer; i++)
 		{
 			uint32		buf_state;
@@ -3616,6 +3638,11 @@ FlushRelationBuffers(Relation rel)
 
 				PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
+				if (track_io_timing)
+					INSTR_TIME_SET_CURRENT(io_start);
+				else
+					INSTR_TIME_SET_ZERO(io_start);
+
 				smgrwrite(RelationGetSmgr(rel),
 						  BufTagGetForkNum(&bufHdr->tag),
 						  bufHdr->tag.blockNum,
@@ -3627,6 +3654,13 @@ FlushRelationBuffers(Relation rel)
 
 				pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
 
+				if (track_io_timing)
+				{
+					INSTR_TIME_SET_CURRENT(io_time);
+					INSTR_TIME_SUBTRACT(io_time, io_start);
+					pgstat_count_io_time(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE, io_time);
+				}
+
 				/* Pop the error context stack */
 				error_context_stack = errcallback.previous;
 			}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 5325ddb663..b1272840bd 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -220,6 +220,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 	 */
 	if (buf_state & BM_DIRTY)
 	{
+		instr_time	io_start,
+					io_time;
 		SMgrRelation oreln;
 		Page		localpage = (char *) LocalBufHdrGetBlock(bufHdr);
 
@@ -228,6 +230,11 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 
 		PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
+		if (track_io_timing)
+			INSTR_TIME_SET_CURRENT(io_start);
+		else
+			INSTR_TIME_SET_ZERO(io_start);
+
 		/* And write... */
 		smgrwrite(oreln,
 				  BufTagGetForkNum(&bufHdr->tag),
@@ -239,6 +246,13 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 		buf_state &= ~BM_DIRTY;
 		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
 
+		if (track_io_timing)
+		{
+			INSTR_TIME_SET_CURRENT(io_time);
+			INSTR_TIME_SUBTRACT(io_time, io_start);
+			pgstat_count_io_time(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE, io_time);
+		}
+
 		pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
 		pgBufferUsage.local_blks_written++;
 	}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 8da813600c..7f644d6bc2 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1030,6 +1030,14 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 
 	if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false /* retryOnError */ ))
 	{
+		instr_time	io_start,
+					io_time;
+
+		if (track_io_timing)
+			INSTR_TIME_SET_CURRENT(io_start);
+		else
+			INSTR_TIME_SET_ZERO(io_start);
+
 		/*
 		 * We have no way of knowing if the current IOContext is
 		 * IOCONTEXT_NORMAL or IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] at this
@@ -1051,6 +1059,14 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 					(errcode_for_file_access(),
 					 errmsg("could not fsync file \"%s\": %m",
 							FilePathName(seg->mdfd_vfd))));
+
+		if (track_io_timing)
+		{
+			INSTR_TIME_SET_CURRENT(io_time);
+			INSTR_TIME_SUBTRACT(io_time, io_start);
+			pgstat_count_io_time(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC, io_time);
+		}
+
 	}
 }
 
@@ -1399,6 +1415,8 @@ int
 mdsyncfiletag(const FileTag *ftag, char *path)
 {
 	SMgrRelation reln = smgropen(ftag->rlocator, InvalidBackendId);
+	instr_time	io_start,
+				io_time;
 	File		file;
 	bool		need_to_close;
 	int			result,
@@ -1425,10 +1443,22 @@ mdsyncfiletag(const FileTag *ftag, char *path)
 		need_to_close = true;
 	}
 
+	if (track_io_timing)
+		INSTR_TIME_SET_CURRENT(io_start);
+	else
+		INSTR_TIME_SET_ZERO(io_start);
+
 	/* Sync the file. */
 	result = FileSync(file, WAIT_EVENT_DATA_FILE_SYNC);
 	save_errno = errno;
 
+	if (track_io_timing)
+	{
+		INSTR_TIME_SET_CURRENT(io_time);
+		INSTR_TIME_SUBTRACT(io_time, io_start);
+		pgstat_count_io_time(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC, io_time);
+	}
+
 	if (need_to_close)
 		FileClose(file);
 
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 0e07e0848d..1386793479 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -25,17 +25,20 @@ bool		have_iostats = false;
 
 /*
  * Check that stats have not been counted for any combination of IOObject,
- * IOContext, and IOOp which are not tracked for the passed-in BackendType. The
- * passed-in PgStat_BktypeIO must contain stats from the BackendType specified
- * by the second parameter. Caller is responsible for locking the passed-in
- * PgStat_BktypeIO, if needed.
+ * IOContext, and IOOp which are not tracked for the passed-in BackendType. If
+ * the IOOp is not counted for this combination but IO time is otherwise
+ * tracked for this IOOp, check that IO time has not been counted for this
+ * combination.
+ * TODO: should we check that IO counts are not 0 if IO time is not zero?
+ *
+ * The passed-in PgStat_BktypeIO must contain stats from the BackendType
+ * specified by the second parameter. Caller is responsible for locking the
+ * passed-in PgStat_BktypeIO, if needed.
  */
 bool
 pgstat_bktype_io_stats_valid(PgStat_BktypeIO *backend_io,
 							 BackendType bktype)
 {
-	bool		bktype_tracked = pgstat_tracks_io_bktype(bktype);
-
 	for (IOObject io_object = IOOBJECT_FIRST;
 		 io_object < IOOBJECT_NUM_TYPES; io_object++)
 	{
@@ -49,14 +52,20 @@ pgstat_bktype_io_stats_valid(PgStat_BktypeIO *backend_io,
 			 */
 			for (IOOp io_op = IOOP_FIRST; io_op < IOOP_NUM_TYPES; io_op++)
 			{
-				/* No stats, so nothing to validate */
-				if (backend_io->data[io_object][io_context][io_op] == 0)
+				/* we do track it */
+				if (pgstat_tracks_io_op(bktype, io_object, io_context, io_op))
 					continue;
 
-				/* There are stats and there shouldn't be */
-				if (!bktype_tracked ||
-					!pgstat_tracks_io_op(bktype, io_object, io_context, io_op))
+				/* we don't track it, and it is not 0 */
+				if (backend_io->counts[io_object][io_context][io_op] != 0)
 					return false;
+
+				/* we don't track this IOOp, so make sure its IO time is zero */
+				if (pgstat_tracks_io_time(io_op) > -1)
+				{
+					if (!INSTR_TIME_IS_ZERO(backend_io->times[io_object][io_context][io_op]))
+						return false;
+				}
 			}
 		}
 	}
@@ -72,7 +81,21 @@ pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op)
 	Assert(io_op < IOOP_NUM_TYPES);
 	Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
 
-	PendingIOStats.data[io_object][io_context][io_op]++;
+	PendingIOStats.counts[io_object][io_context][io_op]++;
+
+	have_iostats = true;
+}
+
+void
+pgstat_count_io_time(IOObject io_object, IOContext io_context, IOOp io_op, instr_time time)
+{
+	Assert(io_object < IOOBJECT_NUM_TYPES);
+	Assert(io_context < IOCONTEXT_NUM_TYPES);
+	Assert(io_op < IOOP_NUM_TYPES);
+	Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
+	Assert(pgstat_tracks_io_time(io_op) != -1);
+
+	INSTR_TIME_ADD(PendingIOStats.times[io_object][io_context][io_op], time);
 
 	have_iostats = true;
 }
@@ -119,8 +142,13 @@ pgstat_flush_io(bool nowait)
 		{
 			for (IOOp io_op = IOOP_FIRST;
 				 io_op < IOOP_NUM_TYPES; io_op++)
-				bktype_shstats->data[io_object][io_context][io_op] +=
-					PendingIOStats.data[io_object][io_context][io_op];
+			{
+				bktype_shstats->counts[io_object][io_context][io_op] +=
+					PendingIOStats.counts[io_object][io_context][io_op];
+
+				INSTR_TIME_ADD(bktype_shstats->times[io_object][io_context][io_op],
+							   PendingIOStats.times[io_object][io_context][io_op]);
+			}
 		}
 	}
 
@@ -389,3 +417,30 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 
 	return true;
 }
+
+/*
+ * PgStat_BktypeIO->times contains IO times for IOOps. For simplicity this
+ * array has a spot for every IOOp. pgstat_tracks_io_time() is the source of
+ * truth for which IOOps have corresponding IO times.
+ */
+IOOp
+pgstat_tracks_io_time(IOOp io_op)
+{
+	switch (io_op)
+	{
+		case IOOP_READ:
+			return IOOP_READ;
+		case IOOP_WRITE:
+			return IOOP_WRITE;
+		case IOOP_EXTEND:
+			return IOOP_EXTEND;
+		case IOOP_FSYNC:
+			return IOOP_FSYNC;
+		case IOOP_EVICT:
+		case IOOP_REUSE:
+			return -1;
+	}
+
+	elog(ERROR, "unrecognized IOOp value: %d", io_op);
+	pg_unreachable();
+}
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 9d707c3521..3bce59e543 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1255,12 +1255,16 @@ typedef enum io_stat_col
 	IO_COL_IO_OBJECT,
 	IO_COL_IO_CONTEXT,
 	IO_COL_READS,
+	IO_COL_READ_TIME,
 	IO_COL_WRITES,
+	IO_COL_WRITE_TIME,
 	IO_COL_EXTENDS,
+	IO_COL_EXTEND_TIME,
 	IO_COL_CONVERSION,
 	IO_COL_EVICTIONS,
 	IO_COL_REUSES,
 	IO_COL_FSYNCS,
+	IO_COL_FSYNC_TIME,
 	IO_COL_RESET_TIME,
 	IO_NUM_COLUMNS,
 } io_stat_col;
@@ -1292,6 +1296,21 @@ pgstat_get_io_op_index(IOOp io_op)
 	pg_unreachable();
 }
 
+/*
+ * Get the number of the column containing IO times for the specified IOOp. If
+ * the specified IOOp is one for which IO time is not tracked, return -1. Note
+ * that this function assumes that IO time for an IOOp is displayed in the view
+ * in the column directly after the IOOp counts.
+ */
+static io_stat_col
+pgstat_get_io_time_index(IOOp io_op)
+{
+	if (pgstat_tracks_io_time(io_op) == -1)
+		return -1;
+
+	return pgstat_get_io_op_index(io_op) + 1;
+}
+
 Datum
 pg_stat_get_io(PG_FUNCTION_ARGS)
 {
@@ -1361,20 +1380,31 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
 
 				for (IOOp io_op = IOOP_FIRST; io_op < IOOP_NUM_TYPES; io_op++)
 				{
-					int			col_idx = pgstat_get_io_op_index(io_op);
+					int			i = pgstat_get_io_op_index(io_op);
 
 					/*
 					 * Some combinations of BackendType and IOOp, of IOContext
 					 * and IOOp, and of IOObject and IOOp are not tracked. Set
 					 * these cells in the view NULL.
 					 */
-					nulls[col_idx] = !pgstat_tracks_io_op(bktype, io_obj, io_context, io_op);
+					if (pgstat_tracks_io_op(bktype, io_obj, io_context, io_op))
+						values[i] = Int64GetDatum(bktype_stats->counts[io_obj][io_context][io_op]);
+					else
+						nulls[i] = true;
+				}
+
+				for (IOOp io_op = IOOP_FIRST; io_op < IOOP_NUM_TYPES; io_op++)
+				{
+					int			i = pgstat_get_io_time_index(io_op);
 
-					if (nulls[col_idx])
+					if (i == -1)
 						continue;
 
-					values[col_idx] =
-						Int64GetDatum(bktype_stats->data[io_obj][io_context][io_op]);
+					if (!nulls[pgstat_get_io_op_index(io_op)])
+						values[i] =
+							Float8GetDatum(INSTR_TIME_GET_MILLISEC(bktype_stats->times[io_obj][io_context][io_op]));
+					else
+						nulls[i] = true;
 				}
 
 				tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index e2a7642a2b..9bf9f55db4 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5721,9 +5721,9 @@
   proname => 'pg_stat_get_io', provolatile => 'v',
   prorows => '30', proretset => 't',
   proparallel => 'r', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{text,text,text,int8,int8,int8,int8,int8,int8,int8,timestamptz}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{backend_type,io_object,io_context,reads,writes,extends,op_bytes,evictions,reuses,fsyncs,stats_reset}',
+  proallargtypes => '{text,text,text,int8,float8,int8,float8,int8,float8,int8,int8,int8,int8,float8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{backend_type,io_object,io_context,reads,read_time,writes,write_time,extends,extend_time,op_bytes,evictions,reuses,fsyncs,fsync_time,stats_reset}',
   prosrc => 'pg_stat_get_io' },
 
 { oid => '1136', descr => 'statistics: information about WAL activity',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index db9675884f..1e1b792a48 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -316,7 +316,8 @@ typedef enum IOOp
 
 typedef struct PgStat_BktypeIO
 {
-	PgStat_Counter data[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	instr_time	times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 } PgStat_BktypeIO;
 
 typedef struct PgStat_IO
@@ -510,6 +511,7 @@ extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
 extern bool pgstat_bktype_io_stats_valid(PgStat_BktypeIO *context_ops,
 										 BackendType bktype);
 extern void pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op);
+extern void pgstat_count_io_time(IOObject io_object, IOContext io_context, IOOp io_op, instr_time time);
 extern PgStat_IO *pgstat_fetch_stat_io(void);
 extern const char *pgstat_get_io_context_name(IOContext io_context);
 extern const char *pgstat_get_io_object_name(IOObject io_object);
@@ -519,6 +521,7 @@ extern bool pgstat_tracks_io_object(BackendType bktype,
 									IOObject io_object, IOContext io_context);
 extern bool pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 								IOContext io_context, IOOp io_op);
+extern IOOp pgstat_tracks_io_time(IOOp io_op);
 
 
 /*
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index e953d1f515..5434851314 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1880,14 +1880,18 @@ pg_stat_io| SELECT backend_type,
     io_object,
     io_context,
     reads,
+    read_time,
     writes,
+    write_time,
     extends,
+    extend_time,
     op_bytes,
     evictions,
     reuses,
     fsyncs,
+    fsync_time,
     stats_reset
-   FROM pg_stat_get_io() b(backend_type, io_object, io_context, reads, writes, extends, op_bytes, evictions, reuses, fsyncs, stats_reset);
+   FROM pg_stat_get_io() b(backend_type, io_object, io_context, reads, read_time, writes, write_time, extends, extend_time, op_bytes, evictions, reuses, fsyncs, fsync_time, stats_reset);
 pg_stat_progress_analyze| SELECT s.pid,
     s.datid,
     d.datname,
-- 
2.37.2

#2 Drouvot, Bertrand
bertranddrouvot.pg@gmail.com
In reply to: Melanie Plageman (#1)
1 attachment(s)
Re: Track IO times in pg_stat_io

Hi,

On 2/26/23 5:03 PM, Melanie Plageman wrote:

> Hi,
>
> As suggested in [1], the attached patch adds IO times to pg_stat_io;

Thanks for the patch!

I started to have a look at it and figured out that a tiny rebase was needed (due to
728560db7d and b9f0e54bc9), so please find the rebase (aka V2) attached.

> The timings will only be non-zero when track_io_timing is on

That could lead to misinterpretation if one wants to divide the timing by the number of operations, for example when:

- track_io_timing is set to on while operations have already been counted
- or it is set to off after having been on (and the number of operations keeps growing)

Might be worth warning about/highlighting this in the "track_io_timing" doc?
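
To illustrate the skew with made-up numbers: if 1000 reads were counted
with timing off and then 100 reads took 200 ms with timing on, dividing the
cumulative totals gives 200 / 1100, about 0.18 ms per read, while the timed
reads actually averaged 2 ms. A minimal sketch of computing the average from
the difference between two snapshots instead; IoSnapshot and avg_ms_per_op
are hypothetical names, and the result is only meaningful if timing stayed
on for the whole interval between the snapshots.

```c
#include <stdint.h>

/* Hypothetical snapshot of one counter/time pair from the view. */
typedef struct IoSnapshot
{
    int64_t ops;        /* cumulative operation count */
    double  time_ms;    /* cumulative time in milliseconds */
} IoSnapshot;

/*
 * Average milliseconds per operation between two snapshots.
 * Returns -1.0 when no operations happened in the interval.
 */
double
avg_ms_per_op(IoSnapshot before, IoSnapshot after)
{
    int64_t ops = after.ops - before.ops;

    if (ops <= 0)
        return -1.0;

    return (after.time_ms - before.time_ms) / (double) ops;
}
```

Passing a zeroed snapshot as the starting point reproduces the naive
cumulative division, which makes the two results easy to compare.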

+               if (track_io_timing)
+               {
+                       INSTR_TIME_SET_CURRENT(io_time);
+                       INSTR_TIME_SUBTRACT(io_time, io_start);
+                       pgstat_count_io_time(io_object, io_context, IOOP_EXTEND, io_time);
+               }
+
+
                 pgstat_count_io_op(io_object, io_context, IOOP_EXTEND);

vs

@@ -1042,6 +1059,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
                                 INSTR_TIME_SUBTRACT(io_time, io_start);
                                 pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
                                 INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
+                               pgstat_count_io_time(io_object, io_context, IOOP_READ, io_time);
                         }

That leads to pgstat_count_io_time() being called before pgstat_count_io_op() (for the IOOP_EXTEND case) and
after pgstat_count_io_op() (for the IOOP_READ case).

What about calling them in the same order everywhere, so that pgstat_count_io_time() is always called before pgstat_count_io_op()?

If so, the ordering would also need to be changed in:

- FlushRelationBuffers()
- register_dirty_segment()
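
One way to make the ordering uniform by construction would be to route both
updates through a single helper. The following is only a sketch under
stated assumptions, not code from the patch: IoStats, track_timing,
io_timing_start, io_op_finish and fake_clock_ns are all hypothetical
stand-ins, and the clock is passed in as a function pointer so the example
does not depend on any platform timing API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for one pending-stats cell. */
typedef struct IoStats
{
    int64_t count;
    int64_t time_ns;
} IoStats;

/* Stand-in for the track_io_timing setting. */
bool track_timing = false;

/* Caller-supplied clock returning nanoseconds. */
typedef int64_t (*clock_fn) (void);

/* Deterministic fake clock for illustration: advances 1000 ns per call. */
int64_t
fake_clock_ns(void)
{
    static int64_t now = 0;

    now += 1000;
    return now;
}

/* Read the clock only when timing is enabled; otherwise return zero. */
int64_t
io_timing_start(clock_fn now)
{
    return track_timing ? now() : 0;
}

/* Account the elapsed time first, then the operation, in one place. */
void
io_op_finish(IoStats *stats, clock_fn now, int64_t start_ns)
{
    if (track_timing)
        stats->time_ns += now() - start_ns;

    stats->count++;
}
```

A call site would then wrap the IO call between io_timing_start() and
io_op_finish(), so the time-then-count order no longer depends on how each
caller arranges its statements.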

> There is one minor question (in the code as a TODO) which is whether or
> not it is worth cross-checking that IO counts and times are either both
> zero or neither zero in the validation function
> pgstat_bktype_io_stats_valid().

As pgstat_bktype_io_stats_valid() is called only in Assert(), I think it would be a good idea
to also check that if counts are non-zero then times are non-zero.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v2-0001-Track-IO-times-in-pg_stat_io.patch (text/plain; charset=UTF-8)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 6249bb50d0..2c62b0a437 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3814,6 +3814,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>read_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in read operations in milliseconds (if <xref
+        linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
@@ -3826,6 +3838,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>write_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in write operations in milliseconds (if <xref
+        linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
@@ -3838,6 +3862,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>extend_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in extend operations in milliseconds (if <xref
+        linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
@@ -3902,6 +3938,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>fsync_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in fsync operations in milliseconds (if <xref
+        linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 34ca0e739f..39391bc2fc 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1123,12 +1123,16 @@ SELECT
        b.io_object,
        b.io_context,
        b.reads,
+       b.read_time,
        b.writes,
+       b.write_time,
        b.extends,
+       b.extend_time,
        b.op_bytes,
        b.evictions,
        b.reuses,
        b.fsyncs,
+       b.fsync_time,
        b.stats_reset
 FROM pg_stat_get_io() b;
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 0a05577b68..bbd2af9fae 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1000,11 +1000,26 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	if (isExtend)
 	{
+		instr_time  io_start,
+					io_time;
 		/* new buffers are zero-filled */
 		MemSet((char *) bufBlock, 0, BLCKSZ);
+
+		if (track_io_timing)
+			INSTR_TIME_SET_CURRENT(io_start);
+		else
+			INSTR_TIME_SET_ZERO(io_start);
+
 		/* don't set checksum for all-zero page */
 		smgrextend(smgr, forkNum, blockNum, bufBlock, false);
 
+		if (track_io_timing)
+		{
+			INSTR_TIME_SET_CURRENT(io_time);
+			INSTR_TIME_SUBTRACT(io_time, io_start);
+			pgstat_count_io_time(io_object, io_context, IOOP_EXTEND, io_time);
+		}
+
 		pgstat_count_io_op(io_object, io_context, IOOP_EXTEND);
 
 		/*
@@ -1042,6 +1057,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 				INSTR_TIME_SUBTRACT(io_time, io_start);
 				pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
 				INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
+				pgstat_count_io_time(io_object, io_context, IOOP_READ, io_time);
 			}
 
 			/* check for garbage data */
@@ -2989,6 +3005,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 		INSTR_TIME_SUBTRACT(io_time, io_start);
 		pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
 		INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
+		pgstat_count_io_time(IOOBJECT_RELATION, io_context, IOOP_WRITE, io_time);
 	}
 
 	pgBufferUsage.shared_blks_written++;
@@ -3594,6 +3611,9 @@ FlushRelationBuffers(Relation rel)
 
 	if (RelationUsesLocalBuffers(rel))
 	{
+		instr_time	io_start,
+					io_time;
+
 		for (i = 0; i < NLocBuffer; i++)
 		{
 			uint32		buf_state;
@@ -3616,6 +3636,11 @@ FlushRelationBuffers(Relation rel)
 
 				PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
+				if (track_io_timing)
+					INSTR_TIME_SET_CURRENT(io_start);
+				else
+					INSTR_TIME_SET_ZERO(io_start);
+
 				smgrwrite(RelationGetSmgr(rel),
 						  BufTagGetForkNum(&bufHdr->tag),
 						  bufHdr->tag.blockNum,
@@ -3627,6 +3652,13 @@ FlushRelationBuffers(Relation rel)
 
 				pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
 
+				if (track_io_timing)
+				{
+					INSTR_TIME_SET_CURRENT(io_time);
+					INSTR_TIME_SUBTRACT(io_time, io_start);
+					pgstat_count_io_time(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE, io_time);
+				}
+
 				/* Pop the error context stack */
 				error_context_stack = errcallback.previous;
 			}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 5325ddb663..b1272840bd 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -220,6 +220,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 	 */
 	if (buf_state & BM_DIRTY)
 	{
+		instr_time	io_start,
+					io_time;
 		SMgrRelation oreln;
 		Page		localpage = (char *) LocalBufHdrGetBlock(bufHdr);
 
@@ -228,6 +230,11 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 
 		PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
+		if (track_io_timing)
+			INSTR_TIME_SET_CURRENT(io_start);
+		else
+			INSTR_TIME_SET_ZERO(io_start);
+
 		/* And write... */
 		smgrwrite(oreln,
 				  BufTagGetForkNum(&bufHdr->tag),
@@ -239,6 +246,13 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 		buf_state &= ~BM_DIRTY;
 		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
 
+		if (track_io_timing)
+		{
+			INSTR_TIME_SET_CURRENT(io_time);
+			INSTR_TIME_SUBTRACT(io_time, io_start);
+			pgstat_count_io_time(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE, io_time);
+		}
+
 		pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
 		pgBufferUsage.local_blks_written++;
 	}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 352958e1fe..1e2f4bc9a9 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1030,6 +1030,14 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 
 	if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false /* retryOnError */ ))
 	{
+		instr_time	io_start,
+					io_time;
+
+		if (track_io_timing)
+			INSTR_TIME_SET_CURRENT(io_start);
+		else
+			INSTR_TIME_SET_ZERO(io_start);
+
 		/*
 		 * We have no way of knowing if the current IOContext is
 		 * IOCONTEXT_NORMAL or IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] at this
@@ -1051,6 +1059,14 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 					(errcode_for_file_access(),
 					 errmsg("could not fsync file \"%s\": %m",
 							FilePathName(seg->mdfd_vfd))));
+
+		if (track_io_timing)
+		{
+			INSTR_TIME_SET_CURRENT(io_time);
+			INSTR_TIME_SUBTRACT(io_time, io_start);
+			pgstat_count_io_time(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC, io_time);
+		}
+
 	}
 }
 
@@ -1399,6 +1415,8 @@ int
 mdsyncfiletag(const FileTag *ftag, char *path)
 {
 	SMgrRelation reln = smgropen(ftag->rlocator, InvalidBackendId);
+	instr_time	io_start,
+				io_time;
 	File		file;
 	bool		need_to_close;
 	int			result,
@@ -1425,10 +1443,22 @@ mdsyncfiletag(const FileTag *ftag, char *path)
 		need_to_close = true;
 	}
 
+	if (track_io_timing)
+		INSTR_TIME_SET_CURRENT(io_start);
+	else
+		INSTR_TIME_SET_ZERO(io_start);
+
 	/* Sync the file. */
 	result = FileSync(file, WAIT_EVENT_DATA_FILE_SYNC);
 	save_errno = errno;
 
+	if (track_io_timing)
+	{
+		INSTR_TIME_SET_CURRENT(io_time);
+		INSTR_TIME_SUBTRACT(io_time, io_start);
+		pgstat_count_io_time(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC, io_time);
+	}
+
 	if (need_to_close)
 		FileClose(file);
 
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index c4199d18c8..75e34be982 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -25,17 +25,20 @@ bool		have_iostats = false;
 
 /*
  * Check that stats have not been counted for any combination of IOObject,
- * IOContext, and IOOp which are not tracked for the passed-in BackendType. The
- * passed-in PgStat_BktypeIO must contain stats from the BackendType specified
- * by the second parameter. Caller is responsible for locking the passed-in
- * PgStat_BktypeIO, if needed.
+ * IOContext, and IOOp which are not tracked for the passed-in BackendType. If
+ * the IOOp is not counted for this combination but IO time is otherwise
+ * tracked for this IOOp, check that IO time has not been counted for this
+ * combination.
+ * TODO: should we check that IO counts are not 0 if IO time is not zero?
+ *
+ * The passed-in PgStat_BktypeIO must contain stats from the BackendType
+ * specified by the second parameter. Caller is responsible for locking the
+ * passed-in PgStat_BktypeIO, if needed.
  */
 bool
 pgstat_bktype_io_stats_valid(PgStat_BktypeIO *backend_io,
 							 BackendType bktype)
 {
-	bool		bktype_tracked = pgstat_tracks_io_bktype(bktype);
-
 	for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
 	{
 		for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
@@ -47,14 +50,21 @@ pgstat_bktype_io_stats_valid(PgStat_BktypeIO *backend_io,
 			 */
 			for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
 			{
-				/* No stats, so nothing to validate */
-				if (backend_io->data[io_object][io_context][io_op] == 0)
+				/* we do track it */
+				if (pgstat_tracks_io_op(bktype, io_object, io_context, io_op))
 					continue;
 
-				/* There are stats and there shouldn't be */
-				if (!bktype_tracked ||
-					!pgstat_tracks_io_op(bktype, io_object, io_context, io_op))
+				/* we don't track it, and it is not 0 */
+				if (backend_io->counts[io_object][io_context][io_op] != 0)
 					return false;
+
+				/* we don't track this IOOp, so make sure its IO time is zero */
+				if (pgstat_tracks_io_time(io_op) > -1)
+				{
+					if (!INSTR_TIME_IS_ZERO(backend_io->times[io_object][io_context][io_op]))
+						return false;
+				}
+
 			}
 		}
 	}
@@ -70,7 +80,21 @@ pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op)
 	Assert(io_op < IOOP_NUM_TYPES);
 	Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
 
-	PendingIOStats.data[io_object][io_context][io_op]++;
+	PendingIOStats.counts[io_object][io_context][io_op]++;
+
+	have_iostats = true;
+}
+
+void
+pgstat_count_io_time(IOObject io_object, IOContext io_context, IOOp io_op, instr_time time)
+{
+	Assert(io_object < IOOBJECT_NUM_TYPES);
+	Assert(io_context < IOCONTEXT_NUM_TYPES);
+	Assert(io_op < IOOP_NUM_TYPES);
+	Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
+	Assert(pgstat_tracks_io_time(io_op) != -1);
+
+	INSTR_TIME_ADD(PendingIOStats.times[io_object][io_context][io_op], time);
 
 	have_iostats = true;
 }
@@ -114,8 +138,13 @@ pgstat_flush_io(bool nowait)
 		for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
 		{
 			for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
-				bktype_shstats->data[io_object][io_context][io_op] +=
-					PendingIOStats.data[io_object][io_context][io_op];
+			{
+				bktype_shstats->counts[io_object][io_context][io_op] +=
+					PendingIOStats.counts[io_object][io_context][io_op];
+
+				INSTR_TIME_ADD(bktype_shstats->times[io_object][io_context][io_op],
+							   PendingIOStats.times[io_object][io_context][io_op]);
+			}
 		}
 	}
 
@@ -384,3 +413,30 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 
 	return true;
 }
+
+/*
+ * PgStat_BktypeIO->times contains IO times for IOOps. For simplicity this
+ * array has a spot for every IOOp. pgstat_tracks_io_time() is the source of
+ * truth for which IOOps have corresponding IO times.
+ */
+IOOp
+pgstat_tracks_io_time(IOOp io_op)
+{
+	switch (io_op)
+	{
+		case IOOP_READ:
+			return IOOP_READ;
+		case IOOP_WRITE:
+			return IOOP_WRITE;
+		case IOOP_EXTEND:
+			return IOOP_EXTEND;
+		case IOOP_FSYNC:
+			return IOOP_FSYNC;
+		case IOOP_EVICT:
+		case IOOP_REUSE:
+			return -1;
+	}
+
+	elog(ERROR, "unrecognized IOOp value: %d", io_op);
+	pg_unreachable();
+}
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index b61a12382b..5731010ade 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1255,12 +1255,16 @@ typedef enum io_stat_col
 	IO_COL_IO_OBJECT,
 	IO_COL_IO_CONTEXT,
 	IO_COL_READS,
+	IO_COL_READ_TIME,
 	IO_COL_WRITES,
+	IO_COL_WRITE_TIME,
 	IO_COL_EXTENDS,
+	IO_COL_EXTEND_TIME,
 	IO_COL_CONVERSION,
 	IO_COL_EVICTIONS,
 	IO_COL_REUSES,
 	IO_COL_FSYNCS,
+	IO_COL_FSYNC_TIME,
 	IO_COL_RESET_TIME,
 	IO_NUM_COLUMNS,
 } io_stat_col;
@@ -1292,6 +1296,21 @@ pgstat_get_io_op_index(IOOp io_op)
 	pg_unreachable();
 }
 
+/*
+ * Get the number of the column containing IO times for the specified IOOp. If
+ * the specified IOOp is one for which IO time is not tracked, return -1. Note
+ * that this function assumes that IO time for an IOOp is displayed in the view
+ * in the column directly after the IOOp counts.
+ */
+static io_stat_col
+pgstat_get_io_time_index(IOOp io_op)
+{
+	if (pgstat_tracks_io_time(io_op) == -1)
+		return -1;
+
+	return pgstat_get_io_op_index(io_op) + 1;
+}
+
 Datum
 pg_stat_get_io(PG_FUNCTION_ARGS)
 {
@@ -1359,20 +1378,31 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
 
 				for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
 				{
-					int			col_idx = pgstat_get_io_op_index(io_op);
+					int         i = pgstat_get_io_op_index(io_op);
 
 					/*
 					 * Some combinations of BackendType and IOOp, of IOContext
 					 * and IOOp, and of IOObject and IOOp are not tracked. Set
 					 * these cells in the view NULL.
 					 */
-					nulls[col_idx] = !pgstat_tracks_io_op(bktype, io_obj, io_context, io_op);
+					if (pgstat_tracks_io_op(bktype, io_obj, io_context, io_op))
+						values[i] = Int64GetDatum(bktype_stats->counts[io_obj][io_context][io_op]);
+					else
+						nulls[i] = true;
+				}
+
+				for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+				{
+					int         i = pgstat_get_io_time_index(io_op);
 
-					if (nulls[col_idx])
+					if (i == -1)
 						continue;
 
-					values[col_idx] =
-						Int64GetDatum(bktype_stats->data[io_obj][io_context][io_op]);
+					if (!nulls[pgstat_get_io_op_index(io_op)])
+						values[i] =
+							Float8GetDatum(INSTR_TIME_GET_MILLISEC(bktype_stats->times[io_obj][io_context][io_op]));
+					else
+						nulls[i] = true;
 				}
 
 				tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 505595620e..80c0627209 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5721,9 +5721,9 @@
   proname => 'pg_stat_get_io', provolatile => 'v',
   prorows => '30', proretset => 't',
   proparallel => 'r', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{text,text,text,int8,int8,int8,int8,int8,int8,int8,timestamptz}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{backend_type,io_object,io_context,reads,writes,extends,op_bytes,evictions,reuses,fsyncs,stats_reset}',
+  proallargtypes => '{text,text,text,int8,float8,int8,float8,int8,float8,int8,int8,int8,int8,float8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{backend_type,io_object,io_context,reads,read_time,writes,write_time,extends,extend_time,op_bytes,evictions,reuses,fsyncs,fsync_time,stats_reset}',
   prosrc => 'pg_stat_get_io' },
 
 { oid => '1136', descr => 'statistics: information about WAL activity',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index f43fac09ed..687aff4859 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -313,7 +313,8 @@ typedef enum IOOp
 
 typedef struct PgStat_BktypeIO
 {
-	PgStat_Counter data[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	instr_time	times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 } PgStat_BktypeIO;
 
 typedef struct PgStat_IO
@@ -507,6 +508,7 @@ extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
 extern bool pgstat_bktype_io_stats_valid(PgStat_BktypeIO *context_ops,
 										 BackendType bktype);
 extern void pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op);
+extern void pgstat_count_io_time(IOObject io_object, IOContext io_context, IOOp io_op, instr_time time);
 extern PgStat_IO *pgstat_fetch_stat_io(void);
 extern const char *pgstat_get_io_context_name(IOContext io_context);
 extern const char *pgstat_get_io_object_name(IOObject io_object);
@@ -516,6 +518,7 @@ extern bool pgstat_tracks_io_object(BackendType bktype,
 									IOObject io_object, IOContext io_context);
 extern bool pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 								IOContext io_context, IOOp io_op);
+extern IOOp pgstat_tracks_io_time(IOOp io_op);
 
 
 /*
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index e953d1f515..5434851314 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1880,14 +1880,18 @@ pg_stat_io| SELECT backend_type,
     io_object,
     io_context,
     reads,
+    read_time,
     writes,
+    write_time,
     extends,
+    extend_time,
     op_bytes,
     evictions,
     reuses,
     fsyncs,
+    fsync_time,
     stats_reset
-   FROM pg_stat_get_io() b(backend_type, io_object, io_context, reads, writes, extends, op_bytes, evictions, reuses, fsyncs, stats_reset);
+   FROM pg_stat_get_io() b(backend_type, io_object, io_context, reads, read_time, writes, write_time, extends, extend_time, op_bytes, evictions, reuses, fsyncs, fsync_time, stats_reset);
 pg_stat_progress_analyze| SELECT s.pid,
     s.datid,
     d.datname,
#3Melanie Plageman
melanieplageman@gmail.com
In reply to: Drouvot, Bertrand (#2)
1 attachment(s)
Re: Track IO times in pg_stat_io

Thanks for the review!

On Tue, Feb 28, 2023 at 4:49 AM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

On 2/26/23 5:03 PM, Melanie Plageman wrote:

As suggested in [1], the attached patch adds IO times to pg_stat_io;

Thanks for the patch!

I started to have a look at it and figured out that a tiny rebase was needed (due to
728560db7d and b9f0e54bc9), so please find the rebase (aka V2) attached.

Thanks for doing that!

The timings will only be non-zero when track_io_timing is on

That could lead to incorrect interpretation if one wants to divide the timing by the number of operations, say if:

- track_io_timing is set to on while there are already operations
- or set to off while it was on (and the number of operations keeps growing)

Might be worth warning about/highlighting this in the "track_io_timing" doc?

This is a good point. I've added a note to the docs for pg_stat_io.
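
To make the pitfall concrete, here is a minimal standalone sketch in C. The helper avg_io_time_ms() and the numbers are hypothetical and not part of the patch; it only divides an accumulated time by an operation count, the way a user might divide read_time by reads.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch: average time per operation in
 * milliseconds, or -1.0 when there are no operations to average over.
 */
static double
avg_io_time_ms(double total_time_ms, int64_t op_count)
{
	assert(op_count >= 0);

	if (op_count == 0)
		return -1.0;

	return total_time_ms / (double) op_count;
}
```

With 100 reads timed at 2 ms each, avg_io_time_ms(200.0, 100) gives 2.0. If track_io_timing is then turned off and 100 more reads are counted without timing, avg_io_time_ms(200.0, 200) gives 1.0, which understates the real per-read time by half.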

+               if (track_io_timing)
+               {
+                       INSTR_TIME_SET_CURRENT(io_time);
+                       INSTR_TIME_SUBTRACT(io_time, io_start);
+                       pgstat_count_io_time(io_object, io_context, IOOP_EXTEND, io_time);
+               }
+
+
pgstat_count_io_op(io_object, io_context, IOOP_EXTEND);

vs

@@ -1042,6 +1059,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
INSTR_TIME_SUBTRACT(io_time, io_start);
pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
+                               pgstat_count_io_time(io_object, io_context, IOOP_READ, io_time);
}

That leads to pgstat_count_io_time() being called before pgstat_count_io_op() (for the IOOP_EXTEND case) and
after pgstat_count_io_op() (for the IOOP_READ case).

What about calling them in the same order, so that pgstat_count_io_time() is always called before pgstat_count_io_op()?

If so, the ordering would also need to be changed in:

- FlushRelationBuffers()
- register_dirty_segment()

Yes, good point. I've updated the code to use this suggested ordering in
attached v3.
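
As a rough illustration of that ordering, here is a standalone sketch in C. Every name in it (DemoIOStats, demo_timed_io(), the fake clock) is hypothetical and exists only for this example; the patch itself uses instr_time, INSTR_TIME_SET_CURRENT(), pgstat_count_io_time() and pgstat_count_io_op().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; none of these names exist in PostgreSQL. */
typedef struct DemoIOStats
{
	int64_t		count;		/* operations performed */
	int64_t		time_us;	/* accumulated time, in microseconds */
} DemoIOStats;

typedef int64_t (*demo_clock_fn) (void);
typedef void (*demo_io_fn) (void);

/* Deterministic fake clock: advances 5 microseconds per call. */
static int64_t demo_now_us = 0;

static int64_t
demo_clock(void)
{
	demo_now_us += 5;
	return demo_now_us;
}

/* Placeholder for the actual read/write/extend/fsync call. */
static void
demo_noop_io(void)
{
}

/*
 * Time one IO operation if timing is enabled, then count it. The time is
 * recorded before the count, matching the ordering discussed above.
 */
static void
demo_timed_io(DemoIOStats *stats, bool track_timing,
			  demo_io_fn do_io, demo_clock_fn clock_us)
{
	int64_t		start_us = 0;

	assert(stats != NULL);

	if (track_timing)
		start_us = clock_us();

	do_io();

	if (track_timing)
		stats->time_us += clock_us() - start_us;

	stats->count++;
}
```

The count is incremented whether or not timing is enabled, so a cell can have a non-zero count with zero time, but not the other way around.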

There is one minor question (in the code as a TODO) which is whether or
not it is worth cross-checking that IO counts and times are either both
zero or neither zero in the validation function
pgstat_bktype_io_stats_valid().

As pgstat_bktype_io_stats_valid() is called only in Assert(), I think it would be a good idea
to also check that if counts are not zero then times are not zero.

Yes, I think adding some validation around the relationship between
counts and timing should help prevent developers from forgetting to call
pgstat_count_io_op() when calling pgstat_count_io_time() (as relevant).

However, I think that we cannot check that if IO counts are non-zero
that IO times are non-zero, because the user may not have
track_io_timing enabled. We can check that if IO times are not zero, IO
counts are not zero. I've done this in the attached v3.
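
The asymmetry described here can be sketched in standalone C. DemoIOCell and demo_io_cell_valid() are hypothetical names for illustration only; the actual check lives in pgstat_bktype_io_stats_valid() in the attached patch and operates on instr_time values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one (object, context, op) cell of the stats. */
typedef struct DemoIOCell
{
	int64_t		count;		/* number of operations */
	int64_t		time_us;	/* accumulated IO time, in microseconds */
} DemoIOCell;

/*
 * Non-zero time with no counted operation means a caller forgot to count
 * the op. A non-zero count with zero time is fine, because timing may have
 * been disabled while the operations happened.
 */
static bool
demo_io_cell_valid(const DemoIOCell *cell)
{
	assert(cell != NULL);

	if (cell->time_us != 0 && cell->count <= 0)
		return false;

	return true;
}
```

So a cell with a count of 3 and a time of 0 passes, while a cell with a count of 0 and a time of 120 fails.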

- Melanie

Attachments:

v3-0001-Track-IO-times-in-pg_stat_io.patchtext/x-patch; charset=US-ASCII; name=v3-0001-Track-IO-times-in-pg_stat_io.patchDownload
From 52d997001108a52c833b339f9b8dcb3d34ed3270 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 6 Mar 2023 10:41:51 -0500
Subject: [PATCH v3] Track IO times in pg_stat_io

Add IO timing for reads, writes, extends, and fsyncs to pg_stat_io.

Reviewed-by: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Discussion: https://www.postgresql.org/message-id/flat/CAAKRu_ay5iKmnbXZ3DsauViF3eMxu4m1oNnJXqV_HyqYeg55Ww%40mail.gmail.com
---
 doc/src/sgml/monitoring.sgml           | 59 ++++++++++++++++
 src/backend/catalog/system_views.sql   |  4 ++
 src/backend/storage/buffer/bufmgr.c    | 40 +++++++++--
 src/backend/storage/buffer/localbuf.c  | 14 ++++
 src/backend/storage/smgr/md.c          | 47 ++++++++++---
 src/backend/utils/activity/pgstat_io.c | 96 +++++++++++++++++++++-----
 src/backend/utils/adt/pgstatfuncs.c    | 40 +++++++++--
 src/include/catalog/pg_proc.dat        |  6 +-
 src/include/pgstat.h                   |  5 +-
 src/test/regress/expected/rules.out    |  6 +-
 10 files changed, 275 insertions(+), 42 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 6249bb50d0..ad3667f258 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3814,6 +3814,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>read_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in read operations in milliseconds (if <xref
+        linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
@@ -3826,6 +3838,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>write_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in write operations in milliseconds (if <xref
+        linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
@@ -3838,6 +3862,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>extend_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in extend operations in milliseconds (if <xref
+        linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
@@ -3902,6 +3938,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>fsync_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in fsync operations in milliseconds (if <xref
+        linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
@@ -3967,6 +4015,17 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
    </itemizedlist>
   </para>
 
+  <note>
+   <para>
+    Columns tracking I/O time will only be non-zero when <xref
+    linkend="guc-track-io-timing"/> is enabled. The user should be careful when
+    using these columns in combination with their corresponding operations to
+    ensure that <varname>track_io_timing</varname> was enabled for the entire
+    time since the last reset.
+   </para>
+  </note>
+
+
 
  </sect2>
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 34ca0e739f..39391bc2fc 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1123,12 +1123,16 @@ SELECT
        b.io_object,
        b.io_context,
        b.reads,
+       b.read_time,
        b.writes,
+       b.write_time,
        b.extends,
+       b.extend_time,
        b.op_bytes,
        b.evictions,
        b.reuses,
        b.fsyncs,
+       b.fsync_time,
        b.stats_reset
 FROM pg_stat_get_io() b;
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 0a05577b68..91aa52848a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1000,11 +1000,27 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	if (isExtend)
 	{
+		instr_time	io_start,
+					io_time;
+
 		/* new buffers are zero-filled */
 		MemSet((char *) bufBlock, 0, BLCKSZ);
+
+		if (track_io_timing)
+			INSTR_TIME_SET_CURRENT(io_start);
+		else
+			INSTR_TIME_SET_ZERO(io_start);
+
 		/* don't set checksum for all-zero page */
 		smgrextend(smgr, forkNum, blockNum, bufBlock, false);
 
+		if (track_io_timing)
+		{
+			INSTR_TIME_SET_CURRENT(io_time);
+			INSTR_TIME_SUBTRACT(io_time, io_start);
+			pgstat_count_io_time(io_object, io_context, IOOP_EXTEND, io_time);
+		}
+
 		pgstat_count_io_op(io_object, io_context, IOOP_EXTEND);
 
 		/*
@@ -1034,16 +1050,17 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 			smgrread(smgr, forkNum, blockNum, bufBlock);
 
-			pgstat_count_io_op(io_object, io_context, IOOP_READ);
-
 			if (track_io_timing)
 			{
 				INSTR_TIME_SET_CURRENT(io_time);
 				INSTR_TIME_SUBTRACT(io_time, io_start);
 				pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
 				INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
+				pgstat_count_io_time(io_object, io_context, IOOP_READ, io_time);
 			}
 
+			pgstat_count_io_op(io_object, io_context, IOOP_READ);
+
 			/* check for garbage data */
 			if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
 										PIV_LOG_WARNING | PIV_REPORT_STAT))
@@ -2981,16 +2998,16 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * When a strategy is not in use, the write can only be a "regular" write
 	 * of a dirty shared buffer (IOCONTEXT_NORMAL IOOP_WRITE).
 	 */
-	pgstat_count_io_op(IOOBJECT_RELATION, io_context, IOOP_WRITE);
-
 	if (track_io_timing)
 	{
 		INSTR_TIME_SET_CURRENT(io_time);
 		INSTR_TIME_SUBTRACT(io_time, io_start);
 		pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
 		INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
+		pgstat_count_io_time(IOOBJECT_RELATION, io_context, IOOP_WRITE, io_time);
 	}
 
+	pgstat_count_io_op(IOOBJECT_RELATION, io_context, IOOP_WRITE);
 	pgBufferUsage.shared_blks_written++;
 
 	/*
@@ -3594,6 +3611,9 @@ FlushRelationBuffers(Relation rel)
 
 	if (RelationUsesLocalBuffers(rel))
 	{
+		instr_time	io_start,
+					io_time;
+
 		for (i = 0; i < NLocBuffer; i++)
 		{
 			uint32		buf_state;
@@ -3616,6 +3636,11 @@ FlushRelationBuffers(Relation rel)
 
 				PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
+				if (track_io_timing)
+					INSTR_TIME_SET_CURRENT(io_start);
+				else
+					INSTR_TIME_SET_ZERO(io_start);
+
 				smgrwrite(RelationGetSmgr(rel),
 						  BufTagGetForkNum(&bufHdr->tag),
 						  bufHdr->tag.blockNum,
@@ -3625,6 +3650,13 @@ FlushRelationBuffers(Relation rel)
 				buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
 				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
 
+				if (track_io_timing)
+				{
+					INSTR_TIME_SET_CURRENT(io_time);
+					INSTR_TIME_SUBTRACT(io_time, io_start);
+					pgstat_count_io_time(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE, io_time);
+				}
+
 				pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
 
 				/* Pop the error context stack */
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 5325ddb663..b1272840bd 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -220,6 +220,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 	 */
 	if (buf_state & BM_DIRTY)
 	{
+		instr_time	io_start,
+					io_time;
 		SMgrRelation oreln;
 		Page		localpage = (char *) LocalBufHdrGetBlock(bufHdr);
 
@@ -228,6 +230,11 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 
 		PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
+		if (track_io_timing)
+			INSTR_TIME_SET_CURRENT(io_start);
+		else
+			INSTR_TIME_SET_ZERO(io_start);
+
 		/* And write... */
 		smgrwrite(oreln,
 				  BufTagGetForkNum(&bufHdr->tag),
@@ -239,6 +246,13 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 		buf_state &= ~BM_DIRTY;
 		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
 
+		if (track_io_timing)
+		{
+			INSTR_TIME_SET_CURRENT(io_time);
+			INSTR_TIME_SUBTRACT(io_time, io_start);
+			pgstat_count_io_time(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE, io_time);
+		}
+
 		pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
 		pgBufferUsage.local_blks_written++;
 	}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 352958e1fe..052875d86a 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1030,6 +1030,30 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 
 	if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false /* retryOnError */ ))
 	{
+		instr_time	io_start,
+					io_time;
+
+		if (track_io_timing)
+			INSTR_TIME_SET_CURRENT(io_start);
+		else
+			INSTR_TIME_SET_ZERO(io_start);
+
+		ereport(DEBUG1,
+				(errmsg_internal("could not forward fsync request because request queue is full")));
+
+		if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
+			ereport(data_sync_elevel(ERROR),
+					(errcode_for_file_access(),
+					 errmsg("could not fsync file \"%s\": %m",
+							FilePathName(seg->mdfd_vfd))));
+
+		if (track_io_timing)
+		{
+			INSTR_TIME_SET_CURRENT(io_time);
+			INSTR_TIME_SUBTRACT(io_time, io_start);
+			pgstat_count_io_time(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC, io_time);
+		}
+
 		/*
 		 * We have no way of knowing if the current IOContext is
 		 * IOCONTEXT_NORMAL or IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] at this
@@ -1042,15 +1066,6 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 		 * backend fsyncs.
 		 */
 		pgstat_count_io_op(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC);
-
-		ereport(DEBUG1,
-				(errmsg_internal("could not forward fsync request because request queue is full")));
-
-		if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
-			ereport(data_sync_elevel(ERROR),
-					(errcode_for_file_access(),
-					 errmsg("could not fsync file \"%s\": %m",
-							FilePathName(seg->mdfd_vfd))));
 	}
 }
 
@@ -1399,6 +1414,8 @@ int
 mdsyncfiletag(const FileTag *ftag, char *path)
 {
 	SMgrRelation reln = smgropen(ftag->rlocator, InvalidBackendId);
+	instr_time	io_start,
+				io_time;
 	File		file;
 	bool		need_to_close;
 	int			result,
@@ -1425,10 +1442,22 @@ mdsyncfiletag(const FileTag *ftag, char *path)
 		need_to_close = true;
 	}
 
+	if (track_io_timing)
+		INSTR_TIME_SET_CURRENT(io_start);
+	else
+		INSTR_TIME_SET_ZERO(io_start);
+
 	/* Sync the file. */
 	result = FileSync(file, WAIT_EVENT_DATA_FILE_SYNC);
 	save_errno = errno;
 
+	if (track_io_timing)
+	{
+		INSTR_TIME_SET_CURRENT(io_time);
+		INSTR_TIME_SUBTRACT(io_time, io_start);
+		pgstat_count_io_time(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC, io_time);
+	}
+
 	if (need_to_close)
 		FileClose(file);
 
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index af5d554610..4a151afed6 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -25,36 +25,48 @@ bool		have_iostats = false;
 
 /*
  * Check that stats have not been counted for any combination of IOObject,
- * IOContext, and IOOp which are not tracked for the passed-in BackendType. The
- * passed-in PgStat_BktypeIO must contain stats from the BackendType specified
- * by the second parameter. Caller is responsible for locking the passed-in
- * PgStat_BktypeIO, if needed.
+ * IOContext, and IOOp which are not tracked for the passed-in BackendType. If
+ * the IOOp is not counted for this combination but IO time is otherwise
+ * tracked for this IOOp, check that IO time has not been counted for this
+ * combination. If stats are tracked for this combination and IO times are
+ * non-zero, counts should be non-zero.
+ *
+ * The passed-in PgStat_BktypeIO must contain stats from the BackendType
+ * specified by the second parameter. Caller is responsible for locking the
+ * passed-in PgStat_BktypeIO, if needed.
  */
 bool
 pgstat_bktype_io_stats_valid(PgStat_BktypeIO *backend_io,
 							 BackendType bktype)
 {
-	bool		bktype_tracked = pgstat_tracks_io_bktype(bktype);
-
 	for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
 	{
 		for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
 		{
-			/*
-			 * Don't bother trying to skip to the next loop iteration if
-			 * pgstat_tracks_io_object() would return false here. We still
-			 * need to validate that each counter is zero anyway.
-			 */
 			for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
 			{
-				/* No stats, so nothing to validate */
-				if (backend_io->data[io_object][io_context][io_op] == 0)
+				/* we do track it */
+				if (pgstat_tracks_io_op(bktype, io_object, io_context, io_op))
+				{
+					/* ensure that if IO times are non-zero, counts are > 0 */
+					if (!INSTR_TIME_IS_ZERO(backend_io->times[io_object][io_context][io_op]) &&
+						backend_io->counts[io_object][io_context][io_op] <= 0)
+						return false;
+
 					continue;
+				}
 
-				/* There are stats and there shouldn't be */
-				if (!bktype_tracked ||
-					!pgstat_tracks_io_op(bktype, io_object, io_context, io_op))
+				/* we don't track it, and it is not 0 */
+				if (backend_io->counts[io_object][io_context][io_op] != 0)
 					return false;
+
+				/* we don't track this IOOp, so make sure its IO time is zero */
+				if (pgstat_tracks_io_time(io_op) > -1)
+				{
+					if (!INSTR_TIME_IS_ZERO(backend_io->times[io_object][io_context][io_op]))
+						return false;
+				}
+
 			}
 		}
 	}
@@ -70,7 +82,21 @@ pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op)
 	Assert((unsigned int) io_op < IOOP_NUM_TYPES);
 	Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
 
-	PendingIOStats.data[io_object][io_context][io_op]++;
+	PendingIOStats.counts[io_object][io_context][io_op]++;
+
+	have_iostats = true;
+}
+
+void
+pgstat_count_io_time(IOObject io_object, IOContext io_context, IOOp io_op, instr_time time)
+{
+	Assert(io_object < IOOBJECT_NUM_TYPES);
+	Assert(io_context < IOCONTEXT_NUM_TYPES);
+	Assert(io_op < IOOP_NUM_TYPES);
+	Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
+	Assert(pgstat_tracks_io_time(io_op) != -1);
+
+	INSTR_TIME_ADD(PendingIOStats.times[io_object][io_context][io_op], time);
 
 	have_iostats = true;
 }
@@ -114,8 +140,13 @@ pgstat_flush_io(bool nowait)
 		for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
 		{
 			for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
-				bktype_shstats->data[io_object][io_context][io_op] +=
-					PendingIOStats.data[io_object][io_context][io_op];
+			{
+				bktype_shstats->counts[io_object][io_context][io_op] +=
+					PendingIOStats.counts[io_object][io_context][io_op];
+
+				INSTR_TIME_ADD(bktype_shstats->times[io_object][io_context][io_op],
+							   PendingIOStats.times[io_object][io_context][io_op]);
+			}
 		}
 	}
 
@@ -384,3 +415,30 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 
 	return true;
 }
+
+/*
+ * PgStat_BktypeIO->times contains IO times for IOOps. For simplicity this
+ * array has a spot for every IOOp. pgstat_tracks_io_time() is the source of
+ * truth for which IOOps have corresponding IO times.
+ */
+IOOp
+pgstat_tracks_io_time(IOOp io_op)
+{
+	switch (io_op)
+	{
+		case IOOP_READ:
+			return IOOP_READ;
+		case IOOP_WRITE:
+			return IOOP_WRITE;
+		case IOOP_EXTEND:
+			return IOOP_EXTEND;
+		case IOOP_FSYNC:
+			return IOOP_FSYNC;
+		case IOOP_EVICT:
+		case IOOP_REUSE:
+			return -1;
+	}
+
+	elog(ERROR, "unrecognized IOOp value: %d", io_op);
+	pg_unreachable();
+}
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index b61a12382b..c75e661290 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1255,12 +1255,16 @@ typedef enum io_stat_col
 	IO_COL_IO_OBJECT,
 	IO_COL_IO_CONTEXT,
 	IO_COL_READS,
+	IO_COL_READ_TIME,
 	IO_COL_WRITES,
+	IO_COL_WRITE_TIME,
 	IO_COL_EXTENDS,
+	IO_COL_EXTEND_TIME,
 	IO_COL_CONVERSION,
 	IO_COL_EVICTIONS,
 	IO_COL_REUSES,
 	IO_COL_FSYNCS,
+	IO_COL_FSYNC_TIME,
 	IO_COL_RESET_TIME,
 	IO_NUM_COLUMNS,
 } io_stat_col;
@@ -1292,6 +1296,21 @@ pgstat_get_io_op_index(IOOp io_op)
 	pg_unreachable();
 }
 
+/*
+ * Get the number of the column containing IO times for the specified IOOp. If
+ * the specified IOOp is one for which IO time is not tracked, return -1. Note
+ * that this function assumes that IO time for an IOOp is displayed in the view
+ * in the column directly after the IOOp counts.
+ */
+static io_stat_col
+pgstat_get_io_time_index(IOOp io_op)
+{
+	if (pgstat_tracks_io_time(io_op) == -1)
+		return -1;
+
+	return pgstat_get_io_op_index(io_op) + 1;
+}
+
 Datum
 pg_stat_get_io(PG_FUNCTION_ARGS)
 {
@@ -1359,20 +1378,31 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
 
 				for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
 				{
-					int			col_idx = pgstat_get_io_op_index(io_op);
+					int			i = pgstat_get_io_op_index(io_op);
 
 					/*
 					 * Some combinations of BackendType and IOOp, of IOContext
 					 * and IOOp, and of IOObject and IOOp are not tracked. Set
 					 * these cells in the view NULL.
 					 */
-					nulls[col_idx] = !pgstat_tracks_io_op(bktype, io_obj, io_context, io_op);
+					if (pgstat_tracks_io_op(bktype, io_obj, io_context, io_op))
+						values[i] = Int64GetDatum(bktype_stats->counts[io_obj][io_context][io_op]);
+					else
+						nulls[i] = true;
+				}
+
+				for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+				{
+					int			i = pgstat_get_io_time_index(io_op);
 
-					if (nulls[col_idx])
+					if (i == -1)
 						continue;
 
-					values[col_idx] =
-						Int64GetDatum(bktype_stats->data[io_obj][io_context][io_op]);
+					if (!nulls[pgstat_get_io_op_index(io_op)])
+						values[i] =
+							Float8GetDatum(INSTR_TIME_GET_MILLISEC(bktype_stats->times[io_obj][io_context][io_op]));
+					else
+						nulls[i] = true;
 				}
 
 				tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 505595620e..80c0627209 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5721,9 +5721,9 @@
   proname => 'pg_stat_get_io', provolatile => 'v',
   prorows => '30', proretset => 't',
   proparallel => 'r', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{text,text,text,int8,int8,int8,int8,int8,int8,int8,timestamptz}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{backend_type,io_object,io_context,reads,writes,extends,op_bytes,evictions,reuses,fsyncs,stats_reset}',
+  proallargtypes => '{text,text,text,int8,float8,int8,float8,int8,float8,int8,int8,int8,int8,float8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{backend_type,io_object,io_context,reads,read_time,writes,write_time,extends,extend_time,op_bytes,evictions,reuses,fsyncs,fsync_time,stats_reset}',
   prosrc => 'pg_stat_get_io' },
 
 { oid => '1136', descr => 'statistics: information about WAL activity',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index f43fac09ed..687aff4859 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -313,7 +313,8 @@ typedef enum IOOp
 
 typedef struct PgStat_BktypeIO
 {
-	PgStat_Counter data[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	instr_time	times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 } PgStat_BktypeIO;
 
 typedef struct PgStat_IO
@@ -507,6 +508,7 @@ extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
 extern bool pgstat_bktype_io_stats_valid(PgStat_BktypeIO *context_ops,
 										 BackendType bktype);
 extern void pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op);
+extern void pgstat_count_io_time(IOObject io_object, IOContext io_context, IOOp io_op, instr_time time);
 extern PgStat_IO *pgstat_fetch_stat_io(void);
 extern const char *pgstat_get_io_context_name(IOContext io_context);
 extern const char *pgstat_get_io_object_name(IOObject io_object);
@@ -516,6 +518,7 @@ extern bool pgstat_tracks_io_object(BackendType bktype,
 									IOObject io_object, IOContext io_context);
 extern bool pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 								IOContext io_context, IOOp io_op);
+extern IOOp pgstat_tracks_io_time(IOOp io_op);
 
 
 /*
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index e953d1f515..5434851314 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1880,14 +1880,18 @@ pg_stat_io| SELECT backend_type,
     io_object,
     io_context,
     reads,
+    read_time,
     writes,
+    write_time,
     extends,
+    extend_time,
     op_bytes,
     evictions,
     reuses,
     fsyncs,
+    fsync_time,
     stats_reset
-   FROM pg_stat_get_io() b(backend_type, io_object, io_context, reads, writes, extends, op_bytes, evictions, reuses, fsyncs, stats_reset);
+   FROM pg_stat_get_io() b(backend_type, io_object, io_context, reads, read_time, writes, write_time, extends, extend_time, op_bytes, evictions, reuses, fsyncs, fsync_time, stats_reset);
 pg_stat_progress_analyze| SELECT s.pid,
     s.datid,
     d.datname,
-- 
2.37.2

#4Drouvot, Bertrand
bertranddrouvot.pg@gmail.com
In reply to: Melanie Plageman (#3)
Re: Track IO times in pg_stat_io

Hi,

On 3/6/23 5:30 PM, Melanie Plageman wrote:

Thanks for the review!

On Tue, Feb 28, 2023 at 4:49 AM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

On 2/26/23 5:03 PM, Melanie Plageman wrote:

The timings will only be non-zero when track_io_timing is on

That could lead to incorrect interpretation if one wants to divide the timing by the number of operations, say:

- track_io_timing is set to on while there are already operations
- or set to off while it was on (and the number of operations keeps growing)

Might be worth warning about/highlighting this in the "track_io_timing" docs?

This is a good point. I've added a note to the docs for pg_stat_io.

Thanks!

Now I've a second thought: what do you think about resetting the related number
of operations and *_time fields when enabling/disabling track_io_timing? (And mention it in the doc).

That way it'd prevent bad interpretation (at least as far as the time per operation metrics are concerned).

I'm thinking that way because we'd lose some (most?) benefits of the new *_time columns
if one can't "trust" their related operations and/or one is not sampling pg_stat_io frequently enough (to discard the samples
where the track_io_timing changes occur).

But well, resetting the operations could also lead to bad interpretation about the operations...

Not sure about which approach I like the most yet, what do you think?

That leads to pgstat_count_io_time() being called before pgstat_count_io_op() (for the IOOP_EXTEND case) and
after pgstat_count_io_op() (for the IOOP_READ case).

What about calling them in the same order, so that pgstat_count_io_time() is called before pgstat_count_io_op()?

If so, the ordering would also need to be changed in:

- FlushRelationBuffers()
- register_dirty_segment()

Yes, good point. I've updated the code to use this suggested ordering in
attached v3.

Thanks, this looks good to me.

As pgstat_bktype_io_stats_valid() is called only in Assert(), I think it would be a good idea
to also check that if counts are not zero then times are not zero.

Yes, I think adding some validation around the relationship between
counts and timing should help prevent developers from forgetting to call
pgstat_count_io_op() when calling pgstat_count_io_time() (as relevant).

However, I think that we cannot check that if IO counts are non-zero
that IO times are non-zero, because the user may not have
track_io_timing enabled.

Yeah, right.

We can check that if IO times are not zero, IO
counts are not zero. I've done this in the attached v3.

Thanks, looks good to me.
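
The one-directional check agreed on above (non-zero times imply non-zero counts, but not the reverse) could look roughly like the following. This is a standalone sketch and not the code from the v3 patch: the flattened arrays, the `io_stats_sketch` type, the `NUM_IO_OPS_SKETCH` constant, and `io_times_imply_counts()` are hypothetical stand-ins for PgStat_BktypeIO and pgstat_bktype_io_stats_valid().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed for illustration: read, write, extend, fsync, evict, reuse. */
#define NUM_IO_OPS_SKETCH 6

/* Hypothetical, flattened stand-in for PgStat_BktypeIO. */
typedef struct io_stats_sketch
{
	int64_t		counts[NUM_IO_OPS_SKETCH];
	int64_t		times_us[NUM_IO_OPS_SKETCH];
} io_stats_sketch;

/*
 * Return false if any operation has accumulated IO time without a matching
 * count. The reverse is deliberately not checked: counts can be non-zero
 * while times are zero, either because track_io_timing is off or because
 * the timer is too coarse to register the operation.
 */
static bool
io_times_imply_counts(const io_stats_sketch *stats)
{
	for (int op = 0; op < NUM_IO_OPS_SKETCH; op++)
	{
		if (stats->times_us[op] != 0 && stats->counts[op] == 0)
			return false;
	}

	return true;
}
```

Since the real validation function is only called from Assert(), a caller of this sketch would presumably wrap it the same way, e.g. `assert(io_times_imply_counts(&stats))`.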

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#5Andres Freund
andres@anarazel.de
In reply to: Melanie Plageman (#3)
Re: Track IO times in pg_stat_io

Hi,

On 2023-03-06 11:30:13 -0500, Melanie Plageman wrote:

As pgstat_bktype_io_stats_valid() is called only in Assert(), I think it would be a good idea
to also check that if counts are not zero then times are not zero.

Yes, I think adding some validation around the relationship between
counts and timing should help prevent developers from forgetting to call
pgstat_count_io_op() when calling pgstat_count_io_time() (as relevant).

However, I think that we cannot check that if IO counts are non-zero
that IO times are non-zero, because the user may not have
track_io_timing enabled. We can check that if IO times are not zero, IO
counts are not zero. I've done this in the attached v3.

And even if track_io_timing is enabled, the timer granularity might be so low
that we *still* get zeroes.

I wonder if we should get rid of pgStatBlockReadTime, pgStatBlockWriteTime,

@@ -1000,11 +1000,27 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,

if (isExtend)
{
+		instr_time	io_start,
+					io_time;
+
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
+
+		if (track_io_timing)
+			INSTR_TIME_SET_CURRENT(io_start);
+		else
+			INSTR_TIME_SET_ZERO(io_start);
+

I wonder if there's an argument for tracking this in the existing IO stats as
well. But I guess we've lived with this for a long time...

@@ -2981,16 +2998,16 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
* When a strategy is not in use, the write can only be a "regular" write
* of a dirty shared buffer (IOCONTEXT_NORMAL IOOP_WRITE).
*/
-	pgstat_count_io_op(IOOBJECT_RELATION, io_context, IOOP_WRITE);
-
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
INSTR_TIME_SUBTRACT(io_time, io_start);
pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
+		pgstat_count_io_time(IOOBJECT_RELATION, io_context, IOOP_WRITE, io_time);
}

I think this needs a bit of cleanup - pgstat_count_buffer_write_time(),
pgBufferUsage.blk_write_time++, pgstat_count_io_time() is a bit excessive. We
might not be able to reduce the whole duplication at this point, but at least
it should be a bit more centralized.

+ pgstat_count_io_op(IOOBJECT_RELATION, io_context, IOOP_WRITE);
pgBufferUsage.shared_blks_written++;

/*
@@ -3594,6 +3611,9 @@ FlushRelationBuffers(Relation rel)

if (RelationUsesLocalBuffers(rel))
{
+ instr_time io_start,
+ io_time;
+
for (i = 0; i < NLocBuffer; i++)
{
uint32 buf_state;
@@ -3616,6 +3636,11 @@ FlushRelationBuffers(Relation rel)

PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);

+				if (track_io_timing)
+					INSTR_TIME_SET_CURRENT(io_start);
+				else
+					INSTR_TIME_SET_ZERO(io_start);
+
smgrwrite(RelationGetSmgr(rel),
BufTagGetForkNum(&bufHdr->tag),
bufHdr->tag.blockNum,

I don't think you need the INSTR_TIME_SET_ZERO() in the body of the loop, to
silence the compiler warnings you can do it one level up.

@@ -228,6 +230,11 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,

PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);

+		if (track_io_timing)
+			INSTR_TIME_SET_CURRENT(io_start);
+		else
+			INSTR_TIME_SET_ZERO(io_start);
+
/* And write... */
smgrwrite(oreln,
BufTagGetForkNum(&bufHdr->tag),
@@ -239,6 +246,13 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
buf_state &= ~BM_DIRTY;
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+		if (track_io_timing)
+		{
+			INSTR_TIME_SET_CURRENT(io_time);
+			INSTR_TIME_SUBTRACT(io_time, io_start);
+			pgstat_count_io_time(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE, io_time);
+		}
+
pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
pgBufferUsage.local_blks_written++;
}

Perhaps we can instead introduce a FlushLocalBuffer()? Then we don't need this
in multiple write paths.

diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 352958e1fe..052875d86a 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1030,6 +1030,30 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false /* retryOnError */ ))
{
+		instr_time	io_start,
+					io_time;
+
+		if (track_io_timing)
+			INSTR_TIME_SET_CURRENT(io_start);
+		else
+			INSTR_TIME_SET_ZERO(io_start);
+
+		ereport(DEBUG1,
+				(errmsg_internal("could not forward fsync request because request queue is full")));
+
+		if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
+			ereport(data_sync_elevel(ERROR),
+					(errcode_for_file_access(),
+					 errmsg("could not fsync file \"%s\": %m",
+							FilePathName(seg->mdfd_vfd))));
+
+		if (track_io_timing)
+		{
+			INSTR_TIME_SET_CURRENT(io_time);
+			INSTR_TIME_SUBTRACT(io_time, io_start);
+			pgstat_count_io_time(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC, io_time);
+		}
+
/*
* We have no way of knowing if the current IOContext is
* IOCONTEXT_NORMAL or IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] at this
@@ -1042,15 +1066,6 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
* backend fsyncs.
*/
pgstat_count_io_op(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC);
-
-		ereport(DEBUG1,
-				(errmsg_internal("could not forward fsync request because request queue is full")));
-
-		if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
-			ereport(data_sync_elevel(ERROR),
-					(errcode_for_file_access(),
-					 errmsg("could not fsync file \"%s\": %m",
-							FilePathName(seg->mdfd_vfd))));
}
}

@@ -1399,6 +1414,8 @@ int
mdsyncfiletag(const FileTag *ftag, char *path)
{
SMgrRelation reln = smgropen(ftag->rlocator, InvalidBackendId);
+ instr_time io_start,
+ io_time;
File file;
bool need_to_close;
int result,
@@ -1425,10 +1442,22 @@ mdsyncfiletag(const FileTag *ftag, char *path)
need_to_close = true;
}

+	if (track_io_timing)
+		INSTR_TIME_SET_CURRENT(io_start);
+	else
+		INSTR_TIME_SET_ZERO(io_start);
+
/* Sync the file. */
result = FileSync(file, WAIT_EVENT_DATA_FILE_SYNC);
save_errno = errno;
+	if (track_io_timing)
+	{
+		INSTR_TIME_SET_CURRENT(io_time);
+		INSTR_TIME_SUBTRACT(io_time, io_start);
+		pgstat_count_io_time(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC, io_time);
+	}
+
if (need_to_close)
FileClose(file);

Perhaps we could have mdsyncfd(), used by both mdsyncfiletag() and
register_dirty_segment()?

@@ -1359,20 +1378,31 @@ pg_stat_get_io(PG_FUNCTION_ARGS)

for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
{
-					int			col_idx = pgstat_get_io_op_index(io_op);
+					int			i = pgstat_get_io_op_index(io_op);
/*
* Some combinations of BackendType and IOOp, of IOContext
* and IOOp, and of IOObject and IOOp are not tracked. Set
* these cells in the view NULL.
*/
-					nulls[col_idx] = !pgstat_tracks_io_op(bktype, io_obj, io_context, io_op);
+					if (pgstat_tracks_io_op(bktype, io_obj, io_context, io_op))
+						values[i] = Int64GetDatum(bktype_stats->counts[io_obj][io_context][io_op]);
+					else
+						nulls[i] = true;
+				}

These lines were already too long, and it's getting worse with this change.

typedef struct PgStat_BktypeIO
{
-	PgStat_Counter data[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	instr_time	times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
} PgStat_BktypeIO;

Ah, you're going to hate me. We can't store instr_time on disk. There's
another patch that gets substantial performance gains by varying the frequency
at which instr_time keeps track of time based on the CPU frequency... It also
just doesn't have enough range to keep track of system wide time on a larger
system. A single backend won't run for 293 years, but with a few thousand
backends that's a whole different story.

I think we need to accumulate in instr_time, but convert to floating point
when flushing stats.
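
A minimal sketch of that idea, under stated assumptions: the backend-local pending stats keep an opaque tick count (standing in for instr_time), and only the flush into the shared, persisted stats converts to a floating-point duration in a fixed unit. The type names, the `SKETCH_TICKS_PER_USEC` conversion factor, and both functions are hypothetical and do not come from the patch.

```c
#include <stdint.h>

/* Hypothetical stand-in for instr_time: an opaque tick count (assumed ns). */
typedef struct tick_time_sketch
{
	int64_t		ticks;
} tick_time_sketch;

/* Assumed conversion factor; the real one could depend on the clock source. */
#define SKETCH_TICKS_PER_USEC 1000.0

/* Backend-local pending stats: cheap integer accumulation. */
typedef struct pending_io_sketch
{
	tick_time_sketch time;
} pending_io_sketch;

/* Shared stats that may be written to disk: fixed unit, floating point. */
typedef struct shared_io_sketch
{
	double		time_us;
} shared_io_sketch;

/* Accumulate one IO's elapsed ticks in the pending stats. */
static void
pending_add_sketch(pending_io_sketch *pending, int64_t elapsed_ticks)
{
	pending->time.ticks += elapsed_ticks;
}

/* Convert to microseconds only at flush time, then reset the pending value. */
static void
flush_pending_sketch(shared_io_sketch *shared, pending_io_sketch *pending)
{
	shared->time_us += (double) pending->time.ticks / SKETCH_TICKS_PER_USEC;
	pending->time.ticks = 0;
}
```

With this split, the on-disk format no longer depends on how instr_time represents time, and the shared total is not limited by the range of a single integer tick counter.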

Greetings,

Andres Freund

#6Melanie Plageman
melanieplageman@gmail.com
In reply to: Drouvot, Bertrand (#4)
Re: Track IO times in pg_stat_io

Thanks for taking another look!

On Tue, Mar 7, 2023 at 10:52 AM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

On 3/6/23 5:30 PM, Melanie Plageman wrote:

Thanks for the review!

On Tue, Feb 28, 2023 at 4:49 AM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

On 2/26/23 5:03 PM, Melanie Plageman wrote:

The timings will only be non-zero when track_io_timing is on

That could lead to incorrect interpretation if one wants to divide the timing by the number of operations, say:

- track_io_timing is set to on while there are already operations
- or set to off while it was on (and the number of operations keeps growing)

Might be worth warning about/highlighting this in the "track_io_timing" docs?

This is a good point. I've added a note to the docs for pg_stat_io.

Thanks!

Now I've a second thought: what do you think about resetting the related number
of operations and *_time fields when enabling/disabling track_io_timing? (And mention it in the doc).

That way it'd prevent bad interpretation (at least as far as the time per operation metrics are concerned).

I'm thinking that way because we'd lose some (most?) benefits of the new *_time columns
if one can't "trust" their related operations and/or one is not sampling pg_stat_io frequently enough (to discard the samples
where the track_io_timing changes occur).

But well, resetting the operations could also lead to bad interpretation about the operations...

Not sure about which approach I like the most yet, what do you think?

Oh, this is an interesting idea. I think you are right about the
synchronization issues making the statistics untrustworthy and, thus,
unusable.

Building on your idea, what if we had the times be NULL instead of zero
when track_io_timing is disabled? Then as you suggested, when you enable
track_io_timing, it resets the IOOp counts and starts the times off at
zero. However, disabling track_io_timing would only NULL out the times
and not zero out the counts.

We could also, as you say, log these events.

- Melanie

#7Andres Freund
andres@anarazel.de
In reply to: Melanie Plageman (#6)
Re: Track IO times in pg_stat_io

On 2023-03-07 13:43:28 -0500, Melanie Plageman wrote:

Now I've a second thought: what do you think about resetting the related number
of operations and *_time fields when enabling/disabling track_io_timing? (And mention it in the doc).

That way it'd prevent bad interpretation (at least as far as the time per operation metrics are concerned).

I'm thinking that way because we'd lose some (most?) benefits of the new *_time columns
if one can't "trust" their related operations and/or one is not sampling pg_stat_io frequently enough (to discard the samples
where the track_io_timing changes occur).

But well, resetting the operations could also lead to bad interpretation about the operations...

Not sure about which approach I like the most yet, what do you think?

Oh, this is an interesting idea. I think you are right about the
synchronization issues making the statistics untrustworthy and, thus,
unusable.

No, I don't think we can do that. It can be enabled on a per-session basis.

I think we simply shouldn't do anything here. This is a pre-existing issue. I
also think that losing stats when turning track_io_timing on/off would not be
helpful.

Greetings,

Andres Freund

#8Drouvot, Bertrand
bertranddrouvot.pg@gmail.com
In reply to: Andres Freund (#7)
Re: Track IO times in pg_stat_io

Hi,

On 3/7/23 7:47 PM, Andres Freund wrote:

On 2023-03-07 13:43:28 -0500, Melanie Plageman wrote:

Now I've a second thought: what do you think about resetting the related number
of operations and *_time fields when enabling/disabling track_io_timing? (And mention it in the doc).

That way it'd prevent bad interpretation (at least as far as the time per operation metrics are concerned).

I'm thinking that way because we'd lose some (most?) benefits of the new *_time columns
if one can't "trust" their related operations and/or one is not sampling pg_stat_io frequently enough (to discard the samples
where the track_io_timing changes occur).

But well, resetting the operations could also lead to bad interpretation about the operations...

Not sure about which approach I like the most yet, what do you think?

Oh, this is an interesting idea. I think you are right about the
synchronization issues making the statistics untrustworthy and, thus,
unusable.

No, I don't think we can do that. It can be enabled on a per-session basis.

Oh right. So it's even less clear to me how one would make use of those new *_time fields, given that:

- pg_stat_io is "global" across all sessions. So, even if one session is doing some "testing" and needs to turn track_io_timing on,
it is not even certain that the view only reflects its own testing (as other sessions may have turned it on too).

- There is the risk mentioned above of bad interpretations for the "time per operation" metrics.

- Even if there is frequent enough sampling of pg_stat_io, one does not know which samples contain track_io_timing changes (at the cluster or session level).

I think we simply shouldn't do anything here. This is a pre-existing issue.

Oh, never thought about it. You mean like for pg_stat_database.blks_read and pg_stat_database.blk_read_time for example?

I also think that losing stats when turning track_io_timing on/off would not be
helpful.

Yeah, I'm not 100% sure either, as that would lead to other possible bad interpretations.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#9Andres Freund
andres@anarazel.de
In reply to: Drouvot, Bertrand (#8)
Re: Track IO times in pg_stat_io

Hi,

On 2023-03-08 12:55:34 +0100, Drouvot, Bertrand wrote:

On 3/7/23 7:47 PM, Andres Freund wrote:

On 2023-03-07 13:43:28 -0500, Melanie Plageman wrote:

Now I've a second thought: what do you think about resetting the related number
of operations and *_time fields when enabling/disabling track_io_timing? (And mention it in the doc).

That way it'd prevent bad interpretation (at least as far as the time per operation metrics are concerned).

I'm thinking that way because we'd lose some (most?) benefits of the new *_time columns
if one can't "trust" their related operations and/or one is not sampling pg_stat_io frequently enough (to discard the samples
where the track_io_timing changes occur).

But well, resetting the operations could also lead to bad interpretation about the operations...

Not sure about which approach I like the most yet, what do you think?

Oh, this is an interesting idea. I think you are right about the
synchronization issues making the statistics untrustworthy and, thus,
unusable.

No, I don't think we can do that. It can be enabled on a per-session basis.

Oh right. So it's even less clear to me how one would make use of those new *_time fields, given that:

- pg_stat_io is "global" across all sessions. So, even if one session is doing some "testing" and needs to turn track_io_timing on,
it is not even certain that the view only reflects its own testing (as other sessions may have turned it on too).

I think for 17 we should provide access to per-existing-connection pg_stat_io
stats, and also provide a database aggregated version. Neither should be
particularly hard.

- There is the risk mentioned above of bad interpretations for the "time per operation" metrics.

- Even if there is frequent enough sampling of pg_stat_io, one does not know which samples contain track_io_timing changes (at the cluster or session level).

You'd just make the same use of them you do with pg_stat_database.blks_read
etc today.

I don't think it's particularly useful to use the time to calculate "per IO"
costs - they can vary *drastically* due to kernel level buffering. The point
of having the time available is that it provides information that the number
of operations doesn't provide.

I think we simply shouldn't do anything here. This is a pre-existing issue.

Oh, never thought about it. You mean like for pg_stat_database.blks_read and pg_stat_database.blk_read_time for example?

Yes.

Greetings,

Andres Freund

#10Drouvot, Bertrand
bertranddrouvot.pg@gmail.com
In reply to: Andres Freund (#9)
Re: Track IO times in pg_stat_io

Hi,

On 3/9/23 1:34 AM, Andres Freund wrote:

Hi,

On 2023-03-08 12:55:34 +0100, Drouvot, Bertrand wrote:

On 3/7/23 7:47 PM, Andres Freund wrote:

On 2023-03-07 13:43:28 -0500, Melanie Plageman wrote:
No, I don't think we can do that. It can be enabled on a per-session basis.

Oh right. So it's even less clear to me how one would make use of those new *_time fields, given that:

- pg_stat_io is "global" across all sessions. So, even if one session is doing some "testing" and needs to turn track_io_timing on,
it is not even certain that the view only reflects its own testing (as other sessions may have turned it on too).

I think for 17 we should provide access to per-existing-connection pg_stat_io
stats, and also provide a database aggregated version. Neither should be
particularly hard.

+1 that would be great.

I don't think it's particularly useful to use the time to calculate "per IO"
costs - they can vary *drastically* due to kernel level buffering.

Exactly, and I think that's the reason why it could be useful. I think it could help (with frequent enough sampling)
to identify whether or not the IOs are served by the page cache (if one knows their infrastructure well enough).

One could say (for example, depending on the environment) that if the read_time is > 4ms then the IO is served by spindle disks (if any)
and if it is far below 1ms then by the page cache.

What I mean is that one could try to characterize their IOs based on thresholds that they define themselves.
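
As a rough illustration of that kind of threshold-based characterization, the sketch below derives an average read latency from the difference between two samples of a single pg_stat_io row and compares it against a caller-supplied threshold. Everything here is hypothetical: the `io_sample_sketch` type, the enum, and `guess_read_source()` are not part of PostgreSQL, and the result is only meaningful if track_io_timing was on for every backend contributing to the row during the whole interval.

```c
#include <stdint.h>

/* Hypothetical snapshot of the reads and read_time columns of one row. */
typedef struct io_sample_sketch
{
	int64_t		reads;
	double		read_time_ms;
} io_sample_sketch;

typedef enum io_source_guess
{
	IO_SRC_UNKNOWN,				/* no reads in the interval */
	IO_SRC_LIKELY_CACHE,		/* average at or below the threshold */
	IO_SRC_LIKELY_DISK			/* average above the threshold */
} io_source_guess;

/*
 * Compare the average per-read latency between two samples against a
 * threshold chosen by the user for their own environment.
 */
static io_source_guess
guess_read_source(io_sample_sketch prev, io_sample_sketch cur,
				  double disk_threshold_ms)
{
	int64_t		reads = cur.reads - prev.reads;
	double		elapsed_ms = cur.read_time_ms - prev.read_time_ms;

	if (reads <= 0)
		return IO_SRC_UNKNOWN;

	if (elapsed_ms / (double) reads > disk_threshold_ms)
		return IO_SRC_LIKELY_DISK;

	return IO_SRC_LIKELY_CACHE;
}
```

An average hides the distribution, so a mix of cached and uncached reads can land on either side of the threshold; that is one reason histograms would be more informative.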

Adding/reporting histograms would be even better: something we could look at for 17?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#11Imseih (AWS), Sami
simseih@amazon.com
In reply to: Drouvot, Bertrand (#8)
Re: Track IO times in pg_stat_io

Now I've a second thought: what do you think about resetting the related number
of operations and *_time fields when enabling/disabling track_io_timing? (And mention it in the doc).

That way it'd prevent bad interpretation (at least as far as the time per operation metrics are concerned).

I'm thinking that way because we'd lose some (most?) benefits of the new *_time columns
if one can't "trust" their related operations and/or one is not sampling pg_stat_io frequently enough (to discard the samples
where the track_io_timing changes occur).

But well, resetting the operations could also lead to bad interpretation about the operations...

Not sure about which approach I like the most yet, what do you think?

Oh, this is an interesting idea. I think you are right about the
synchronization issues making the statistics untrustworthy and, thus,
unusable.

No, I don't think we can do that. It can be enabled on a per-session basis.

Oh right. So it's even less clear to me how one would make use of those new *_time fields, given that:

- pg_stat_io is "global" across all sessions. So, even if one session is doing some "testing" and needs to turn track_io_timing on,
it is not even certain that the view only reflects its own testing (as other sessions may have turned it on too).

- There is the risk mentioned above of bad interpretations for the "time per operation" metrics.

- Even if there is frequent enough sampling of pg_stat_io, one does not know which samples contain track_io_timing changes (at the cluster or session level).

As long as track_io_timing can be toggled, blk_write_time could lead to wrong conclusions.
I think it may be helpful to separately track the blks_read that occur while track_io_timing is enabled.

blks_read will be as is and give the overall blks_read, while a new column
blks_read_with_timing will only report on blks_read with track_io_timing enabled.

blks_read_with_timing should never be larger than blks_read.

This will then make blk_read_time valuable if it's looked at together with
the blks_read_with_timing column.
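
The proposal can be sketched as follows. This is only an illustration of the counting rule, not patch code: `read_stats_sketch` and both functions are hypothetical, and blks_read_with_timing is the proposed (not existing) column.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical counters mirroring the proposed columns. */
typedef struct read_stats_sketch
{
	int64_t		blks_read;				/* all reads */
	int64_t		blks_read_with_timing;	/* reads done while timing was on */
	double		blk_read_time_ms;		/* time for the timed reads only */
} read_stats_sketch;

/* Count one read; time is only accumulated when timing is enabled. */
static void
count_read_sketch(read_stats_sketch *stats, bool timing_enabled,
				  double elapsed_ms)
{
	stats->blks_read++;

	if (timing_enabled)
	{
		stats->blks_read_with_timing++;
		stats->blk_read_time_ms += elapsed_ms;
	}
}

/*
 * Average over the timed reads only, so that untimed reads do not dilute
 * the result. Returns -1.0 when no read was timed.
 */
static double
avg_timed_read_ms_sketch(const read_stats_sketch *stats)
{
	if (stats->blks_read_with_timing == 0)
		return -1.0;

	return stats->blk_read_time_ms / (double) stats->blks_read_with_timing;
}
```

By construction blks_read_with_timing can never exceed blks_read, and the time is divided by a denominator that covers exactly the operations that were timed.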

Regards,

--

Sami Imseih
Amazon Web Services (AWS)

#12Melanie Plageman
melanieplageman@gmail.com
In reply to: Andres Freund (#5)
1 attachment(s)
Re: Track IO times in pg_stat_io

Hi, v4 attached addresses these review comments.

On Tue, Mar 7, 2023 at 1:39 PM Andres Freund <andres@anarazel.de> wrote:

On 2023-03-06 11:30:13 -0500, Melanie Plageman wrote:

As pgstat_bktype_io_stats_valid() is called only in Assert(), I think it would be a good idea
to also check that if counts are not zero then times are not zero.

Yes, I think adding some validation around the relationship between
counts and timing should help prevent developers from forgetting to call
pgstat_count_io_op() when calling pgstat_count_io_time() (as relevant).

However, I think that we cannot check that if IO counts are non-zero
that IO times are non-zero, because the user may not have
track_io_timing enabled. We can check that if IO times are not zero, IO
counts are not zero. I've done this in the attached v3.

And even if track_io_timing is enabled, the timer granularity might be so low
that we *still* get zeroes.

I wonder if we should get rid of pgStatBlockReadTime, pgStatBlockWriteTime,

And then have pg_stat_reset_shared('io') reset pg_stat_database IO
stats?

@@ -1000,11 +1000,27 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,

if (isExtend)
{
+             instr_time      io_start,
+                                     io_time;
+
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
+
+             if (track_io_timing)
+                     INSTR_TIME_SET_CURRENT(io_start);
+             else
+                     INSTR_TIME_SET_ZERO(io_start);
+

I wonder if there's an argument for tracking this in the existing IO stats as
well. But I guess we've lived with this for a long time...

Not sure I want to include that in this patchset.

@@ -2981,16 +2998,16 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
* When a strategy is not in use, the write can only be a "regular" write
* of a dirty shared buffer (IOCONTEXT_NORMAL IOOP_WRITE).
*/
-     pgstat_count_io_op(IOOBJECT_RELATION, io_context, IOOP_WRITE);
-
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
INSTR_TIME_SUBTRACT(io_time, io_start);
pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
+             pgstat_count_io_time(IOOBJECT_RELATION, io_context, IOOP_WRITE, io_time);
}

I think this needs a bit of cleanup - pgstat_count_buffer_write_time(),
pgBufferUsage.blk_write_time++, pgstat_count_io_time() is a bit excessive. We
might not be able to reduce the whole duplication at this point, but at least
it should be a bit more centralized.

So, in the attached v4, I've introduced pgstat_io_start() and
pgstat_io_end(...). The end IO function takes the IOObject, IOOp, and
IOContext, in addition to the start_time, so that we know which
pgBufferUsage field to increment and which pgstat_count_buffer_*_time()
to call.

I will note that calling this function now causes pgBufferUsage and
pgStatBlock*Time to be incremented in a couple of places that they were
not before. I think those might have been accidental omissions, so I
think it is okay.

The exception is pgstat_count_buffer_write_time() being only called for
relations in shared buffers and not temporary relations while
pgstat_count_buffer_read_time() is called for temporary relations and
relations in shared buffers. I left that behavior as is, though it seems
like it is wrong.

I added pgstat_io_start() to pgstat.c -- not sure if it is best there.

I could separate it into a commit that does this refactoring of the
existing counting (without adding pgstat_count_io_time()) and then
another that adds pgstat_count_io_time(). I hesitated to do that until I
knew that the new functions were viable.
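
For readers without the v4 attachment at hand, the general shape of such a start/end helper pair might look like the sketch below. It is an assumption-laden illustration rather than the patch itself: the `_sketch` names, the nanosecond clock helper, and the single global accumulator are hypothetical, and the real pgstat_io_end() is described above as also taking the IOObject, IOContext, and IOOp so that it can update the matching pgBufferUsage field.

```c
#define _POSIX_C_SOURCE 199309L

#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-ins for the track_io_timing GUC and pending IO time. */
static bool track_io_timing_sketch = false;
static int64_t io_time_ns_sketch = 0;

/* Read a monotonic clock in nanoseconds. */
static int64_t
now_ns_sketch(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t) ts.tv_sec * 1000000000 + ts.tv_nsec;
}

/*
 * Return the start time, or zero when timing is disabled. Always returning
 * a value keeps callers free of "maybe uninitialized" compiler warnings.
 */
static int64_t
io_start_sketch(void)
{
	return track_io_timing_sketch ? now_ns_sketch() : 0;
}

/* Accumulate the elapsed time in one central place when timing is enabled. */
static void
io_end_sketch(int64_t start_ns)
{
	if (track_io_timing_sketch)
		io_time_ns_sketch += now_ns_sketch() - start_ns;
}
```

A call site would then wrap the IO as `start = io_start_sketch(); smgrwrite(...); io_end_sketch(start);`, which removes the repeated if/else timing blocks from each write path.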

+ pgstat_count_io_op(IOOBJECT_RELATION, io_context, IOOP_WRITE);
pgBufferUsage.shared_blks_written++;

/*
@@ -3594,6 +3611,9 @@ FlushRelationBuffers(Relation rel)

if (RelationUsesLocalBuffers(rel))
{
+             instr_time      io_start,
+                                     io_time;
+
for (i = 0; i < NLocBuffer; i++)
{
uint32          buf_state;
@@ -3616,6 +3636,11 @@ FlushRelationBuffers(Relation rel)

PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);

+                             if (track_io_timing)
+                                     INSTR_TIME_SET_CURRENT(io_start);
+                             else
+                                     INSTR_TIME_SET_ZERO(io_start);
+
smgrwrite(RelationGetSmgr(rel),
BufTagGetForkNum(&bufHdr->tag),
bufHdr->tag.blockNum,

I don't think you need the INSTR_TIME_SET_ZERO() in the body of the loop; to
silence the compiler warnings you can do it one level up.

So, I didn't move it out because I am using pgstat_io_start(), which
sets it to zero itself. However, I could eschew the pgstat_io_start()
helper function and just do inline what the function does. Do you think
the overhead of zeroing inside the loop is large enough to be worth
avoiding?

@@ -228,6 +230,11 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,

PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);

+             if (track_io_timing)
+                     INSTR_TIME_SET_CURRENT(io_start);
+             else
+                     INSTR_TIME_SET_ZERO(io_start);
+
/* And write... */
smgrwrite(oreln,
BufTagGetForkNum(&bufHdr->tag),
@@ -239,6 +246,13 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
buf_state &= ~BM_DIRTY;
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+             if (track_io_timing)
+             {
+                     INSTR_TIME_SET_CURRENT(io_time);
+                     INSTR_TIME_SUBTRACT(io_time, io_start);
+                     pgstat_count_io_time(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE, io_time);
+             }
+
pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
pgBufferUsage.local_blks_written++;
}

Perhaps we can instead introduce a FlushLocalBuffer()? Then we don't need this
in multiple write paths.

FlushLocalBuffer() is a good idea. It would be nice to have it contain
more than just
pgstat_io_start()
smgrwrite()
pgstat_io_end()
e.g. to have it include checksumming and marking dirty (more like
normal FlushBuffer()). I noticed that LocalBufferAlloc() does not set up
error traceback support for ereport and FlushRelationBuffers() does. Is
this intentional?

diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 352958e1fe..052875d86a 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1030,6 +1030,30 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false /* retryOnError */ ))
{
+             instr_time      io_start,
+                                     io_time;
+
+             if (track_io_timing)
+                     INSTR_TIME_SET_CURRENT(io_start);
+             else
+                     INSTR_TIME_SET_ZERO(io_start);
+
+             ereport(DEBUG1,
+                             (errmsg_internal("could not forward fsync request because request queue is full")));
+
+             if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
+                     ereport(data_sync_elevel(ERROR),
+                                     (errcode_for_file_access(),
+                                      errmsg("could not fsync file \"%s\": %m",
+                                                     FilePathName(seg->mdfd_vfd))));
+
+             if (track_io_timing)
+             {
+                     INSTR_TIME_SET_CURRENT(io_time);
+                     INSTR_TIME_SUBTRACT(io_time, io_start);
+                     pgstat_count_io_time(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC, io_time);
+             }
+
/*
* We have no way of knowing if the current IOContext is
* IOCONTEXT_NORMAL or IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] at this
@@ -1042,15 +1066,6 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
* backend fsyncs.
*/
pgstat_count_io_op(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC);
-
-             ereport(DEBUG1,
-                             (errmsg_internal("could not forward fsync request because request queue is full")));
-
-             if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
-                     ereport(data_sync_elevel(ERROR),
-                                     (errcode_for_file_access(),
-                                      errmsg("could not fsync file \"%s\": %m",
-                                                     FilePathName(seg->mdfd_vfd))));
}
}

@@ -1399,6 +1414,8 @@ int
mdsyncfiletag(const FileTag *ftag, char *path)
{
SMgrRelation reln = smgropen(ftag->rlocator, InvalidBackendId);
+ instr_time io_start,
+ io_time;
File file;
bool need_to_close;
int result,
@@ -1425,10 +1442,22 @@ mdsyncfiletag(const FileTag *ftag, char *path)
need_to_close = true;
}

+     if (track_io_timing)
+             INSTR_TIME_SET_CURRENT(io_start);
+     else
+             INSTR_TIME_SET_ZERO(io_start);
+
/* Sync the file. */
result = FileSync(file, WAIT_EVENT_DATA_FILE_SYNC);
save_errno = errno;
+     if (track_io_timing)
+     {
+             INSTR_TIME_SET_CURRENT(io_time);
+             INSTR_TIME_SUBTRACT(io_time, io_start);
+             pgstat_count_io_time(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC, io_time);
+     }
+
if (need_to_close)
FileClose(file);

Perhaps we could have mdsyncfd(), used by both mdsyncfiletag() and
register_dirty_segment()?

I agree it would be nice, but it seems like it would take a little bit
of work and might not be worth doing that in this patchset.

@@ -1359,20 +1378,31 @@ pg_stat_get_io(PG_FUNCTION_ARGS)

for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
{
-                                     int                     col_idx = pgstat_get_io_op_index(io_op);
+                                     int                     i = pgstat_get_io_op_index(io_op);
/*
* Some combinations of BackendType and IOOp, of IOContext
* and IOOp, and of IOObject and IOOp are not tracked. Set
* these cells in the view NULL.
*/
-                                     nulls[col_idx] = !pgstat_tracks_io_op(bktype, io_obj, io_context, io_op);
+                                     if (pgstat_tracks_io_op(bktype, io_obj, io_context, io_op))
+                                             values[i] = Int64GetDatum(bktype_stats->counts[io_obj][io_context][io_op]);
+                                     else
+                                             nulls[i] = true;
+                             }

These lines were already too long, and it's getting worse with this change.

I've started using local variables.

typedef struct PgStat_BktypeIO
{
-     PgStat_Counter data[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+     PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+     instr_time      times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
} PgStat_BktypeIO;

Ah, you're going to hate me. We can't store instr_time on disk. There's
another patch that gets substantial performance gains by varying the frequency
at which instr_time keeps track of time based on the CPU frequency...

What does that have to do with what we can store on disk?

If so, would it not be enough to do this when reading/writing the stats
file?

void
instr_time_deserialize(instr_time *dest, int64 *src, int length)
{
	for (int i = 0; i < length; i++)
	{
		INSTR_TIME_SET_ZERO(dest[i]);
		dest[i].ticks = src[i];
	}
}

void
instr_time_serialize(int64 *dest, instr_time *src, int length)
{
	for (int i = 0; i < length; i++)
		dest[i] = INSTR_TIME_GET_NANOSEC(src[i]);
}

It also just doesn't have enough range to keep track of system wide
time on a larger system. A single backend won't run for 293 years, but
with a few thousand backends that's a whole different story.
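
To make the range argument concrete, here is a small back-of-the-envelope sketch. It assumes the accumulator is a signed 64-bit nanosecond counter and, as a worst case, that every backend is accumulating IO time continuously; the function name is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Years until a signed 64-bit nanosecond counter overflows when
 * busy_backends backends each add one second of IO time per second of
 * wall-clock time. Hypothetical helper, worst-case assumption.
 */
static double
sketch_years_until_overflow(int64_t busy_backends)
{
	const double ns_per_year = 365.25 * 24 * 3600 * 1e9;

	assert(busy_backends > 0);

	return ((double) INT64_MAX / ns_per_year) / (double) busy_backends;
}
```

Under those assumptions a single backend gets roughly 292 years of headroom, while a few thousand backends summed into one counter get on the order of weeks.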

I think we need to accumulate in instr_time, but convert to floating point
when flushing stats.
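
One possible reading of that suggestion, sketched with hypothetical types (the struct and function names are assumptions, and plain int64 nanoseconds stand in for instr_time): each backend accumulates pending time as an integer, and the conversion to floating point happens once, at flush time, into the shared total.

```c
#include <assert.h>
#include <stdint.h>

/* Per-backend pending time, kept as an integer while accumulating. */
typedef struct
{
	int64_t		pending_ns;
} SketchPending;

/* Shared, system-wide total, kept as floating-point milliseconds. */
typedef struct
{
	double		total_ms;
} SketchShared;

/* Fold the pending integer time into the shared total and reset it. */
static void
sketch_flush(SketchShared *shared, SketchPending *pending)
{
	assert(pending->pending_ns >= 0);

	shared->total_ms += (double) pending->pending_ns / 1e6;
	pending->pending_ns = 0;
}
```

With this split the integer counter only has to cover the time between two flushes of one backend, and the shared double trades exactness for range.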

Hmmm. So, are you saying that we need to read from disk when we query
the view and add that to what is in shared memory? That we only store
the delta since the last restart in the instr_time array?

But, I don't see how that avoids the problem of backend total runtime
being 293 years. We would have to reset and write out the delta whenever
we thought the times would overflow.

But, maybe I am misunderstanding something.

- Melanie

Attachments:

v4-0001-Track-IO-times-in-pg_stat_io.patch (text/x-patch; charset=US-ASCII)
From ace28d989df51079033af67bb86a756bfeac2ba3 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 6 Mar 2023 10:41:51 -0500
Subject: [PATCH v4] Track IO times in pg_stat_io

Add IO timing for reads, writes, extends, and fsyncs to pg_stat_io.

Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/flat/CAAKRu_ay5iKmnbXZ3DsauViF3eMxu4m1oNnJXqV_HyqYeg55Ww%40mail.gmail.com
---
 doc/src/sgml/monitoring.sgml           | 59 ++++++++++++++++
 src/backend/catalog/system_views.sql   |  4 ++
 src/backend/storage/buffer/bufmgr.c    | 56 ++++++---------
 src/backend/storage/buffer/localbuf.c  |  6 +-
 src/backend/storage/smgr/md.c          | 27 +++++---
 src/backend/utils/activity/pgstat.c    | 77 ++++++++++++++++++++-
 src/backend/utils/activity/pgstat_io.c | 96 +++++++++++++++++++++-----
 src/backend/utils/adt/pgstatfuncs.c    | 41 +++++++++--
 src/include/catalog/pg_proc.dat        |  6 +-
 src/include/pgstat.h                   | 10 ++-
 src/test/regress/expected/rules.out    |  6 +-
 11 files changed, 311 insertions(+), 77 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 6249bb50d0..ad3667f258 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3814,6 +3814,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>read_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in read operations in milliseconds (if <xref
+        linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
@@ -3826,6 +3838,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>write_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in write operations in milliseconds (if <xref
+        linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
@@ -3838,6 +3862,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>extend_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in extend operations in milliseconds (if <xref
+        linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
@@ -3902,6 +3938,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>fsync_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in fsync operations in milliseconds (if <xref
+        linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
@@ -3967,6 +4015,17 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
    </itemizedlist>
   </para>
 
+  <note>
+   <para>
+    Columns tracking I/O time will only be non-zero when <xref
+    linkend="guc-track-io-timing"/> is enabled. The user should be careful when
+    using these columns in combination with their corresponding operations to
+    ensure that <varname>track_io_timing</varname> was enabled for the entire
+    time since the last reset.
+   </para>
+  </note>
+
+
 
  </sect2>
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 34ca0e739f..39391bc2fc 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1123,12 +1123,16 @@ SELECT
        b.io_object,
        b.io_context,
        b.reads,
+       b.read_time,
        b.writes,
+       b.write_time,
        b.extends,
+       b.extend_time,
        b.op_bytes,
        b.evictions,
        b.reuses,
        b.fsyncs,
+       b.fsync_time,
        b.stats_reset
 FROM pg_stat_get_io() b;
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 0a05577b68..64857f1ff6 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1000,12 +1000,17 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	if (isExtend)
 	{
+		instr_time	io_start;
+
 		/* new buffers are zero-filled */
 		MemSet((char *) bufBlock, 0, BLCKSZ);
+
+		io_start = pgstat_io_start();
+
 		/* don't set checksum for all-zero page */
 		smgrextend(smgr, forkNum, blockNum, bufBlock, false);
 
-		pgstat_count_io_op(io_object, io_context, IOOP_EXTEND);
+		pgstat_io_end(io_start, io_object, io_context, IOOP_EXTEND);
 
 		/*
 		 * NB: we're *not* doing a ScheduleBufferTagForWriteback here;
@@ -1024,25 +1029,13 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			MemSet((char *) bufBlock, 0, BLCKSZ);
 		else
 		{
-			instr_time	io_start,
-						io_time;
+			instr_time	io_start;
 
-			if (track_io_timing)
-				INSTR_TIME_SET_CURRENT(io_start);
-			else
-				INSTR_TIME_SET_ZERO(io_start);
+			io_start = pgstat_io_start();
 
 			smgrread(smgr, forkNum, blockNum, bufBlock);
 
-			pgstat_count_io_op(io_object, io_context, IOOP_READ);
-
-			if (track_io_timing)
-			{
-				INSTR_TIME_SET_CURRENT(io_time);
-				INSTR_TIME_SUBTRACT(io_time, io_start);
-				pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
-				INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
-			}
+			pgstat_io_end(io_start, io_object, io_context, IOOP_READ);
 
 			/* check for garbage data */
 			if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
@@ -2848,6 +2841,7 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
 	*blknum = bufHdr->tag.blockNum;
 }
 
+
 /*
  * FlushBuffer
  *		Physically write out a shared buffer.
@@ -2873,8 +2867,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 {
 	XLogRecPtr	recptr;
 	ErrorContextCallback errcallback;
-	instr_time	io_start,
-				io_time;
+	instr_time	io_start;
 	Block		bufBlock;
 	char	   *bufToWrite;
 	uint32		buf_state;
@@ -2949,10 +2942,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 */
 	bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
 
-	if (track_io_timing)
-		INSTR_TIME_SET_CURRENT(io_start);
-	else
-		INSTR_TIME_SET_ZERO(io_start);
+	io_start = pgstat_io_start();
 
 	/*
 	 * bufToWrite is either the shared buffer or a copy, as appropriate.
@@ -2981,17 +2971,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * When a strategy is not in use, the write can only be a "regular" write
 	 * of a dirty shared buffer (IOCONTEXT_NORMAL IOOP_WRITE).
 	 */
-	pgstat_count_io_op(IOOBJECT_RELATION, io_context, IOOP_WRITE);
-
-	if (track_io_timing)
-	{
-		INSTR_TIME_SET_CURRENT(io_time);
-		INSTR_TIME_SUBTRACT(io_time, io_start);
-		pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
-		INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
-	}
-
-	pgBufferUsage.shared_blks_written++;
+	pgstat_io_end(io_start, IOOBJECT_RELATION, io_context, IOOP_WRITE);
 
 	/*
 	 * Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
@@ -3597,6 +3577,7 @@ FlushRelationBuffers(Relation rel)
 		for (i = 0; i < NLocBuffer; i++)
 		{
 			uint32		buf_state;
+			instr_time	io_start;
 
 			bufHdr = GetLocalBufferDescriptor(i);
 			if (BufTagMatchesRelFileLocator(&bufHdr->tag, &rel->rd_locator) &&
@@ -3616,6 +3597,12 @@ FlushRelationBuffers(Relation rel)
 
 				PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
+				/*
+				 * TODO: is it okay that this formerly didn't count
+				 * pgBufferUsage or pgStatBlockReadTime
+				 */
+				io_start = pgstat_io_start();
+
 				smgrwrite(RelationGetSmgr(rel),
 						  BufTagGetForkNum(&bufHdr->tag),
 						  bufHdr->tag.blockNum,
@@ -3625,13 +3612,14 @@ FlushRelationBuffers(Relation rel)
 				buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
 				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
 
-				pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
+				pgstat_io_end(io_start, IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
 
 				/* Pop the error context stack */
 				error_context_stack = errcallback.previous;
 			}
 		}
 
+
 		return;
 	}
 
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 5325ddb663..af45a72934 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -220,6 +220,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 	 */
 	if (buf_state & BM_DIRTY)
 	{
+		instr_time	io_start;
 		SMgrRelation oreln;
 		Page		localpage = (char *) LocalBufHdrGetBlock(bufHdr);
 
@@ -228,6 +229,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 
 		PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
+		io_start = pgstat_io_start();
+
 		/* And write... */
 		smgrwrite(oreln,
 				  BufTagGetForkNum(&bufHdr->tag),
@@ -239,8 +242,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 		buf_state &= ~BM_DIRTY;
 		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
 
-		pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
-		pgBufferUsage.local_blks_written++;
+		pgstat_io_end(io_start, IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
 	}
 
 	/*
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 352958e1fe..2440211f03 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1030,6 +1030,17 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 
 	if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false /* retryOnError */ ))
 	{
+		instr_time	io_start = pgstat_io_start();
+
+		ereport(DEBUG1,
+				(errmsg_internal("could not forward fsync request because request queue is full")));
+
+		if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
+			ereport(data_sync_elevel(ERROR),
+					(errcode_for_file_access(),
+					 errmsg("could not fsync file \"%s\": %m",
+							FilePathName(seg->mdfd_vfd))));
+
 		/*
 		 * We have no way of knowing if the current IOContext is
 		 * IOCONTEXT_NORMAL or IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] at this
@@ -1041,16 +1052,7 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 		 * IOCONTEXT_NORMAL is likely clearer when investigating the number of
 		 * backend fsyncs.
 		 */
-		pgstat_count_io_op(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC);
-
-		ereport(DEBUG1,
-				(errmsg_internal("could not forward fsync request because request queue is full")));
-
-		if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
-			ereport(data_sync_elevel(ERROR),
-					(errcode_for_file_access(),
-					 errmsg("could not fsync file \"%s\": %m",
-							FilePathName(seg->mdfd_vfd))));
+		pgstat_io_end(io_start, IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC);
 	}
 }
 
@@ -1399,6 +1401,7 @@ int
 mdsyncfiletag(const FileTag *ftag, char *path)
 {
 	SMgrRelation reln = smgropen(ftag->rlocator, InvalidBackendId);
+	instr_time	io_start;
 	File		file;
 	bool		need_to_close;
 	int			result,
@@ -1425,6 +1428,8 @@ mdsyncfiletag(const FileTag *ftag, char *path)
 		need_to_close = true;
 	}
 
+	io_start = pgstat_io_start();
+
 	/* Sync the file. */
 	result = FileSync(file, WAIT_EVENT_DATA_FILE_SYNC);
 	save_errno = errno;
@@ -1432,7 +1437,7 @@ mdsyncfiletag(const FileTag *ftag, char *path)
 	if (need_to_close)
 		FileClose(file);
 
-	pgstat_count_io_op(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC);
+	pgstat_io_end(io_start, IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC);
 
 	errno = save_errno;
 	return result;
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 60fc4e761f..ed27dbfaab 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -95,9 +95,11 @@
 
 #include "access/transam.h"
 #include "access/xact.h"
+#include "executor/instrument.h"
 #include "lib/dshash.h"
 #include "pgstat.h"
 #include "port/atomics.h"
+#include "storage/bufmgr.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
@@ -108,7 +110,6 @@
 #include "utils/pgstat_internal.h"
 #include "utils/timestamp.h"
 
-
 /* ----------
  * Timer definitions.
  *
@@ -906,6 +907,80 @@ pgstat_have_entry(PgStat_Kind kind, Oid dboid, Oid objoid)
 	return pgstat_get_entry_ref(kind, dboid, objoid, false, NULL) != NULL;
 }
 
+instr_time
+pgstat_io_start(void)
+{
+	instr_time	io_start;
+
+	if (track_io_timing)
+		INSTR_TIME_SET_CURRENT(io_start);
+	else
+		INSTR_TIME_SET_ZERO(io_start);
+
+	return io_start;
+}
+
+void
+pgstat_io_end(instr_time io_start, IOObject io_object,
+					IOContext io_context, IOOp io_op)
+{
+	instr_time	io_time;
+
+	if (track_io_timing)
+	{
+		INSTR_TIME_SET_CURRENT(io_time);
+		INSTR_TIME_SUBTRACT(io_time, io_start);
+		pgstat_count_io_time(io_object, io_context, io_op, io_time);
+
+		if (io_op == IOOP_WRITE)
+		{
+			if (io_object == IOOBJECT_RELATION)
+			{
+				/* TODO: AFAICT, pgstat_count_buffer_write_time is only called */
+				/* for shared buffers whereas pgstat_count_buffer_read_time is */
+				/* called for temp relations and shared buffers. */
+				/*
+				 * is this intentional and should I match current behavior or
+				 * not?
+				 */
+				pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
+				INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
+			}
+		}
+		else if (io_op == IOOP_READ)
+		{
+			pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
+			if (io_object == IOOBJECT_RELATION)
+			{
+				INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
+			}
+		}
+	}
+
+	pgstat_count_io_op(io_object, io_context, io_op);
+
+	if (io_op == IOOP_WRITE)
+	{
+		if (io_object == IOOBJECT_RELATION)
+			pgBufferUsage.shared_blks_written++;
+		else if (io_object == IOOBJECT_TEMP_RELATION)
+			pgBufferUsage.local_blks_written++;
+	}
+
+	/*
+	 * TODO: this is normally done later in ReadBuffer_common() is it okay to
+	 * do here?
+	 */
+	else if (io_op == IOOP_READ)
+	{
+		if (io_object == IOOBJECT_RELATION)
+			pgBufferUsage.shared_blks_read++;
+		else if (io_object == IOOBJECT_TEMP_RELATION)
+			pgBufferUsage.local_blks_read++;
+	}
+}
+
+
 /*
  * Ensure snapshot for fixed-numbered 'kind' exists.
  *
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index af5d554610..4a151afed6 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -25,36 +25,48 @@ bool		have_iostats = false;
 
 /*
  * Check that stats have not been counted for any combination of IOObject,
- * IOContext, and IOOp which are not tracked for the passed-in BackendType. The
- * passed-in PgStat_BktypeIO must contain stats from the BackendType specified
- * by the second parameter. Caller is responsible for locking the passed-in
- * PgStat_BktypeIO, if needed.
+ * IOContext, and IOOp which are not tracked for the passed-in BackendType. If
+ * the IOOp is not counted for this combination but IO time is otherwise
+ * tracked for this IOOp, check that IO time has not been counted for this
+ * combination. If stats are tracked for this combination and IO times are
+ * non-zero, counts should be non-zero.
+ *
+ * The passed-in PgStat_BktypeIO must contain stats from the BackendType
+ * specified by the second parameter. Caller is responsible for locking the
+ * passed-in PgStat_BktypeIO, if needed.
  */
 bool
 pgstat_bktype_io_stats_valid(PgStat_BktypeIO *backend_io,
 							 BackendType bktype)
 {
-	bool		bktype_tracked = pgstat_tracks_io_bktype(bktype);
-
 	for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
 	{
 		for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
 		{
-			/*
-			 * Don't bother trying to skip to the next loop iteration if
-			 * pgstat_tracks_io_object() would return false here. We still
-			 * need to validate that each counter is zero anyway.
-			 */
 			for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
 			{
-				/* No stats, so nothing to validate */
-				if (backend_io->data[io_object][io_context][io_op] == 0)
+				/* we do track it */
+				if (pgstat_tracks_io_op(bktype, io_object, io_context, io_op))
+				{
+					/* ensure that if IO times are non-zero, counts are > 0 */
+					if (!INSTR_TIME_IS_ZERO(backend_io->times[io_object][io_context][io_op]) &&
+						backend_io->counts[io_object][io_context][io_op] <= 0)
+						return false;
+
 					continue;
+				}
 
-				/* There are stats and there shouldn't be */
-				if (!bktype_tracked ||
-					!pgstat_tracks_io_op(bktype, io_object, io_context, io_op))
+				/* we don't track it, and it is not 0 */
+				if (backend_io->counts[io_object][io_context][io_op] != 0)
 					return false;
+
+				/* we don't track this IOOp, so make sure its IO time is zero */
+				if (pgstat_tracks_io_time(io_op) > -1)
+				{
+					if (!INSTR_TIME_IS_ZERO(backend_io->times[io_object][io_context][io_op]))
+						return false;
+				}
+
 			}
 		}
 	}
@@ -70,7 +82,21 @@ pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op)
 	Assert((unsigned int) io_op < IOOP_NUM_TYPES);
 	Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
 
-	PendingIOStats.data[io_object][io_context][io_op]++;
+	PendingIOStats.counts[io_object][io_context][io_op]++;
+
+	have_iostats = true;
+}
+
+void
+pgstat_count_io_time(IOObject io_object, IOContext io_context, IOOp io_op, instr_time time)
+{
+	Assert(io_object < IOOBJECT_NUM_TYPES);
+	Assert(io_context < IOCONTEXT_NUM_TYPES);
+	Assert(io_op < IOOP_NUM_TYPES);
+	Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
+	Assert(pgstat_tracks_io_time(io_op) != -1);
+
+	INSTR_TIME_ADD(PendingIOStats.times[io_object][io_context][io_op], time);
 
 	have_iostats = true;
 }
@@ -114,8 +140,13 @@ pgstat_flush_io(bool nowait)
 		for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
 		{
 			for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
-				bktype_shstats->data[io_object][io_context][io_op] +=
-					PendingIOStats.data[io_object][io_context][io_op];
+			{
+				bktype_shstats->counts[io_object][io_context][io_op] +=
+					PendingIOStats.counts[io_object][io_context][io_op];
+
+				INSTR_TIME_ADD(bktype_shstats->times[io_object][io_context][io_op],
+							   PendingIOStats.times[io_object][io_context][io_op]);
+			}
 		}
 	}
 
@@ -384,3 +415,30 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 
 	return true;
 }
+
+/*
+ * PgStat_BktypeIO->times contains IO times for IOOps. For simplicity this
+ * array has a spot for every IOOp. pgstat_tracks_io_time() is the source of
+ * truth for which IOOps have corresponding IO times.
+ */
+IOOp
+pgstat_tracks_io_time(IOOp io_op)
+{
+	switch (io_op)
+	{
+		case IOOP_READ:
+			return IOOP_READ;
+		case IOOP_WRITE:
+			return IOOP_WRITE;
+		case IOOP_EXTEND:
+			return IOOP_EXTEND;
+		case IOOP_FSYNC:
+			return IOOP_FSYNC;
+		case IOOP_EVICT:
+		case IOOP_REUSE:
+			return -1;
+	}
+
+	elog(ERROR, "unrecognized IOOp value: %d", io_op);
+	pg_unreachable();
+}
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index b61a12382b..9c6e492fe6 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1255,12 +1255,16 @@ typedef enum io_stat_col
 	IO_COL_IO_OBJECT,
 	IO_COL_IO_CONTEXT,
 	IO_COL_READS,
+	IO_COL_READ_TIME,
 	IO_COL_WRITES,
+	IO_COL_WRITE_TIME,
 	IO_COL_EXTENDS,
+	IO_COL_EXTEND_TIME,
 	IO_COL_CONVERSION,
 	IO_COL_EVICTIONS,
 	IO_COL_REUSES,
 	IO_COL_FSYNCS,
+	IO_COL_FSYNC_TIME,
 	IO_COL_RESET_TIME,
 	IO_NUM_COLUMNS,
 } io_stat_col;
@@ -1292,6 +1296,21 @@ pgstat_get_io_op_index(IOOp io_op)
 	pg_unreachable();
 }
 
+/*
+ * Get the number of the column containing IO times for the specified IOOp. If
+ * the specified IOOp is one for which IO time is not tracked, return -1. Note
+ * that this function assumes that IO time for an IOOp is displayed in the view
+ * in the column directly after the IOOp counts.
+ */
+static io_stat_col
+pgstat_get_io_time_index(IOOp io_op)
+{
+	if (pgstat_tracks_io_time(io_op) == -1)
+		return -1;
+
+	return pgstat_get_io_op_index(io_op) + 1;
+}
+
 Datum
 pg_stat_get_io(PG_FUNCTION_ARGS)
 {
@@ -1359,20 +1378,32 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
 
 				for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
 				{
-					int			col_idx = pgstat_get_io_op_index(io_op);
+					PgStat_Counter count = bktype_stats->counts[io_obj][io_context][io_op];
+					int			i = pgstat_get_io_op_index(io_op);
 
 					/*
 					 * Some combinations of BackendType and IOOp, of IOContext
 					 * and IOOp, and of IOObject and IOOp are not tracked. Set
 					 * these cells in the view NULL.
 					 */
-					nulls[col_idx] = !pgstat_tracks_io_op(bktype, io_obj, io_context, io_op);
+					if (pgstat_tracks_io_op(bktype, io_obj, io_context, io_op))
+						values[i] = Int64GetDatum(count);
+					else
+						nulls[i] = true;
+				}
+
+				for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+				{
+					instr_time	time = bktype_stats->times[io_obj][io_context][io_op];
+					int			i = pgstat_get_io_time_index(io_op);
 
-					if (nulls[col_idx])
+					if (i == -1)
 						continue;
 
-					values[col_idx] =
-						Int64GetDatum(bktype_stats->data[io_obj][io_context][io_op]);
+					if (!nulls[pgstat_get_io_op_index(io_op)])
+						values[i] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(time));
+					else
+						nulls[i] = true;
 				}
 
 				tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 505595620e..80c0627209 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5721,9 +5721,9 @@
   proname => 'pg_stat_get_io', provolatile => 'v',
   prorows => '30', proretset => 't',
   proparallel => 'r', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{text,text,text,int8,int8,int8,int8,int8,int8,int8,timestamptz}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{backend_type,io_object,io_context,reads,writes,extends,op_bytes,evictions,reuses,fsyncs,stats_reset}',
+  proallargtypes => '{text,text,text,int8,float8,int8,float8,int8,float8,int8,int8,int8,int8,float8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{backend_type,io_object,io_context,reads,read_time,writes,write_time,extends,extend_time,op_bytes,evictions,reuses,fsyncs,fsync_time,stats_reset}',
   prosrc => 'pg_stat_get_io' },
 
 { oid => '1136', descr => 'statistics: information about WAL activity',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index f43fac09ed..a214176c1a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -313,7 +313,8 @@ typedef enum IOOp
 
 typedef struct PgStat_BktypeIO
 {
-	PgStat_Counter data[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	instr_time	times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 } PgStat_BktypeIO;
 
 typedef struct PgStat_IO
@@ -475,6 +476,11 @@ extern TimestampTz pgstat_get_stat_snapshot_timestamp(bool *have_snapshot);
 extern PgStat_Kind pgstat_get_kind_from_str(char *kind_str);
 extern bool pgstat_have_entry(PgStat_Kind kind, Oid dboid, Oid objoid);
 
+extern instr_time pgstat_io_start(void);
+
+extern void pgstat_io_end(instr_time io_start, IOObject io_object,
+								IOContext io_context, IOOp io_op);
+
 
 /*
  * Functions in pgstat_archiver.c
@@ -507,6 +513,7 @@ extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
 extern bool pgstat_bktype_io_stats_valid(PgStat_BktypeIO *context_ops,
 										 BackendType bktype);
 extern void pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op);
+extern void pgstat_count_io_time(IOObject io_object, IOContext io_context, IOOp io_op, instr_time time);
 extern PgStat_IO *pgstat_fetch_stat_io(void);
 extern const char *pgstat_get_io_context_name(IOContext io_context);
 extern const char *pgstat_get_io_object_name(IOObject io_object);
@@ -516,6 +523,7 @@ extern bool pgstat_tracks_io_object(BackendType bktype,
 									IOObject io_object, IOContext io_context);
 extern bool pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 								IOContext io_context, IOOp io_op);
+extern IOOp pgstat_tracks_io_time(IOOp io_op);
 
 
 /*
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index e953d1f515..5434851314 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1880,14 +1880,18 @@ pg_stat_io| SELECT backend_type,
     io_object,
     io_context,
     reads,
+    read_time,
     writes,
+    write_time,
     extends,
+    extend_time,
     op_bytes,
     evictions,
     reuses,
     fsyncs,
+    fsync_time,
     stats_reset
-   FROM pg_stat_get_io() b(backend_type, io_object, io_context, reads, writes, extends, op_bytes, evictions, reuses, fsyncs, stats_reset);
+   FROM pg_stat_get_io() b(backend_type, io_object, io_context, reads, read_time, writes, write_time, extends, extend_time, op_bytes, evictions, reuses, fsyncs, fsync_time, stats_reset);
 pg_stat_progress_analyze| SELECT s.pid,
     s.datid,
     d.datname,
-- 
2.37.2

#13Andres Freund
andres@anarazel.de
In reply to: Melanie Plageman (#12)
Re: Track IO times in pg_stat_io

Hi,

On 2023-03-09 11:50:38 -0500, Melanie Plageman wrote:

On Tue, Mar 7, 2023 at 1:39 PM Andres Freund <andres@anarazel.de> wrote:

On 2023-03-06 11:30:13 -0500, Melanie Plageman wrote:

As pgstat_bktype_io_stats_valid() is called only in Assert(), I think that would be a good idea
to also check that if counts are not Zero then times are not Zero.

Yes, I think adding some validation around the relationship between
counts and timing should help prevent developers from forgetting to call
pgstat_count_io_op() when calling pgstat_count_io_time() (as relevant).

However, I think that we cannot check that if IO counts are non-zero
that IO times are non-zero, because the user may not have
track_io_timing enabled. We can check that if IO times are not zero, IO
counts are not zero. I've done this in the attached v3.

And even if track_io_timing is enabled, the timer granularity might be so low
that we *still* get zeroes.
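
The one-directional check discussed above can be sketched in isolation. This is a minimal standalone illustration, not the patch's code: the io_cell struct and io_cell_valid() are hypothetical names, and plain int64_t stands in for PgStat_Counter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for one (object, context, op) cell of the IO stats. */
typedef struct io_cell
{
	int64_t		count;		/* number of IO operations */
	int64_t		time_us;	/* accumulated IO time in microseconds */
} io_cell;

/*
 * A non-zero time requires a non-zero count. The converse is deliberately
 * not checked: timing may be disabled, or the timer may be too coarse to
 * register a short IO, so a non-zero count with zero time is legitimate.
 */
static bool
io_cell_valid(const io_cell *cell)
{
	if (cell->time_us != 0 && cell->count <= 0)
		return false;

	return true;
}
```

With this shape, a cell with a count of 3 and a time of 0 passes, while a cell with a count of 0 and a time of 5 is rejected.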

I wonder if we should get rid of pgStatBlockReadTime, pgStatBlockWriteTime,

And then have pg_stat_reset_shared('io') reset pg_stat_database IO
stats?

Yes.

@@ -1000,11 +1000,27 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,

if (isExtend)
{
+             instr_time      io_start,
+                                     io_time;
+
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
+
+             if (track_io_timing)
+                     INSTR_TIME_SET_CURRENT(io_start);
+             else
+                     INSTR_TIME_SET_ZERO(io_start);
+

I wonder if there's an argument for tracking this in the existing IO stats as
well. But I guess we've lived with this for a long time...

Not sure I want to include that in this patchset.

No, probably not.

typedef struct PgStat_BktypeIO
{
-     PgStat_Counter data[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+     PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+     instr_time      times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
} PgStat_BktypeIO;

Ah, you're going to hate me. We can't store instr_time on disk. There's
another patch that gets substantial performance gains by varying the frequency
at which instr_time keeps track of time based on the CPU frequency...

What does that have to do with what we can store on disk?

The frequency can change.

If so, would it not be enough to do this when reading/writing the stats
file?

Theoretically yes. But to me it seems cleaner to do it when flushing to shared
stats. See also the overflow issue below.

void
instr_time_deserialize(instr_time *dest, int64 *src, int length)
{
for (size_t i = 0; i < length; i++)
{
INSTR_TIME_SET_ZERO(dest[i]);
dest[i].ticks = src[i];
}
}

That wouldn't be correct, because what ticks means will at some point change
between postgres stopping and starting.

It also just doesn't have enough range to keep track of system wide
time on a larger system. A single backend won't run for 293 years, but
with a few thousand backends that's a whole different story.

I think we need to accumulate in instr_time, but convert to floating point
when flushing stats.

Hmmm. So, are you saying that we need to read from disk when we query
the view and add that to what is in shared memory? That we only store
the delta since the last restart in the instr_time array?

No, I don't think I am suggesting that. What I am trying to suggest is that
PendingIOStats should contain instr_time, but that PgStat_IO should contain
PgStat_Counter as microseconds, as before.

But, I don't see how that avoids the problem of backend total runtime
being 293 years. We would have to reset and write out the delta whenever
we thought the times would overflow.

The overflow risk is due to storing nanoseconds (which is what instr_time
stores internally on Linux now) - which we don't need to do once
accumulated. Right now we store them as microseconds.

nanosecond range:
((2**63) - 1)/(10**9*60*60*24*365) -> 292 years
microsecond range:
((2**63) - 1)/(10**6*60*60*24*365) -> 292471 years

If you assume 5k connections continually doing IO, a range of 292 years would
last 21 days at nanosecond resolution. At microsecond resolution it's 58
years.
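
The range arithmetic above can be reproduced with a short standalone sketch. The function names are hypothetical, and a 365-day year is assumed, as in the figures quoted above.

```c
#include <stdint.h>

#define SECS_PER_YEAR (60.0 * 60 * 24 * 365)

/*
 * Years of accumulated time that a signed 64-bit counter can hold, given
 * how many ticks it counts per second (1e9 for nanoseconds, 1e6 for
 * microseconds).
 */
static double
counter_range_years(double ticks_per_sec)
{
	return (double) INT64_MAX / (ticks_per_sec * SECS_PER_YEAR);
}

/*
 * Wall-clock years until the counter overflows when nbackends backends
 * are all accumulating IO time into it continuously.
 */
static double
wallclock_years_until_overflow(double ticks_per_sec, int nbackends)
{
	return counter_range_years(ticks_per_sec) / nbackends;
}
```

Calling wallclock_years_until_overflow(1e9, 5000) gives roughly 0.058 years, about 21 days, while the same call with 1e6 gives roughly 58 years.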

Greetings,

Andres Freund

#14Melanie Plageman
melanieplageman@gmail.com
In reply to: Andres Freund (#13)
1 attachment(s)
Re: Track IO times in pg_stat_io

v5 attached mostly addresses instr_time persistence issues.

On Tue, Mar 14, 2023 at 6:56 PM Andres Freund <andres@anarazel.de> wrote:

On 2023-03-09 11:50:38 -0500, Melanie Plageman wrote:

On Tue, Mar 7, 2023 at 1:39 PM Andres Freund <andres@anarazel.de> wrote:

On 2023-03-06 11:30:13 -0500, Melanie Plageman wrote:

As pgstat_bktype_io_stats_valid() is called only in Assert(), I think that would be a good idea
to also check that if counts are not Zero then times are not Zero.

Yes, I think adding some validation around the relationship between
counts and timing should help prevent developers from forgetting to call
pgstat_count_io_op() when calling pgstat_count_io_time() (as relevant).

However, I think that we cannot check that if IO counts are non-zero
that IO times are non-zero, because the user may not have
track_io_timing enabled. We can check that if IO times are not zero, IO
counts are not zero. I've done this in the attached v3.

And even if track_io_timing is enabled, the timer granularity might be so low
that we *still* get zeroes.

I wonder if we should get rid of pgStatBlockReadTime, pgStatBlockWriteTime,

And then have pg_stat_reset_shared('io') reset pg_stat_database IO
stats?

Yes.

I think this makes sense but I am hesitant to do it in this patchset,
because it feels a bit hidden...maybe?

But, if you feel strongly, I will make the change.

typedef struct PgStat_BktypeIO
{
-     PgStat_Counter data[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+     PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+     instr_time      times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
} PgStat_BktypeIO;

Ah, you're going to hate me. We can't store instr_time on disk. There's
another patch that gets substantial performance gains by varying the frequency
at which instr_time keeps track of time based on the CPU frequency...

What does that have to do with what we can store on disk?

The frequency can change.

Ah, I see.

If so, would it not be enough to do this when reading/writing the stats
file?

Theoretically yes. But to me it seems cleaner to do it when flushing to shared
stats. See also the overflow issue below.

It also just doesn't have enough range to keep track of system wide
time on a larger system. A single backend won't run for 293 years, but
with a few thousand backends that's a whole different story.

I think we need to accumulate in instr_time, but convert to floating point
when flushing stats.

Hmmm. So, are you saying that we need to read from disk when we query
the view and add that to what is in shared memory? That we only store
the delta since the last restart in the instr_time array?

No, I don't think I am suggesting that. What I am trying to suggest is that
PendingIOStats should contain instr_time, but that PgStat_IO should contain
PgStat_Counter as microseconds, as before.

So, I've modified the code to make a union of instr_time and
PgStat_Counter in PgStat_BktypeIO. I am not quite sure if this is okay.
I store the time in microseconds and then, in pg_stat_io, convert it to
milliseconds for display.
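
The accumulate, flush, and display steps can be sketched in isolation as follows. This is a standalone illustration under stated assumptions, not the patch's code: pending_io, shared_io, flush_io_time() and us_to_ms() are hypothetical names, and the pending time is assumed to be held in nanoseconds.

```c
#include <stdint.h>

/* Hypothetical stand-ins: backend-local pending time and shared stats. */
typedef struct pending_io
{
	int64_t		time_ns;	/* accumulated at timer resolution */
} pending_io;

typedef struct shared_io
{
	int64_t		time_us;	/* stored as microseconds */
} shared_io;

/*
 * Flush: convert the pending nanoseconds to microseconds, add them to the
 * shared total, and reset the pending value. Any sub-microsecond remainder
 * is dropped by the integer division.
 */
static void
flush_io_time(pending_io *pending, shared_io *shared)
{
	shared->time_us += pending->time_ns / 1000;
	pending->time_ns = 0;
}

/*
 * Display: return a double, so that fractional milliseconds survive. An
 * integer return type here would truncate them.
 */
static double
us_to_ms(int64_t time_us)
{
	return (double) time_us * 0.001;
}
```

One thing to watch in a conversion like this is the return type of the display helper: if it returns an integer type, the multiplication by 0.001 is truncated back to whole milliseconds.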

I considered refactoring pgstat_io_end() to use INSTR_TIME_ACCUM_DIFF()
like [1], but in the end I think I would end up with more
operations because of the various counters that need to be
updated. As it is now, I do a single subtract and a few adds (one for
each of the statistics objects tracking IO times: pgBufferUsage and
pgStatBlockWrite/ReadTime), whereas I would need to do
an accum diff for every one of those.
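
The trade-off can be sketched with plain integers standing in for instr_time. This is a hypothetical illustration, not the patch's code: the io_timers struct and record_io_time() are invented names, and timestamps are assumed to be nanoseconds.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the separate places that track IO time. */
typedef struct io_timers
{
	int64_t		stat_io_ns;			/* pending pg_stat_io time */
	int64_t		buffer_usage_ns;	/* per-query buffer usage time */
	int64_t		db_block_time_us;	/* per-database block IO time */
} io_timers;

/*
 * Compute the elapsed time once, then add it to each counter. The
 * alternative is one accumulate-difference per counter, which repeats the
 * subtraction for every destination.
 */
static void
record_io_time(io_timers *t, int64_t start_ns, int64_t end_ns)
{
	int64_t		elapsed_ns = end_ns - start_ns; /* the single subtract */

	t->stat_io_ns += elapsed_ns;
	t->buffer_usage_ns += elapsed_ns;
	t->db_block_time_us += elapsed_ns / 1000;
}
```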

- Melanie

[1]: /messages/by-id/1feedb83-7aa9-cb4b-5086-598349d3f555@gmail.com

Attachments:

v5-0001-Track-IO-times-in-pg_stat_io.patch (text/x-patch; charset=US-ASCII)
From 5c02cd9cb784bf22f756fb9d92f4bd29ba2e744a Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 6 Mar 2023 10:41:51 -0500
Subject: [PATCH v5] Track IO times in pg_stat_io

Add IO timing for reads, writes, extends, and fsyncs to pg_stat_io.

Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/flat/CAAKRu_ay5iKmnbXZ3DsauViF3eMxu4m1oNnJXqV_HyqYeg55Ww%40mail.gmail.com
---
 doc/src/sgml/monitoring.sgml           | 59 +++++++++++++++
 src/backend/catalog/system_views.sql   |  4 ++
 src/backend/storage/buffer/bufmgr.c    | 56 ++++++---------
 src/backend/storage/buffer/localbuf.c  |  6 +-
 src/backend/storage/smgr/md.c          | 27 ++++---
 src/backend/utils/activity/pgstat.c    | 77 +++++++++++++++++++-
 src/backend/utils/activity/pgstat_io.c | 99 +++++++++++++++++++++-----
 src/backend/utils/adt/pgstatfuncs.c    | 48 +++++++++++--
 src/include/catalog/pg_proc.dat        |  6 +-
 src/include/pgstat.h                   | 14 +++-
 src/test/regress/expected/rules.out    |  6 +-
 11 files changed, 325 insertions(+), 77 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 6249bb50d0..ad3667f258 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3814,6 +3814,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>read_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in read operations in milliseconds (if <xref
+        linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
@@ -3826,6 +3838,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>write_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in write operations in milliseconds (if <xref
+        linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
@@ -3838,6 +3862,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>extend_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in extend operations in milliseconds (if <xref
+        linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
@@ -3902,6 +3938,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>fsync_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in fsync operations in milliseconds (if <xref
+        linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
@@ -3967,6 +4015,17 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
    </itemizedlist>
   </para>
 
+  <note>
+   <para>
+    Columns tracking I/O time will only be non-zero when <xref
+    linkend="guc-track-io-timing"/> is enabled. The user should be careful when
+    using these columns in combination with their corresponding operations to
+    ensure that <varname>track_io_timing</varname> was enabled for the entire
+    time since the last reset.
+   </para>
+  </note>
+
+
 
  </sect2>
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 34ca0e739f..39391bc2fc 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1123,12 +1123,16 @@ SELECT
        b.io_object,
        b.io_context,
        b.reads,
+       b.read_time,
        b.writes,
+       b.write_time,
        b.extends,
+       b.extend_time,
        b.op_bytes,
        b.evictions,
        b.reuses,
        b.fsyncs,
+       b.fsync_time,
        b.stats_reset
 FROM pg_stat_get_io() b;
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 0a05577b68..64857f1ff6 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1000,12 +1000,17 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	if (isExtend)
 	{
+		instr_time	io_start;
+
 		/* new buffers are zero-filled */
 		MemSet((char *) bufBlock, 0, BLCKSZ);
+
+		io_start = pgstat_io_start();
+
 		/* don't set checksum for all-zero page */
 		smgrextend(smgr, forkNum, blockNum, bufBlock, false);
 
-		pgstat_count_io_op(io_object, io_context, IOOP_EXTEND);
+		pgstat_io_end(io_start, io_object, io_context, IOOP_EXTEND);
 
 		/*
 		 * NB: we're *not* doing a ScheduleBufferTagForWriteback here;
@@ -1024,25 +1029,13 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			MemSet((char *) bufBlock, 0, BLCKSZ);
 		else
 		{
-			instr_time	io_start,
-						io_time;
+			instr_time	io_start;
 
-			if (track_io_timing)
-				INSTR_TIME_SET_CURRENT(io_start);
-			else
-				INSTR_TIME_SET_ZERO(io_start);
+			io_start = pgstat_io_start();
 
 			smgrread(smgr, forkNum, blockNum, bufBlock);
 
-			pgstat_count_io_op(io_object, io_context, IOOP_READ);
-
-			if (track_io_timing)
-			{
-				INSTR_TIME_SET_CURRENT(io_time);
-				INSTR_TIME_SUBTRACT(io_time, io_start);
-				pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
-				INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
-			}
+			pgstat_io_end(io_start, io_object, io_context, IOOP_READ);
 
 			/* check for garbage data */
 			if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
@@ -2848,6 +2841,7 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
 	*blknum = bufHdr->tag.blockNum;
 }
 
+
 /*
  * FlushBuffer
  *		Physically write out a shared buffer.
@@ -2873,8 +2867,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 {
 	XLogRecPtr	recptr;
 	ErrorContextCallback errcallback;
-	instr_time	io_start,
-				io_time;
+	instr_time	io_start;
 	Block		bufBlock;
 	char	   *bufToWrite;
 	uint32		buf_state;
@@ -2949,10 +2942,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 */
 	bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
 
-	if (track_io_timing)
-		INSTR_TIME_SET_CURRENT(io_start);
-	else
-		INSTR_TIME_SET_ZERO(io_start);
+	io_start = pgstat_io_start();
 
 	/*
 	 * bufToWrite is either the shared buffer or a copy, as appropriate.
@@ -2981,17 +2971,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * When a strategy is not in use, the write can only be a "regular" write
 	 * of a dirty shared buffer (IOCONTEXT_NORMAL IOOP_WRITE).
 	 */
-	pgstat_count_io_op(IOOBJECT_RELATION, io_context, IOOP_WRITE);
-
-	if (track_io_timing)
-	{
-		INSTR_TIME_SET_CURRENT(io_time);
-		INSTR_TIME_SUBTRACT(io_time, io_start);
-		pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
-		INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
-	}
-
-	pgBufferUsage.shared_blks_written++;
+	pgstat_io_end(io_start, IOOBJECT_RELATION, io_context, IOOP_WRITE);
 
 	/*
 	 * Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
@@ -3597,6 +3577,7 @@ FlushRelationBuffers(Relation rel)
 		for (i = 0; i < NLocBuffer; i++)
 		{
 			uint32		buf_state;
+			instr_time	io_start;
 
 			bufHdr = GetLocalBufferDescriptor(i);
 			if (BufTagMatchesRelFileLocator(&bufHdr->tag, &rel->rd_locator) &&
@@ -3616,6 +3597,12 @@ FlushRelationBuffers(Relation rel)
 
 				PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
+				/*
+				 * TODO: is it okay that this formerly didn't count
+				 * pgBufferUsage or pgStatBlockReadTime
+				 */
+				io_start = pgstat_io_start();
+
 				smgrwrite(RelationGetSmgr(rel),
 						  BufTagGetForkNum(&bufHdr->tag),
 						  bufHdr->tag.blockNum,
@@ -3625,13 +3612,14 @@ FlushRelationBuffers(Relation rel)
 				buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
 				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
 
-				pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
+				pgstat_io_end(io_start, IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
 
 				/* Pop the error context stack */
 				error_context_stack = errcallback.previous;
 			}
 		}
 
+
 		return;
 	}
 
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 5325ddb663..af45a72934 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -220,6 +220,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 	 */
 	if (buf_state & BM_DIRTY)
 	{
+		instr_time	io_start;
 		SMgrRelation oreln;
 		Page		localpage = (char *) LocalBufHdrGetBlock(bufHdr);
 
@@ -228,6 +229,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 
 		PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
+		io_start = pgstat_io_start();
+
 		/* And write... */
 		smgrwrite(oreln,
 				  BufTagGetForkNum(&bufHdr->tag),
@@ -239,8 +242,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 		buf_state &= ~BM_DIRTY;
 		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
 
-		pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
-		pgBufferUsage.local_blks_written++;
+		pgstat_io_end(io_start, IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
 	}
 
 	/*
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 352958e1fe..2440211f03 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1030,6 +1030,17 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 
 	if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false /* retryOnError */ ))
 	{
+		instr_time	io_start = pgstat_io_start();
+
+		ereport(DEBUG1,
+				(errmsg_internal("could not forward fsync request because request queue is full")));
+
+		if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
+			ereport(data_sync_elevel(ERROR),
+					(errcode_for_file_access(),
+					 errmsg("could not fsync file \"%s\": %m",
+							FilePathName(seg->mdfd_vfd))));
+
 		/*
 		 * We have no way of knowing if the current IOContext is
 		 * IOCONTEXT_NORMAL or IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] at this
@@ -1041,16 +1052,7 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 		 * IOCONTEXT_NORMAL is likely clearer when investigating the number of
 		 * backend fsyncs.
 		 */
-		pgstat_count_io_op(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC);
-
-		ereport(DEBUG1,
-				(errmsg_internal("could not forward fsync request because request queue is full")));
-
-		if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
-			ereport(data_sync_elevel(ERROR),
-					(errcode_for_file_access(),
-					 errmsg("could not fsync file \"%s\": %m",
-							FilePathName(seg->mdfd_vfd))));
+		pgstat_io_end(io_start, IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC);
 	}
 }
 
@@ -1399,6 +1401,7 @@ int
 mdsyncfiletag(const FileTag *ftag, char *path)
 {
 	SMgrRelation reln = smgropen(ftag->rlocator, InvalidBackendId);
+	instr_time	io_start;
 	File		file;
 	bool		need_to_close;
 	int			result,
@@ -1425,6 +1428,8 @@ mdsyncfiletag(const FileTag *ftag, char *path)
 		need_to_close = true;
 	}
 
+	io_start = pgstat_io_start();
+
 	/* Sync the file. */
 	result = FileSync(file, WAIT_EVENT_DATA_FILE_SYNC);
 	save_errno = errno;
@@ -1432,7 +1437,7 @@ mdsyncfiletag(const FileTag *ftag, char *path)
 	if (need_to_close)
 		FileClose(file);
 
-	pgstat_count_io_op(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC);
+	pgstat_io_end(io_start, IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC);
 
 	errno = save_errno;
 	return result;
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index b125802b21..9f5f041d7c 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -95,9 +95,11 @@
 
 #include "access/transam.h"
 #include "access/xact.h"
+#include "executor/instrument.h"
 #include "lib/dshash.h"
 #include "pgstat.h"
 #include "port/atomics.h"
+#include "storage/bufmgr.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
@@ -108,7 +110,6 @@
 #include "utils/pgstat_internal.h"
 #include "utils/timestamp.h"
 
-
 /* ----------
  * Timer definitions.
  *
@@ -906,6 +907,80 @@ pgstat_have_entry(PgStat_Kind kind, Oid dboid, Oid objoid)
 	return pgstat_get_entry_ref(kind, dboid, objoid, false, NULL) != NULL;
 }
 
+instr_time
+pgstat_io_start(void)
+{
+	instr_time	io_start;
+
+	if (track_io_timing)
+		INSTR_TIME_SET_CURRENT(io_start);
+	else
+		INSTR_TIME_SET_ZERO(io_start);
+
+	return io_start;
+}
+
+void
+pgstat_io_end(instr_time io_start, IOObject io_object,
+			  IOContext io_context, IOOp io_op)
+{
+	instr_time	io_time;
+
+	if (track_io_timing)
+	{
+		INSTR_TIME_SET_CURRENT(io_time);
+		INSTR_TIME_SUBTRACT(io_time, io_start);
+		pgstat_count_io_time(io_object, io_context, io_op, io_time);
+
+		if (io_op == IOOP_WRITE)
+		{
+			if (io_object == IOOBJECT_RELATION)
+			{
+				/* TODO: AFAICT, pgstat_count_buffer_write_time is only called */
+				/* for shared buffers whereas pgstat_count_buffer_read_time is */
+				/* called for temp relations and shared buffers. */
+				/*
+				 * is this intentional and should I match current behavior or
+				 * not?
+				 */
+				pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
+				INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
+			}
+		}
+		else if (io_op == IOOP_READ)
+		{
+			pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
+			if (io_object == IOOBJECT_RELATION)
+			{
+				INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
+			}
+		}
+	}
+
+	pgstat_count_io_op(io_object, io_context, io_op);
+
+	if (io_op == IOOP_WRITE)
+	{
+		if (io_object == IOOBJECT_RELATION)
+			pgBufferUsage.shared_blks_written++;
+		else if (io_object == IOOBJECT_TEMP_RELATION)
+			pgBufferUsage.local_blks_written++;
+	}
+
+	/*
+	 * TODO: this is normally done later in ReadBuffer_common() is it okay to
+	 * do here?
+	 */
+	else if (io_op == IOOP_READ)
+	{
+		if (io_object == IOOBJECT_RELATION)
+			pgBufferUsage.shared_blks_read++;
+		else if (io_object == IOOBJECT_TEMP_RELATION)
+			pgBufferUsage.local_blks_read++;
+	}
+}
+
+
 /*
  * Ensure snapshot for fixed-numbered 'kind' exists.
  *
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index af5d554610..fe09cbf16d 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -25,36 +25,47 @@ bool		have_iostats = false;
 
 /*
  * Check that stats have not been counted for any combination of IOObject,
- * IOContext, and IOOp which are not tracked for the passed-in BackendType. The
- * passed-in PgStat_BktypeIO must contain stats from the BackendType specified
- * by the second parameter. Caller is responsible for locking the passed-in
- * PgStat_BktypeIO, if needed.
+ * IOContext, and IOOp which are not tracked for the passed-in BackendType. If
+ * the IOOp is not counted for this combination but IO time is otherwise
+ * tracked for this IOOp, check that IO time has not been counted for this
+ * combination. If stats are tracked for this combination and IO times are
+ * non-zero, counts should be non-zero.
+ *
+ * The passed-in PgStat_BktypeIO must contain stats from the BackendType
+ * specified by the second parameter. Caller is responsible for locking the
+ * passed-in PgStat_BktypeIO, if needed.
  */
 bool
 pgstat_bktype_io_stats_valid(PgStat_BktypeIO *backend_io,
 							 BackendType bktype)
 {
-	bool		bktype_tracked = pgstat_tracks_io_bktype(bktype);
-
 	for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
 	{
 		for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
 		{
-			/*
-			 * Don't bother trying to skip to the next loop iteration if
-			 * pgstat_tracks_io_object() would return false here. We still
-			 * need to validate that each counter is zero anyway.
-			 */
 			for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
 			{
-				/* No stats, so nothing to validate */
-				if (backend_io->data[io_object][io_context][io_op] == 0)
+				/* we do track it */
+				if (pgstat_tracks_io_op(bktype, io_object, io_context, io_op))
+				{
+					/* ensure that if IO times are non-zero, counts are > 0 */
+					if (backend_io->times[io_object][io_context][io_op] != 0 &&
+						backend_io->counts[io_object][io_context][io_op] <= 0)
+						return false;
+
 					continue;
+				}
 
-				/* There are stats and there shouldn't be */
-				if (!bktype_tracked ||
-					!pgstat_tracks_io_op(bktype, io_object, io_context, io_op))
+				/* we don't track it, and it is not 0 */
+				if (backend_io->counts[io_object][io_context][io_op] != 0)
 					return false;
+
+				/* we don't track this IOOp, so make sure its IO time is zero */
+				if (pgstat_tracks_io_time(io_op) > -1)
+				{
+					if (backend_io->times[io_object][io_context][io_op] != 0)
+						return false;
+				}
 			}
 		}
 	}
@@ -70,7 +81,21 @@ pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op)
 	Assert((unsigned int) io_op < IOOP_NUM_TYPES);
 	Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
 
-	PendingIOStats.data[io_object][io_context][io_op]++;
+	PendingIOStats.counts[io_object][io_context][io_op]++;
+
+	have_iostats = true;
+}
+
+void
+pgstat_count_io_time(IOObject io_object, IOContext io_context, IOOp io_op, instr_time time)
+{
+	Assert(io_object < IOOBJECT_NUM_TYPES);
+	Assert(io_context < IOCONTEXT_NUM_TYPES);
+	Assert(io_op < IOOP_NUM_TYPES);
+	Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
+	Assert(pgstat_tracks_io_time(io_op) != -1);
+
+	INSTR_TIME_ADD(PendingIOStats.pending_times[io_object][io_context][io_op], time);
 
 	have_iostats = true;
 }
@@ -114,8 +139,17 @@ pgstat_flush_io(bool nowait)
 		for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
 		{
 			for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
-				bktype_shstats->data[io_object][io_context][io_op] +=
-					PendingIOStats.data[io_object][io_context][io_op];
+			{
+				instr_time	time;
+
+				bktype_shstats->counts[io_object][io_context][io_op] +=
+					PendingIOStats.counts[io_object][io_context][io_op];
+
+				time = PendingIOStats.pending_times[io_object][io_context][io_op];
+
+				bktype_shstats->times[io_object][io_context][io_op] +=
+					INSTR_TIME_GET_MICROSEC(time);
+			}
 		}
 	}
 
@@ -384,3 +418,30 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 
 	return true;
 }
+
+/*
+ * PgStat_BktypeIO->times contains IO times for IOOps. For simplicity this
+ * array has a spot for every IOOp. pgstat_tracks_io_time() is the source of
+ * truth for which IOOps have corresponding IO times.
+ */
+IOOp
+pgstat_tracks_io_time(IOOp io_op)
+{
+	switch (io_op)
+	{
+		case IOOP_READ:
+			return IOOP_READ;
+		case IOOP_WRITE:
+			return IOOP_WRITE;
+		case IOOP_EXTEND:
+			return IOOP_EXTEND;
+		case IOOP_FSYNC:
+			return IOOP_FSYNC;
+		case IOOP_EVICT:
+		case IOOP_REUSE:
+			return -1;
+	}
+
+	elog(ERROR, "unrecognized IOOp value: %d", io_op);
+	pg_unreachable();
+}
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 35c6d46555..60fdda1a14 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1255,12 +1255,16 @@ typedef enum io_stat_col
 	IO_COL_IO_OBJECT,
 	IO_COL_IO_CONTEXT,
 	IO_COL_READS,
+	IO_COL_READ_TIME,
 	IO_COL_WRITES,
+	IO_COL_WRITE_TIME,
 	IO_COL_EXTENDS,
+	IO_COL_EXTEND_TIME,
 	IO_COL_CONVERSION,
 	IO_COL_EVICTIONS,
 	IO_COL_REUSES,
 	IO_COL_FSYNCS,
+	IO_COL_FSYNC_TIME,
 	IO_COL_RESET_TIME,
 	IO_NUM_COLUMNS,
 } io_stat_col;
@@ -1292,6 +1296,28 @@ pgstat_get_io_op_index(IOOp io_op)
 	pg_unreachable();
 }
 
+/*
+ * Get the number of the column containing IO times for the specified IOOp. If
+ * the specified IOOp is one for which IO time is not tracked, return -1. Note
+ * that this function assumes that IO time for an IOOp is displayed in the view
+ * in the column directly after the IOOp counts.
+ */
+static io_stat_col
+pgstat_get_io_time_index(IOOp io_op)
+{
+	if (pgstat_tracks_io_time(io_op) == -1)
+		return -1;
+
+	return pgstat_get_io_op_index(io_op) + 1;
+}
+
+static inline
+PgStat_Counter
+pg_stat_micro_to_millisecs(PgStat_Counter val_microsec)
+{
+	return val_microsec * 0.001;
+}
+
 Datum
 pg_stat_get_io(PG_FUNCTION_ARGS)
 {
@@ -1359,20 +1385,32 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
 
 				for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
 				{
-					int			col_idx = pgstat_get_io_op_index(io_op);
+					PgStat_Counter count = bktype_stats->counts[io_obj][io_context][io_op];
+					int			i = pgstat_get_io_op_index(io_op);
 
 					/*
 					 * Some combinations of BackendType and IOOp, of IOContext
 					 * and IOOp, and of IOObject and IOOp are not tracked. Set
 					 * these cells in the view NULL.
 					 */
-					nulls[col_idx] = !pgstat_tracks_io_op(bktype, io_obj, io_context, io_op);
+					if (pgstat_tracks_io_op(bktype, io_obj, io_context, io_op))
+						values[i] = Int64GetDatum(count);
+					else
+						nulls[i] = true;
+				}
 
-					if (nulls[col_idx])
+				for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+				{
+					PgStat_Counter time = bktype_stats->times[io_obj][io_context][io_op];
+					int			i = pgstat_get_io_time_index(io_op);
+
+					if (i == -1)
 						continue;
 
-					values[col_idx] =
-						Int64GetDatum(bktype_stats->data[io_obj][io_context][io_op]);
+					if (!nulls[pgstat_get_io_op_index(io_op)])
+						values[i] = Float8GetDatum(pg_stat_micro_to_millisecs(time));
+					else
+						nulls[i] = true;
 				}
 
 				tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index fbc4aade49..a16c49fad2 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5728,9 +5728,9 @@
   proname => 'pg_stat_get_io', provolatile => 'v',
   prorows => '30', proretset => 't',
   proparallel => 'r', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{text,text,text,int8,int8,int8,int8,int8,int8,int8,timestamptz}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{backend_type,io_object,io_context,reads,writes,extends,op_bytes,evictions,reuses,fsyncs,stats_reset}',
+  proallargtypes => '{text,text,text,int8,float8,int8,float8,int8,float8,int8,int8,int8,int8,float8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{backend_type,io_object,io_context,reads,read_time,writes,write_time,extends,extend_time,op_bytes,evictions,reuses,fsyncs,fsync_time,stats_reset}',
   prosrc => 'pg_stat_get_io' },
 
 { oid => '1136', descr => 'statistics: information about WAL activity',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1e418b682b..288a8c7c0e 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -304,7 +304,12 @@ typedef enum IOOp
 
 typedef struct PgStat_BktypeIO
 {
-	PgStat_Counter data[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	union
+	{
+		instr_time	pending_times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+		PgStat_Counter times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	};
 } PgStat_BktypeIO;
 
 typedef struct PgStat_IO
@@ -466,6 +471,11 @@ extern TimestampTz pgstat_get_stat_snapshot_timestamp(bool *have_snapshot);
 extern PgStat_Kind pgstat_get_kind_from_str(char *kind_str);
 extern bool pgstat_have_entry(PgStat_Kind kind, Oid dboid, Oid objoid);
 
+extern instr_time pgstat_io_start(void);
+
+extern void pgstat_io_end(instr_time io_start, IOObject io_object,
+						  IOContext io_context, IOOp io_op);
+
 
 /*
  * Functions in pgstat_archiver.c
@@ -498,6 +508,7 @@ extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
 extern bool pgstat_bktype_io_stats_valid(PgStat_BktypeIO *context_ops,
 										 BackendType bktype);
 extern void pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op);
+extern void pgstat_count_io_time(IOObject io_object, IOContext io_context, IOOp io_op, instr_time time);
 extern PgStat_IO *pgstat_fetch_stat_io(void);
 extern const char *pgstat_get_io_context_name(IOContext io_context);
 extern const char *pgstat_get_io_object_name(IOObject io_object);
@@ -507,6 +518,7 @@ extern bool pgstat_tracks_io_object(BackendType bktype,
 									IOObject io_object, IOContext io_context);
 extern bool pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 								IOContext io_context, IOOp io_op);
+extern IOOp pgstat_tracks_io_time(IOOp io_op);
 
 
 /*
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index e953d1f515..5434851314 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1880,14 +1880,18 @@ pg_stat_io| SELECT backend_type,
     io_object,
     io_context,
     reads,
+    read_time,
     writes,
+    write_time,
     extends,
+    extend_time,
     op_bytes,
     evictions,
     reuses,
     fsyncs,
+    fsync_time,
     stats_reset
-   FROM pg_stat_get_io() b(backend_type, io_object, io_context, reads, writes, extends, op_bytes, evictions, reuses, fsyncs, stats_reset);
+   FROM pg_stat_get_io() b(backend_type, io_object, io_context, reads, read_time, writes, write_time, extends, extend_time, op_bytes, evictions, reuses, fsyncs, fsync_time, stats_reset);
 pg_stat_progress_analyze| SELECT s.pid,
     s.datid,
     d.datname,
-- 
2.37.2

#15Andres Freund
andres@anarazel.de
In reply to: Melanie Plageman (#14)
2 attachment(s)
Re: Track IO times in pg_stat_io

Hi,

On 2023-03-16 17:19:16 -0400, Melanie Plageman wrote:

> > > > I wonder if we should get rid of pgStatBlockReadTime, pgStatBlockWriteTime,
> > >
> > > And then have pg_stat_reset_shared('io') reset pg_stat_database IO
> > > stats?
> >
> > Yes.
>
> I think this makes sense but I am hesitant to do it in this patchset,
> because it feels a bit hidden...maybe?

I'd not do it in the same commit, but I don't see a problem with doing it in
the same patchset.

Now that I think about it again, this wouldn't make pg_stat_reset_shared('io')
affect pg_stat_database - I was thinking we should use pgstat_io.c stats to
provide the information for pgstat_database.c, using its own pending counter.

> > No, I don't think I am suggesting that. What I am trying to suggest is that
> > PendingIOStats should contain instr_time, but that PgStat_IO should contain
> > PgStat_Counter as microseconds, as before.
>
> So, I've modified the code to make a union of instr_time and
> PgStat_Counter in PgStat_BktypeIO. I am not quite sure if this is okay.
> I store in microsec and then in pg_stat_io, I multiply to get
> milliseconds for display.

Not a fan - what do we gain by having this union? It seems considerably
cleaner to have a struct local to pgstat_io.c that uses instr_time and have a
clean type in PgStat_BktypeIO. In fact, the code worked after just changing
that.
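
A minimal standalone sketch of the separation suggested above, not the incremental patch itself. It assumes simplified stand-ins for the PostgreSQL types: instr_time_sketch, counter_sketch, SharedIOSketch, PendingIOSketch, N_OPS and the sketch_* helpers are all hypothetical names. The backend-local pending struct keeps times in the native clock type, while the shared struct holds only plain counters in microseconds, converted at flush time:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for PostgreSQL types; simplified assumptions for this sketch. */
typedef struct { int64_t ticks; } instr_time_sketch;	/* nanoseconds */
typedef int64_t counter_sketch;							/* like PgStat_Counter */

#define N_OPS 4		/* hypothetical: read, write, extend, fsync */

/* Shared, user-visible stats: plain counters, times in microseconds. */
typedef struct
{
	counter_sketch counts[N_OPS];
	counter_sketch times[N_OPS];
} SharedIOSketch;

/* Backend-local pending stats: times stay in the native clock type. */
typedef struct
{
	counter_sketch counts[N_OPS];
	instr_time_sketch pending_times[N_OPS];
} PendingIOSketch;

/* Convert the native clock value (nanoseconds here) to microseconds. */
static int64_t
sketch_get_microsec(instr_time_sketch t)
{
	return t.ticks / 1000;
}

/* Fold pending stats into the shared stats, then reset the pending ones. */
static void
sketch_flush(SharedIOSketch *shared, PendingIOSketch *pending)
{
	assert(shared && pending);

	for (int op = 0; op < N_OPS; op++)
	{
		shared->counts[op] += pending->counts[op];
		shared->times[op] += sketch_get_microsec(pending->pending_times[op]);

		pending->counts[op] = 0;
		pending->pending_times[op].ticks = 0;
	}
}

/* Display conversion; returns double so fractions of a millisecond survive. */
static double
sketch_us_to_ms(counter_sketch us)
{
	return us * 0.001;
}
```

With this layout no union is needed: only the file that owns the pending struct ever sees the clock type, and readers of the shared struct deal with one integer type.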

I don't think it makes sense to have pgstat_io_start()/end() as well as
pgstat_count_io*. For one, the name seems in a "too general namespace" - why
not a pgstat_count*?

> I considered refactoring pgstat_io_end() to use INSTR_TIME_ACCUM_DIFF()
> like [1], but, in the end I actually think I would end up with more
> operations because of the various different counters needing to be
> updated. As it is now, I do a single subtract and a few adds (one for
> each of the different statistics objects tracking IO times
> (pgBufferUsage, pgStatBlockWrite/ReadTime)). Whereas, I would need to do
> an accum diff for every one of those.

Right - INSTR_TIME_ACCUM_DIFF() only makes sense if there's just a
single counter to update.
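
To make the operation count concrete, here is a small sketch under simplifying assumptions: times are plain int64 tick values instead of instr_time, and ticks_sketch, accum_diff() and add_elapsed_to_all() are hypothetical names. accum_diff() mirrors the x += y - z shape of INSTR_TIME_ACCUM_DIFF(x, y, z):

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t ticks_sketch;	/* simplified stand-in for instr_time */

/*
 * Same shape as INSTR_TIME_ACCUM_DIFF(x, y, z), i.e. x += y - z.
 * One subtract and one add per counter that gets updated.
 */
static void
accum_diff(ticks_sketch *acc, ticks_sketch end, ticks_sketch start)
{
	assert(acc);
	*acc += end - start;
}

/*
 * Alternative when several counters need the same elapsed time:
 * subtract once, then do one add per counter.
 */
static void
add_elapsed_to_all(ticks_sketch *counters[], int n,
				   ticks_sketch end, ticks_sketch start)
{
	ticks_sketch elapsed = end - start;

	for (int i = 0; i < n; i++)
		*counters[i] += elapsed;
}
```

With N counters, calling accum_diff() N times costs N subtracts and N adds, while add_elapsed_to_all() costs one subtract and N adds, so the accumulate-diff form only pays off when N is 1.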

WRT:
/* TODO: AFAICT, pgstat_count_buffer_write_time is only called */
/* for shared buffers whereas pgstat_count_buffer_read_time is */
/* called for temp relations and shared buffers. */
/*
* is this intentional and should I match current behavior or
* not?
*/

It's hard to see how that behaviour could be intentional. Probably worth
fixing in a separate patch. I don't think we're going to backpatch, but it
would make this clearer nonetheless.

Incremental patch with some of the above changed attached.

Btw, it's quite nice how one now can attribute time more easily:

20 connections copying an 8MB file 50 times each:
SELECT reuses, evictions, writes, write_time, extends, extend_time FROM pg_stat_io WHERE backend_type = 'client backend' AND io_object = 'relation' AND io_context='bulkwrite';
┌────────┬───────────┬────────┬────────────┬─────────┬─────────────┐
│ reuses │ evictions │ writes │ write_time │ extends │ extend_time │
├────────┼───────────┼────────┼────────────┼─────────┼─────────────┤
│  36112 │         0 │  36112 │        141 │ 1523176 │        8676 │
└────────┴───────────┴────────┴────────────┴─────────┴─────────────┘

20 connections copying an 80MB file 5 times each:
┌─────────┬───────────┬─────────┬────────────┬─────────┬─────────────┐
│ reuses  │ evictions │ writes  │ write_time │ extends │ extend_time │
├─────────┼───────────┼─────────┼────────────┼─────────┼─────────────┤
│ 1318539 │         0 │ 1318539 │       5013 │ 1523339 │        7873 │
└─────────┴───────────┴─────────┴────────────┴─────────┴─────────────┘
(1 row)
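
One way to read these numbers is as an average time per operation. The helper below is a hypothetical sketch (avg_ms_per_op() is not part of the patch), assuming the *_time columns are in milliseconds as the patch's documentation describes:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Average time per operation in milliseconds, given a cumulative time
 * column (e.g. write_time) and its matching count column (e.g. writes).
 * Returns 0.0 when no operations were counted, to avoid dividing by zero.
 */
static double
avg_ms_per_op(double total_ms, int64_t ops)
{
	assert(ops >= 0);

	if (ops == 0)
		return 0.0;

	return total_ms / (double) ops;
}
```

For the first result above, avg_ms_per_op(141, 36112) comes to roughly 0.004 ms per write and avg_ms_per_op(8676, 1523176) to roughly 0.006 ms per extend.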

Greetings,

Andres

Attachments:

v6-0001-Track-IO-times-in-pg_stat_io.patch (text/x-diff; charset=us-ascii)
From 5d4aa3f6c651006f1ec960f59e24ebc8b5a8ca25 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 6 Mar 2023 10:41:51 -0500
Subject: [PATCH v6 1/2] Track IO times in pg_stat_io

Add IO timing for reads, writes, extends, and fsyncs to pg_stat_io.

Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/flat/CAAKRu_ay5iKmnbXZ3DsauViF3eMxu4m1oNnJXqV_HyqYeg55Ww%40mail.gmail.com
---
 src/include/catalog/pg_proc.dat        |   6 +-
 src/include/pgstat.h                   |  14 +++-
 src/backend/catalog/system_views.sql   |   4 +
 src/backend/storage/buffer/bufmgr.c    |  56 ++++++--------
 src/backend/storage/buffer/localbuf.c  |   6 +-
 src/backend/storage/smgr/md.c          |  27 ++++---
 src/backend/utils/activity/pgstat.c    |  77 ++++++++++++++++++-
 src/backend/utils/activity/pgstat_io.c | 101 ++++++++++++++++++++-----
 src/backend/utils/adt/pgstatfuncs.c    |  48 ++++++++++--
 doc/src/sgml/monitoring.sgml           |  59 +++++++++++++++
 src/test/regress/expected/rules.out    |   6 +-
 11 files changed, 326 insertions(+), 78 deletions(-)

diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 5cf87aeb2c4..4148ec09ae7 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5744,9 +5744,9 @@
   proname => 'pg_stat_get_io', provolatile => 'v',
   prorows => '30', proretset => 't',
   proparallel => 'r', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{text,text,text,int8,int8,int8,int8,int8,int8,int8,timestamptz}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{backend_type,io_object,io_context,reads,writes,extends,op_bytes,evictions,reuses,fsyncs,stats_reset}',
+  proallargtypes => '{text,text,text,int8,float8,int8,float8,int8,float8,int8,int8,int8,int8,float8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{backend_type,io_object,io_context,reads,read_time,writes,write_time,extends,extend_time,op_bytes,evictions,reuses,fsyncs,fsync_time,stats_reset}',
   prosrc => 'pg_stat_get_io' },
 
 { oid => '1136', descr => 'statistics: information about WAL activity',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1e418b682b5..288a8c7c0ef 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -304,7 +304,12 @@ typedef enum IOOp
 
 typedef struct PgStat_BktypeIO
 {
-	PgStat_Counter data[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	union
+	{
+		instr_time	pending_times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+		PgStat_Counter times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	};
 } PgStat_BktypeIO;
 
 typedef struct PgStat_IO
@@ -466,6 +471,11 @@ extern TimestampTz pgstat_get_stat_snapshot_timestamp(bool *have_snapshot);
 extern PgStat_Kind pgstat_get_kind_from_str(char *kind_str);
 extern bool pgstat_have_entry(PgStat_Kind kind, Oid dboid, Oid objoid);
 
+extern instr_time pgstat_io_start(void);
+
+extern void pgstat_io_end(instr_time io_start, IOObject io_object,
+						  IOContext io_context, IOOp io_op);
+
 
 /*
  * Functions in pgstat_archiver.c
@@ -498,6 +508,7 @@ extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
 extern bool pgstat_bktype_io_stats_valid(PgStat_BktypeIO *context_ops,
 										 BackendType bktype);
 extern void pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op);
+extern void pgstat_count_io_time(IOObject io_object, IOContext io_context, IOOp io_op, instr_time time);
 extern PgStat_IO *pgstat_fetch_stat_io(void);
 extern const char *pgstat_get_io_context_name(IOContext io_context);
 extern const char *pgstat_get_io_object_name(IOObject io_object);
@@ -507,6 +518,7 @@ extern bool pgstat_tracks_io_object(BackendType bktype,
 									IOObject io_object, IOContext io_context);
 extern bool pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 								IOContext io_context, IOOp io_op);
+extern IOOp pgstat_tracks_io_time(IOOp io_op);
 
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 34ca0e739f5..39391bc2fc0 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1123,12 +1123,16 @@ SELECT
        b.io_object,
        b.io_context,
        b.reads,
+       b.read_time,
        b.writes,
+       b.write_time,
        b.extends,
+       b.extend_time,
        b.op_bytes,
        b.evictions,
        b.reuses,
        b.fsyncs,
+       b.fsync_time,
        b.stats_reset
 FROM pg_stat_get_io() b;
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 0a05577b68d..64857f1ff68 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1000,12 +1000,17 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	if (isExtend)
 	{
+		instr_time	io_start;
+
 		/* new buffers are zero-filled */
 		MemSet((char *) bufBlock, 0, BLCKSZ);
+
+		io_start = pgstat_io_start();
+
 		/* don't set checksum for all-zero page */
 		smgrextend(smgr, forkNum, blockNum, bufBlock, false);
 
-		pgstat_count_io_op(io_object, io_context, IOOP_EXTEND);
+		pgstat_io_end(io_start, io_object, io_context, IOOP_EXTEND);
 
 		/*
 		 * NB: we're *not* doing a ScheduleBufferTagForWriteback here;
@@ -1024,25 +1029,13 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			MemSet((char *) bufBlock, 0, BLCKSZ);
 		else
 		{
-			instr_time	io_start,
-						io_time;
+			instr_time	io_start;
 
-			if (track_io_timing)
-				INSTR_TIME_SET_CURRENT(io_start);
-			else
-				INSTR_TIME_SET_ZERO(io_start);
+			io_start = pgstat_io_start();
 
 			smgrread(smgr, forkNum, blockNum, bufBlock);
 
-			pgstat_count_io_op(io_object, io_context, IOOP_READ);
-
-			if (track_io_timing)
-			{
-				INSTR_TIME_SET_CURRENT(io_time);
-				INSTR_TIME_SUBTRACT(io_time, io_start);
-				pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
-				INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
-			}
+			pgstat_io_end(io_start, io_object, io_context, IOOP_READ);
 
 			/* check for garbage data */
 			if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
@@ -2848,6 +2841,7 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
 	*blknum = bufHdr->tag.blockNum;
 }
 
+
 /*
  * FlushBuffer
  *		Physically write out a shared buffer.
@@ -2873,8 +2867,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 {
 	XLogRecPtr	recptr;
 	ErrorContextCallback errcallback;
-	instr_time	io_start,
-				io_time;
+	instr_time	io_start;
 	Block		bufBlock;
 	char	   *bufToWrite;
 	uint32		buf_state;
@@ -2949,10 +2942,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 */
 	bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
 
-	if (track_io_timing)
-		INSTR_TIME_SET_CURRENT(io_start);
-	else
-		INSTR_TIME_SET_ZERO(io_start);
+	io_start = pgstat_io_start();
 
 	/*
 	 * bufToWrite is either the shared buffer or a copy, as appropriate.
@@ -2981,17 +2971,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * When a strategy is not in use, the write can only be a "regular" write
 	 * of a dirty shared buffer (IOCONTEXT_NORMAL IOOP_WRITE).
 	 */
-	pgstat_count_io_op(IOOBJECT_RELATION, io_context, IOOP_WRITE);
-
-	if (track_io_timing)
-	{
-		INSTR_TIME_SET_CURRENT(io_time);
-		INSTR_TIME_SUBTRACT(io_time, io_start);
-		pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
-		INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
-	}
-
-	pgBufferUsage.shared_blks_written++;
+	pgstat_io_end(io_start, IOOBJECT_RELATION, io_context, IOOP_WRITE);
 
 	/*
 	 * Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
@@ -3597,6 +3577,7 @@ FlushRelationBuffers(Relation rel)
 		for (i = 0; i < NLocBuffer; i++)
 		{
 			uint32		buf_state;
+			instr_time	io_start;
 
 			bufHdr = GetLocalBufferDescriptor(i);
 			if (BufTagMatchesRelFileLocator(&bufHdr->tag, &rel->rd_locator) &&
@@ -3616,6 +3597,12 @@ FlushRelationBuffers(Relation rel)
 
 				PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
+				/*
+				 * TODO: is it okay that this formerly didn't count
+				 * pgBufferUsage or pgStatBlockReadTime
+				 */
+				io_start = pgstat_io_start();
+
 				smgrwrite(RelationGetSmgr(rel),
 						  BufTagGetForkNum(&bufHdr->tag),
 						  bufHdr->tag.blockNum,
@@ -3625,13 +3612,14 @@ FlushRelationBuffers(Relation rel)
 				buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
 				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
 
-				pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
+				pgstat_io_end(io_start, IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
 
 				/* Pop the error context stack */
 				error_context_stack = errcallback.previous;
 			}
 		}
 
+
 		return;
 	}
 
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 5325ddb663d..af45a72934a 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -220,6 +220,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 	 */
 	if (buf_state & BM_DIRTY)
 	{
+		instr_time	io_start;
 		SMgrRelation oreln;
 		Page		localpage = (char *) LocalBufHdrGetBlock(bufHdr);
 
@@ -228,6 +229,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 
 		PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
+		io_start = pgstat_io_start();
+
 		/* And write... */
 		smgrwrite(oreln,
 				  BufTagGetForkNum(&bufHdr->tag),
@@ -239,8 +242,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 		buf_state &= ~BM_DIRTY;
 		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
 
-		pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
-		pgBufferUsage.local_blks_written++;
+		pgstat_io_end(io_start, IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
 	}
 
 	/*
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 352958e1feb..2440211f03d 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1030,6 +1030,17 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 
 	if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false /* retryOnError */ ))
 	{
+		instr_time	io_start = pgstat_io_start();
+
+		ereport(DEBUG1,
+				(errmsg_internal("could not forward fsync request because request queue is full")));
+
+		if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
+			ereport(data_sync_elevel(ERROR),
+					(errcode_for_file_access(),
+					 errmsg("could not fsync file \"%s\": %m",
+							FilePathName(seg->mdfd_vfd))));
+
 		/*
 		 * We have no way of knowing if the current IOContext is
 		 * IOCONTEXT_NORMAL or IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] at this
@@ -1041,16 +1052,7 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 		 * IOCONTEXT_NORMAL is likely clearer when investigating the number of
 		 * backend fsyncs.
 		 */
-		pgstat_count_io_op(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC);
-
-		ereport(DEBUG1,
-				(errmsg_internal("could not forward fsync request because request queue is full")));
-
-		if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
-			ereport(data_sync_elevel(ERROR),
-					(errcode_for_file_access(),
-					 errmsg("could not fsync file \"%s\": %m",
-							FilePathName(seg->mdfd_vfd))));
+		pgstat_io_end(io_start, IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC);
 	}
 }
 
@@ -1399,6 +1401,7 @@ int
 mdsyncfiletag(const FileTag *ftag, char *path)
 {
 	SMgrRelation reln = smgropen(ftag->rlocator, InvalidBackendId);
+	instr_time	io_start;
 	File		file;
 	bool		need_to_close;
 	int			result,
@@ -1425,6 +1428,8 @@ mdsyncfiletag(const FileTag *ftag, char *path)
 		need_to_close = true;
 	}
 
+	io_start = pgstat_io_start();
+
 	/* Sync the file. */
 	result = FileSync(file, WAIT_EVENT_DATA_FILE_SYNC);
 	save_errno = errno;
@@ -1432,7 +1437,7 @@ mdsyncfiletag(const FileTag *ftag, char *path)
 	if (need_to_close)
 		FileClose(file);
 
-	pgstat_count_io_op(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC);
+	pgstat_io_end(io_start, IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC);
 
 	errno = save_errno;
 	return result;
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index b125802b215..9f5f041d7c2 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -95,9 +95,11 @@
 
 #include "access/transam.h"
 #include "access/xact.h"
+#include "executor/instrument.h"
 #include "lib/dshash.h"
 #include "pgstat.h"
 #include "port/atomics.h"
+#include "storage/bufmgr.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
@@ -108,7 +110,6 @@
 #include "utils/pgstat_internal.h"
 #include "utils/timestamp.h"
 
-
 /* ----------
  * Timer definitions.
  *
@@ -906,6 +907,80 @@ pgstat_have_entry(PgStat_Kind kind, Oid dboid, Oid objoid)
 	return pgstat_get_entry_ref(kind, dboid, objoid, false, NULL) != NULL;
 }
 
+instr_time
+pgstat_io_start(void)
+{
+	instr_time	io_start;
+
+	if (track_io_timing)
+		INSTR_TIME_SET_CURRENT(io_start);
+	else
+		INSTR_TIME_SET_ZERO(io_start);
+
+	return io_start;
+}
+
+void
+pgstat_io_end(instr_time io_start, IOObject io_object,
+			  IOContext io_context, IOOp io_op)
+{
+	instr_time	io_time;
+
+	if (track_io_timing)
+	{
+		INSTR_TIME_SET_CURRENT(io_time);
+		INSTR_TIME_SUBTRACT(io_time, io_start);
+		pgstat_count_io_time(io_object, io_context, io_op, io_time);
+
+		if (io_op == IOOP_WRITE)
+		{
+			if (io_object == IOOBJECT_RELATION)
+			{
+				/* TODO: AFAICT, pgstat_count_buffer_write_time is only called */
+				/* for shared buffers whereas pgstat_count_buffer_read_time is */
+				/* called for temp relations and shared buffers. */
+				/*
+				 * is this intentional and should I match current behavior or
+				 * not?
+				 */
+				pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
+				INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
+			}
+		}
+		else if (io_op == IOOP_READ)
+		{
+			pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
+			if (io_object == IOOBJECT_RELATION)
+			{
+				INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
+			}
+		}
+	}
+
+	pgstat_count_io_op(io_object, io_context, io_op);
+
+	if (io_op == IOOP_WRITE)
+	{
+		if (io_object == IOOBJECT_RELATION)
+			pgBufferUsage.shared_blks_written++;
+		else if (io_object == IOOBJECT_TEMP_RELATION)
+			pgBufferUsage.local_blks_written++;
+	}
+
+	/*
+	 * TODO: this is normally done later in ReadBuffer_common() is it okay to
+	 * do here?
+	 */
+	else if (io_op == IOOP_READ)
+	{
+		if (io_object == IOOBJECT_RELATION)
+			pgBufferUsage.shared_blks_read++;
+		else if (io_object == IOOBJECT_TEMP_RELATION)
+			pgBufferUsage.local_blks_read++;
+	}
+}
+
+
 /*
  * Ensure snapshot for fixed-numbered 'kind' exists.
  *
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index af5d5546101..fe09cbf16d4 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -25,36 +25,47 @@ bool		have_iostats = false;
 
 /*
  * Check that stats have not been counted for any combination of IOObject,
- * IOContext, and IOOp which are not tracked for the passed-in BackendType. The
- * passed-in PgStat_BktypeIO must contain stats from the BackendType specified
- * by the second parameter. Caller is responsible for locking the passed-in
- * PgStat_BktypeIO, if needed.
+ * IOContext, and IOOp which are not tracked for the passed-in BackendType. If
+ * the IOOp is not counted for this combination but IO time is otherwise
+ * tracked for this IOOp, check that IO time has not been counted for this
+ * combination. If stats are tracked for this combination and IO times are
+ * non-zero, counts should be non-zero.
+ *
+ * The passed-in PgStat_BktypeIO must contain stats from the BackendType
+ * specified by the second parameter. Caller is responsible for locking the
+ * passed-in PgStat_BktypeIO, if needed.
  */
 bool
 pgstat_bktype_io_stats_valid(PgStat_BktypeIO *backend_io,
 							 BackendType bktype)
 {
-	bool		bktype_tracked = pgstat_tracks_io_bktype(bktype);
-
 	for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
 	{
 		for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
 		{
-			/*
-			 * Don't bother trying to skip to the next loop iteration if
-			 * pgstat_tracks_io_object() would return false here. We still
-			 * need to validate that each counter is zero anyway.
-			 */
 			for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
 			{
-				/* No stats, so nothing to validate */
-				if (backend_io->data[io_object][io_context][io_op] == 0)
-					continue;
+				/* we do track it */
+				if (pgstat_tracks_io_op(bktype, io_object, io_context, io_op))
+				{
+					/* ensure that if IO times are non-zero, counts are > 0 */
+					if (backend_io->times[io_object][io_context][io_op] != 0 &&
+						backend_io->counts[io_object][io_context][io_op] <= 0)
+						return false;
 
-				/* There are stats and there shouldn't be */
-				if (!bktype_tracked ||
-					!pgstat_tracks_io_op(bktype, io_object, io_context, io_op))
+					continue;
+				}
+
+				/* we don't track it, and it is not 0 */
+				if (backend_io->counts[io_object][io_context][io_op] != 0)
 					return false;
+
+				/* we don't track this IOOp, so make sure its IO time is zero */
+				if (pgstat_tracks_io_time(io_op) > -1)
+				{
+					if (backend_io->times[io_object][io_context][io_op] != 0)
+						return false;
+				}
 			}
 		}
 	}
@@ -70,7 +81,21 @@ pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op)
 	Assert((unsigned int) io_op < IOOP_NUM_TYPES);
 	Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
 
-	PendingIOStats.data[io_object][io_context][io_op]++;
+	PendingIOStats.counts[io_object][io_context][io_op]++;
+
+	have_iostats = true;
+}
+
+void
+pgstat_count_io_time(IOObject io_object, IOContext io_context, IOOp io_op, instr_time time)
+{
+	Assert(io_object < IOOBJECT_NUM_TYPES);
+	Assert(io_context < IOCONTEXT_NUM_TYPES);
+	Assert(io_op < IOOP_NUM_TYPES);
+	Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
+	Assert(pgstat_tracks_io_time(io_op) != -1);
+
+	INSTR_TIME_ADD(PendingIOStats.pending_times[io_object][io_context][io_op], time);
 
 	have_iostats = true;
 }
@@ -114,8 +139,17 @@ pgstat_flush_io(bool nowait)
 		for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
 		{
 			for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
-				bktype_shstats->data[io_object][io_context][io_op] +=
-					PendingIOStats.data[io_object][io_context][io_op];
+			{
+				instr_time	time;
+
+				bktype_shstats->counts[io_object][io_context][io_op] +=
+					PendingIOStats.counts[io_object][io_context][io_op];
+
+				time = PendingIOStats.pending_times[io_object][io_context][io_op];
+
+				bktype_shstats->times[io_object][io_context][io_op] +=
+					INSTR_TIME_GET_MICROSEC(time);
+			}
 		}
 	}
 
@@ -384,3 +418,30 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 
 	return true;
 }
+
+/*
+ * PgStat_BktypeIO->times contains IO times for IOOps. For simplicity this
+ * array has a spot for every IOOp. pgstat_tracks_io_time() is the source of
+ * truth for which IOOps have corresponding IO times.
+ */
+IOOp
+pgstat_tracks_io_time(IOOp io_op)
+{
+	switch (io_op)
+	{
+		case IOOP_READ:
+			return IOOP_READ;
+		case IOOP_WRITE:
+			return IOOP_WRITE;
+		case IOOP_EXTEND:
+			return IOOP_EXTEND;
+		case IOOP_FSYNC:
+			return IOOP_FSYNC;
+		case IOOP_EVICT:
+		case IOOP_REUSE:
+			return -1;
+	}
+
+	elog(ERROR, "unrecognized IOOp value: %d", io_op);
+	pg_unreachable();
+}
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 35c6d465553..60fdda1a147 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1255,12 +1255,16 @@ typedef enum io_stat_col
 	IO_COL_IO_OBJECT,
 	IO_COL_IO_CONTEXT,
 	IO_COL_READS,
+	IO_COL_READ_TIME,
 	IO_COL_WRITES,
+	IO_COL_WRITE_TIME,
 	IO_COL_EXTENDS,
+	IO_COL_EXTEND_TIME,
 	IO_COL_CONVERSION,
 	IO_COL_EVICTIONS,
 	IO_COL_REUSES,
 	IO_COL_FSYNCS,
+	IO_COL_FSYNC_TIME,
 	IO_COL_RESET_TIME,
 	IO_NUM_COLUMNS,
 } io_stat_col;
@@ -1292,6 +1296,28 @@ pgstat_get_io_op_index(IOOp io_op)
 	pg_unreachable();
 }
 
+/*
+ * Get the number of the column containing IO times for the specified IOOp. If
+ * the specified IOOp is one for which IO time is not tracked, return -1. Note
+ * that this function assumes that IO time for an IOOp is displayed in the view
+ * in the column directly after the IOOp counts.
+ */
+static io_stat_col
+pgstat_get_io_time_index(IOOp io_op)
+{
+	if (pgstat_tracks_io_time(io_op) == -1)
+		return -1;
+
+	return pgstat_get_io_op_index(io_op) + 1;
+}
+
+static inline
+PgStat_Counter
+pg_stat_micro_to_millisecs(PgStat_Counter val_microsec)
+{
+	return val_microsec * 0.001;
+}
+
 Datum
 pg_stat_get_io(PG_FUNCTION_ARGS)
 {
@@ -1359,20 +1385,32 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
 
 				for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
 				{
-					int			col_idx = pgstat_get_io_op_index(io_op);
+					PgStat_Counter count = bktype_stats->counts[io_obj][io_context][io_op];
+					int			i = pgstat_get_io_op_index(io_op);
 
 					/*
 					 * Some combinations of BackendType and IOOp, of IOContext
 					 * and IOOp, and of IOObject and IOOp are not tracked. Set
 					 * these cells in the view NULL.
 					 */
-					nulls[col_idx] = !pgstat_tracks_io_op(bktype, io_obj, io_context, io_op);
+					if (pgstat_tracks_io_op(bktype, io_obj, io_context, io_op))
+						values[i] = Int64GetDatum(count);
+					else
+						nulls[i] = true;
+				}
 
-					if (nulls[col_idx])
+				for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+				{
+					PgStat_Counter time = bktype_stats->times[io_obj][io_context][io_op];
+					int			i = pgstat_get_io_time_index(io_op);
+
+					if (i == -1)
 						continue;
 
-					values[col_idx] =
-						Int64GetDatum(bktype_stats->data[io_obj][io_context][io_op]);
+					if (!nulls[pgstat_get_io_op_index(io_op)])
+						values[i] = Float8GetDatum(pg_stat_micro_to_millisecs(time));
+					else
+						nulls[i] = true;
 				}
 
 				tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 6249bb50d02..ad3667f2585 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3814,6 +3814,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>read_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in read operations in milliseconds (if <xref
+        linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
@@ -3826,6 +3838,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>write_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in write operations in milliseconds (if <xref
+        linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
@@ -3838,6 +3862,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>extend_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in extend operations in milliseconds (if <xref
+        linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
@@ -3902,6 +3938,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>fsync_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in fsync operations in milliseconds (if <xref
+        linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
@@ -3967,6 +4015,17 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
    </itemizedlist>
   </para>
 
+  <note>
+   <para>
+    Columns tracking I/O time will only be non-zero when <xref
+    linkend="guc-track-io-timing"/> is enabled. The user should be careful when
+    using these columns in combination with their corresponding operations to
+    ensure that <varname>track_io_timing</varname> was enabled for the entire
+    time since the last reset.
+   </para>
+  </note>
+
+
 
  </sect2>
 
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index e953d1f5159..54348513145 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1880,14 +1880,18 @@ pg_stat_io| SELECT backend_type,
     io_object,
     io_context,
     reads,
+    read_time,
     writes,
+    write_time,
     extends,
+    extend_time,
     op_bytes,
     evictions,
     reuses,
     fsyncs,
+    fsync_time,
     stats_reset
-   FROM pg_stat_get_io() b(backend_type, io_object, io_context, reads, writes, extends, op_bytes, evictions, reuses, fsyncs, stats_reset);
+   FROM pg_stat_get_io() b(backend_type, io_object, io_context, reads, read_time, writes, write_time, extends, extend_time, op_bytes, evictions, reuses, fsyncs, fsync_time, stats_reset);
 pg_stat_progress_analyze| SELECT s.pid,
     s.datid,
     d.datname,
-- 
2.38.0

v6-0002-pgstat-io-increment-pgstat_io-hackery.patch (text/x-diff; charset=us-ascii)
From f4f1534c105766c64534810b62eba7fea674c2a5 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 20 Mar 2023 19:26:41 -0700
Subject: [PATCH v6 2/2] pgstat: io: increment pgstat_io hackery

---
 src/include/pgstat.h                   | 15 ++---
 src/backend/storage/buffer/bufmgr.c    | 21 ++++---
 src/backend/storage/buffer/localbuf.c  |  5 +-
 src/backend/storage/smgr/md.c          | 12 ++--
 src/backend/utils/activity/pgstat.c    | 77 +-------------------------
 src/backend/utils/activity/pgstat_io.c | 70 ++++++++++++++++++++---
 6 files changed, 90 insertions(+), 110 deletions(-)

diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 288a8c7c0ef..4adbf29fdfd 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -305,11 +305,7 @@ typedef enum IOOp
 typedef struct PgStat_BktypeIO
 {
 	PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
-	union
-	{
-		instr_time	pending_times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
-		PgStat_Counter times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
-	};
+	PgStat_Counter times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 } PgStat_BktypeIO;
 
 typedef struct PgStat_IO
@@ -471,11 +467,6 @@ extern TimestampTz pgstat_get_stat_snapshot_timestamp(bool *have_snapshot);
 extern PgStat_Kind pgstat_get_kind_from_str(char *kind_str);
 extern bool pgstat_have_entry(PgStat_Kind kind, Oid dboid, Oid objoid);
 
-extern instr_time pgstat_io_start(void);
-
-extern void pgstat_io_end(instr_time io_start, IOObject io_object,
-						  IOContext io_context, IOOp io_op);
-
 
 /*
  * Functions in pgstat_archiver.c
@@ -507,8 +498,10 @@ extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
 
 extern bool pgstat_bktype_io_stats_valid(PgStat_BktypeIO *context_ops,
 										 BackendType bktype);
+extern instr_time pgstat_prepare_io_time(void);
 extern void pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op);
-extern void pgstat_count_io_time(IOObject io_object, IOContext io_context, IOOp io_op, instr_time time);
+extern void pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op, instr_time time);
+
 extern PgStat_IO *pgstat_fetch_stat_io(void);
 extern const char *pgstat_get_io_context_name(IOContext io_context);
 extern const char *pgstat_get_io_object_name(IOObject io_object);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 64857f1ff68..532c7af992d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1005,12 +1005,13 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		/* new buffers are zero-filled */
 		MemSet((char *) bufBlock, 0, BLCKSZ);
 
-		io_start = pgstat_io_start();
+		io_start = pgstat_prepare_io_time();
 
 		/* don't set checksum for all-zero page */
 		smgrextend(smgr, forkNum, blockNum, bufBlock, false);
 
-		pgstat_io_end(io_start, io_object, io_context, IOOP_EXTEND);
+		pgstat_count_io_op_time(io_object, io_context,
+								IOOP_EXTEND, io_start);
 
 		/*
 		 * NB: we're *not* doing a ScheduleBufferTagForWriteback here;
@@ -1031,11 +1032,12 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			instr_time	io_start;
 
-			io_start = pgstat_io_start();
+			io_start = pgstat_prepare_io_time();
 
 			smgrread(smgr, forkNum, blockNum, bufBlock);
 
-			pgstat_io_end(io_start, io_object, io_context, IOOP_READ);
+			pgstat_count_io_op_time(io_object, io_context,
+									IOOP_READ, io_start);
 
 			/* check for garbage data */
 			if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
@@ -2942,7 +2944,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 */
 	bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
 
-	io_start = pgstat_io_start();
+	io_start = pgstat_prepare_io_time();
 
 	/*
 	 * bufToWrite is either the shared buffer or a copy, as appropriate.
@@ -2971,7 +2973,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * When a strategy is not in use, the write can only be a "regular" write
 	 * of a dirty shared buffer (IOCONTEXT_NORMAL IOOP_WRITE).
 	 */
-	pgstat_io_end(io_start, IOOBJECT_RELATION, io_context, IOOP_WRITE);
+	pgstat_count_io_op_time(IOOBJECT_RELATION, io_context,
+							IOOP_WRITE, io_start);
 
 	/*
 	 * Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
@@ -3601,7 +3604,7 @@ FlushRelationBuffers(Relation rel)
 				 * TODO: is it okay that this formerly didn't count
 				 * pgBufferUsage or pgStatBlockReadTime
 				 */
-				io_start = pgstat_io_start();
+				io_start = pgstat_prepare_io_time();
 
 				smgrwrite(RelationGetSmgr(rel),
 						  BufTagGetForkNum(&bufHdr->tag),
@@ -3612,7 +3615,9 @@ FlushRelationBuffers(Relation rel)
 				buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
 				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
 
-				pgstat_io_end(io_start, IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
+				pgstat_count_io_op_time(IOOBJECT_TEMP_RELATION,
+										IOCONTEXT_NORMAL, IOOP_WRITE,
+										io_start);
 
 				/* Pop the error context stack */
 				error_context_stack = errcallback.previous;
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index af45a72934a..23f2e2fb759 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -229,7 +229,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 
 		PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
-		io_start = pgstat_io_start();
+		io_start = pgstat_prepare_io_time();
 
 		/* And write... */
 		smgrwrite(oreln,
@@ -242,7 +242,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 		buf_state &= ~BM_DIRTY;
 		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
 
-		pgstat_io_end(io_start, IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
+		pgstat_count_io_op_time(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL,
+								IOOP_WRITE, io_start);
 	}
 
 	/*
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 2440211f03d..51a9134d57f 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1030,11 +1030,13 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 
 	if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false /* retryOnError */ ))
 	{
-		instr_time	io_start = pgstat_io_start();
+		instr_time	io_start;
 
 		ereport(DEBUG1,
 				(errmsg_internal("could not forward fsync request because request queue is full")));
 
+		io_start = pgstat_prepare_io_time();
+
 		if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
 			ereport(data_sync_elevel(ERROR),
 					(errcode_for_file_access(),
@@ -1052,7 +1054,8 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 		 * IOCONTEXT_NORMAL is likely clearer when investigating the number of
 		 * backend fsyncs.
 		 */
-		pgstat_io_end(io_start, IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC);
+		pgstat_count_io_op_time(IOOBJECT_RELATION, IOCONTEXT_NORMAL,
+								IOOP_FSYNC, io_start);
 	}
 }
 
@@ -1428,7 +1431,7 @@ mdsyncfiletag(const FileTag *ftag, char *path)
 		need_to_close = true;
 	}
 
-	io_start = pgstat_io_start();
+	io_start = pgstat_prepare_io_time();
 
 	/* Sync the file. */
 	result = FileSync(file, WAIT_EVENT_DATA_FILE_SYNC);
@@ -1437,7 +1440,8 @@ mdsyncfiletag(const FileTag *ftag, char *path)
 	if (need_to_close)
 		FileClose(file);
 
-	pgstat_io_end(io_start, IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC);
+	pgstat_count_io_op_time(IOOBJECT_RELATION, IOCONTEXT_NORMAL,
+							IOOP_FSYNC, io_start);
 
 	errno = save_errno;
 	return result;
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 9f5f041d7c2..b125802b215 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -95,11 +95,9 @@
 
 #include "access/transam.h"
 #include "access/xact.h"
-#include "executor/instrument.h"
 #include "lib/dshash.h"
 #include "pgstat.h"
 #include "port/atomics.h"
-#include "storage/bufmgr.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
@@ -110,6 +108,7 @@
 #include "utils/pgstat_internal.h"
 #include "utils/timestamp.h"
 
+
 /* ----------
  * Timer definitions.
  *
@@ -907,80 +906,6 @@ pgstat_have_entry(PgStat_Kind kind, Oid dboid, Oid objoid)
 	return pgstat_get_entry_ref(kind, dboid, objoid, false, NULL) != NULL;
 }
 
-instr_time
-pgstat_io_start(void)
-{
-	instr_time	io_start;
-
-	if (track_io_timing)
-		INSTR_TIME_SET_CURRENT(io_start);
-	else
-		INSTR_TIME_SET_ZERO(io_start);
-
-	return io_start;
-}
-
-void
-pgstat_io_end(instr_time io_start, IOObject io_object,
-			  IOContext io_context, IOOp io_op)
-{
-	instr_time	io_time;
-
-	if (track_io_timing)
-	{
-		INSTR_TIME_SET_CURRENT(io_time);
-		INSTR_TIME_SUBTRACT(io_time, io_start);
-		pgstat_count_io_time(io_object, io_context, io_op, io_time);
-
-		if (io_op == IOOP_WRITE)
-		{
-			if (io_object == IOOBJECT_RELATION)
-			{
-				/* TODO: AFAICT, pgstat_count_buffer_write_time is only called */
-				/* for shared buffers whereas pgstat_count_buffer_read_time is */
-				/* called for temp relations and shared buffers. */
-				/*
-				 * is this intentional and should I match current behavior or
-				 * not?
-				 */
-				pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
-				INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
-			}
-		}
-		else if (io_op == IOOP_READ)
-		{
-			pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
-			if (io_object == IOOBJECT_RELATION)
-			{
-				INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
-			}
-		}
-	}
-
-	pgstat_count_io_op(io_object, io_context, io_op);
-
-	if (io_op == IOOP_WRITE)
-	{
-		if (io_object == IOOBJECT_RELATION)
-			pgBufferUsage.shared_blks_written++;
-		else if (io_object == IOOBJECT_TEMP_RELATION)
-			pgBufferUsage.local_blks_written++;
-	}
-
-	/*
-	 * TODO: this is normally done later in ReadBuffer_common() is it okay to
-	 * do here?
-	 */
-	else if (io_op == IOOP_READ)
-	{
-		if (io_object == IOOBJECT_RELATION)
-			pgBufferUsage.shared_blks_read++;
-		else if (io_object == IOOBJECT_TEMP_RELATION)
-			pgBufferUsage.local_blks_read++;
-	}
-}
-
-
 /*
  * Ensure snapshot for fixed-numbered 'kind' exists.
  *
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index fe09cbf16d4..08888fa0c5b 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -16,10 +16,19 @@
 
 #include "postgres.h"
 
+#include "executor/instrument.h"
+#include "storage/bufmgr.h"
 #include "utils/pgstat_internal.h"
 
 
-static PgStat_BktypeIO PendingIOStats;
+typedef struct PgStat_PendingIO
+{
+	PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	instr_time	pending_times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+} PgStat_PendingIO;
+
+
+static PgStat_PendingIO PendingIOStats;
 bool		have_iostats = false;
 
 
@@ -73,6 +82,19 @@ pgstat_bktype_io_stats_valid(PgStat_BktypeIO *backend_io,
 	return true;
 }
 
+instr_time
+pgstat_prepare_io_time(void)
+{
+	instr_time	io_start;
+
+	if (track_io_timing)
+		INSTR_TIME_SET_CURRENT(io_start);
+	else
+		INSTR_TIME_SET_ZERO(io_start);
+
+	return io_start;
+}
+
 void
 pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op)
 {
@@ -86,18 +108,48 @@ pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op)
 	have_iostats = true;
 }
 
+/*
+ * Like pgstat_count_io_op() except it also accumulates time.
+ */
 void
-pgstat_count_io_time(IOObject io_object, IOContext io_context, IOOp io_op, instr_time time)
+pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
+						instr_time start_time)
 {
-	Assert(io_object < IOOBJECT_NUM_TYPES);
-	Assert(io_context < IOCONTEXT_NUM_TYPES);
-	Assert(io_op < IOOP_NUM_TYPES);
-	Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
-	Assert(pgstat_tracks_io_time(io_op) != -1);
+	if (track_io_timing)
+	{
+		instr_time	io_time;
 
-	INSTR_TIME_ADD(PendingIOStats.pending_times[io_object][io_context][io_op], time);
+		INSTR_TIME_SET_CURRENT(io_time);
+		INSTR_TIME_SUBTRACT(io_time, start_time);
 
-	have_iostats = true;
+		if (io_op == IOOP_WRITE)
+		{
+			if (io_object == IOOBJECT_RELATION)
+			{
+				/* TODO: AFAICT, pgstat_count_buffer_write_time is only called */
+				/* for shared buffers whereas pgstat_count_buffer_read_time is */
+				/* called for temp relations and shared buffers. */
+				/*
+				 * is this intentional and should I match current behavior or
+				 * not?
+				 */
+				pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
+				INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
+			}
+		}
+		else if (io_op == IOOP_READ)
+		{
+			pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
+			if (io_object == IOOBJECT_RELATION)
+			{
+				INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
+			}
+		}
+
+		INSTR_TIME_ADD(PendingIOStats.pending_times[io_object][io_context][io_op], io_time);
+	}
+
+	pgstat_count_io_op(io_object, io_context, io_op);
 }
 
 PgStat_IO *
-- 
2.38.0

#16 Melanie Plageman
melanieplageman@gmail.com
In reply to: Andres Freund (#15)
4 attachment(s)
Re: Track IO times in pg_stat_io

On Mon, Mar 20, 2023 at 10:34 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2023-03-16 17:19:16 -0400, Melanie Plageman wrote:

I wonder if we should get rid of pgStatBlockReadTime, pgStatBlockWriteTime,

And then have pg_stat_reset_shared('io') reset pg_stat_database IO
stats?

Yes.

I think this makes sense but I am hesitant to do it in this patchset,
because it feels a bit hidden...maybe?

I'd not do it in the same commit, but I don't see a problem with doing it in
the same patchset.

Now that I think about it again, this wouldn't make pg_stat_reset_shared('io')
affect pg_stat_database - I was thinking we should use pgstat_io.c stats to
provide the information for pgstat_database.c, using its own pending counter.

So, I've done this in the attached. But, won't resetting pgstat_database
be a bit weird if you have built up some IO timing in pending counters
and right after you reset a flush happens and then suddenly the values
are way above 0 again?
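To make the concern above concrete, here is a minimal sketch of a reset racing with a later flush. Everything in it (`demo_stats`, `demo_record_io`, `demo_reset_shared`, `demo_flush`) is a hypothetical stand-in, not the pgstat API; it only assumes that time accumulates in a backend-local pending counter and is added to the shared total at flush.

```c
#include <stdint.h>

/* Hypothetical model: one shared total and one backend-local pending total. */
typedef struct
{
	int64_t		shared_time_us;		/* what the view would report */
	int64_t		pending_time_us;	/* accumulated locally, not yet flushed */
} demo_stats;

/* An IO finishes: time is only added to the pending counter. */
static void
demo_record_io(demo_stats *s, int64_t elapsed_us)
{
	s->pending_time_us += elapsed_us;
}

/* A reset zeroes the shared total but cannot see pending time. */
static void
demo_reset_shared(demo_stats *s)
{
	s->shared_time_us = 0;
}

/* A flush moves pending time into the shared total. */
static void
demo_flush(demo_stats *s)
{
	s->shared_time_us += s->pending_time_us;
	s->pending_time_us = 0;
}
```

Calling `demo_record_io()`, then `demo_reset_shared()`, then `demo_flush()` leaves the shared total non-zero immediately after the reset, which is the behavior being questioned.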

I considered refactoring pgstat_io_end() to use INSTR_TIME_ACCUM_DIFF()
like [1], but, in the end I actually think I would end up with more
operations because of the various different counters needing to be
updated. As it is now, I do a single subtract and a few adds (one for
each of the different statistics objects tracking IO times:
pgBufferUsage and pgStatBlockWrite/ReadTime). With
INSTR_TIME_ACCUM_DIFF(), I would need to do an accum diff for every one
of those.

Right - INSTR_TIME_ACCUM_DIFF() only makes sense if there's just a
single counter to update.
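The trade-off discussed above can be sketched as follows. This is a simplified model, not the patch: `demo_time` is a hypothetical stand-in for instr_time (a plain nanosecond count), and the three counters are assumed placeholders for pgBufferUsage, the pending pg_stat_io time, and the pg_stat_database counter.

```c
#include <stdint.h>

/* Hypothetical stand-in for instr_time: a nanosecond tick count. */
typedef struct
{
	int64_t		ticks;
} demo_time;

/* Hypothetical counters, one per statistics object tracking IO time. */
typedef struct
{
	demo_time	buffer_usage_time;
	demo_time	pending_io_time;
	int64_t		database_time_us;
} demo_counters;

/* One subtraction up front, then one add per counter. */
static void
demo_subtract_then_add(demo_counters *c, demo_time start, demo_time end)
{
	demo_time	elapsed;

	elapsed.ticks = end.ticks - start.ticks;

	c->buffer_usage_time.ticks += elapsed.ticks;
	c->pending_io_time.ticks += elapsed.ticks;
	c->database_time_us += elapsed.ticks / 1000;
}

/* Accumulate the difference per counter: the subtraction is repeated. */
static void
demo_accum_diff(demo_counters *c, demo_time start, demo_time end)
{
	c->buffer_usage_time.ticks += end.ticks - start.ticks;
	c->pending_io_time.ticks += end.ticks - start.ticks;
	c->database_time_us += (end.ticks - start.ticks) / 1000;
}
```

Both functions leave the counters in the same state; the second just recomputes `end - start` for each counter, which is why the accum-diff form pays off only when a single counter is updated.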

WRT:
/* TODO: AFAICT, pgstat_count_buffer_write_time is only called */
/* for shared buffers whereas pgstat_count_buffer_read_time is */
/* called for temp relations and shared buffers. */
/*
* is this intentional and should I match current behavior or
* not?
*/

It's hard to see how that behaviour could be intentional. Probably worth
fixing in a separate patch. I don't think we're going to backpatch, but it
would make this clearer nonetheless.
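For reference, the asymmetry the TODO describes can be modeled with a small sketch. The enum and struct names are hypothetical; the branch structure is assumed to mirror the pre-fix code quoted in the TODO, where database write time is counted only for shared relations while database read time is counted for both.

```c
#include <stdint.h>

/* Hypothetical stand-ins for IOObject and IOOp. */
typedef enum
{
	DEMO_OBJ_RELATION,
	DEMO_OBJ_TEMP_RELATION
} demo_io_object;

typedef enum
{
	DEMO_OP_READ,
	DEMO_OP_WRITE
} demo_io_op;

/* Hypothetical stand-ins for pgStatBlockReadTime / pgStatBlockWriteTime. */
typedef struct
{
	int64_t		db_read_time_us;
	int64_t		db_write_time_us;
} demo_db_times;

/* Models the old behavior: temp-relation writes are silently skipped. */
static void
demo_count_time_old(demo_db_times *t, demo_io_object obj, demo_io_op op,
					int64_t elapsed_us)
{
	if (op == DEMO_OP_WRITE)
	{
		if (obj == DEMO_OBJ_RELATION)
			t->db_write_time_us += elapsed_us;
	}
	else if (op == DEMO_OP_READ)
		t->db_read_time_us += elapsed_us;
}
```

With this model, a temp-relation write contributes nothing to the write total, whereas a temp-relation read is counted, which is the inconsistency a separate patch would remove.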

Attached v7 does this in separate commits.

Remaining feedback is about FlushLocalBuffers(). Is the idea simply to
get it into bufmgr.c because that is cleaner from an API perspective?

- Melanie

Attachments:

v7-0002-FlushRelationBuffers-counts-temp-relation-IO-timi.patch (text/x-patch; charset=US-ASCII)
From a7ba3cce6dbbde49efa5b20e2db5cd49c259d3ad Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 21 Mar 2023 18:20:44 -0400
Subject: [PATCH v7 2/4] FlushRelationBuffers() counts temp relation IO timing

Add pgstat_database and pgBufferUsage IO timing counting to
FlushRelationBuffers() for writes of temporary relations.
---
 src/backend/storage/buffer/bufmgr.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 0a05577b68..dea2e8fe40 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3591,6 +3591,8 @@ FlushRelationBuffers(Relation rel)
 {
 	int			i;
 	BufferDesc *bufHdr;
+	instr_time	io_start,
+				io_time;
 
 	if (RelationUsesLocalBuffers(rel))
 	{
@@ -3616,17 +3618,33 @@ FlushRelationBuffers(Relation rel)
 
 				PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
+				if (track_io_timing)
+					INSTR_TIME_SET_CURRENT(io_start);
+				else
+					INSTR_TIME_SET_ZERO(io_start);
+
 				smgrwrite(RelationGetSmgr(rel),
 						  BufTagGetForkNum(&bufHdr->tag),
 						  bufHdr->tag.blockNum,
 						  localpage,
 						  false);
 
+
 				buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
 				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
 
 				pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
 
+				if (track_io_timing)
+				{
+					INSTR_TIME_SET_CURRENT(io_time);
+					INSTR_TIME_SUBTRACT(io_time, io_start);
+					pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
+					INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
+				}
+
+				pgBufferUsage.local_blks_written++;
+
 				/* Pop the error context stack */
 				error_context_stack = errcallback.previous;
 			}
-- 
2.37.2

v7-0001-Count-IO-time-for-temp-relation-writes.patch (text/x-patch; charset=US-ASCII)
From 1088cde0ea0d39b7e55cd919f4c9151136f36e28 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 21 Mar 2023 16:00:55 -0400
Subject: [PATCH v7 1/4] Count IO time for temp relation writes

Both pgstat_database and pgBufferUsage write times failed to count
timing for flushes of dirty local buffers when acquiring a new local
buffer for a temporary relation block.
---
 src/backend/storage/buffer/localbuf.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 5325ddb663..80510411ae 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -114,6 +114,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 	LocalBufferLookupEnt *hresult;
 	BufferDesc *bufHdr;
 	int			b;
+	instr_time	io_start,
+				io_time;
 	int			trycounter;
 	bool		found;
 	uint32		buf_state;
@@ -228,6 +230,11 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 
 		PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
+		if (track_io_timing)
+			INSTR_TIME_SET_CURRENT(io_start);
+		else
+			INSTR_TIME_SET_ZERO(io_start);
+
 		/* And write... */
 		smgrwrite(oreln,
 				  BufTagGetForkNum(&bufHdr->tag),
@@ -240,6 +247,15 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
 
 		pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
+
+		if (track_io_timing)
+		{
+			INSTR_TIME_SET_CURRENT(io_time);
+			INSTR_TIME_SUBTRACT(io_time, io_start);
+			pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
+			INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
+		}
+
 		pgBufferUsage.local_blks_written++;
 	}
 
-- 
2.37.2

v7-0004-pgstat_database-uses-pgstat_io-time-counters.patch (text/x-patch; charset=US-ASCII)
From 664c25813452060536a5f4afe49feae44ed17187 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 21 Mar 2023 20:36:10 -0400
Subject: [PATCH v7 4/4] pgstat_database uses pgstat_io time counters

Use pgstat_io's pending counters to increment pgStatBlockWriteTime and
pgStatBlockReadTime.
---
 src/backend/utils/activity/pgstat_io.c | 14 ++++++++------
 src/include/pgstat.h                   |  4 ----
 2 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 0cf4b78947..c22c178078 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -124,13 +124,11 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 
 		if (io_op == IOOP_WRITE)
 		{
-			pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
 			if (io_object == IOOBJECT_RELATION)
 				INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
 		}
 		else if (io_op == IOOP_READ)
 		{
-			pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
 			if (io_object == IOOBJECT_RELATION)
 				INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
 		}
@@ -181,15 +179,19 @@ pgstat_flush_io(bool nowait)
 		{
 			for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
 			{
-				instr_time	time;
+				PgStat_Counter time;
 
 				bktype_shstats->counts[io_object][io_context][io_op] +=
 					PendingIOStats.counts[io_object][io_context][io_op];
 
-				time = PendingIOStats.pending_times[io_object][io_context][io_op];
+				time = INSTR_TIME_GET_MICROSEC(PendingIOStats.pending_times[io_object][io_context][io_op]);
 
-				bktype_shstats->times[io_object][io_context][io_op] +=
-					INSTR_TIME_GET_MICROSEC(time);
+				bktype_shstats->times[io_object][io_context][io_op] += time;
+
+				if (io_op == IOOP_WRITE)
+					pgStatBlockWriteTime += time;
+				else if (io_op == IOOP_READ)
+					pgStatBlockReadTime += time;
 			}
 		}
 	}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 4adbf29fdf..3904182075 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -526,10 +526,6 @@ extern void pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount);
 extern void pgstat_report_checksum_failure(void);
 extern void pgstat_report_connect(Oid dboid);
 
-#define pgstat_count_buffer_read_time(n)							\
-	(pgStatBlockReadTime += (n))
-#define pgstat_count_buffer_write_time(n)							\
-	(pgStatBlockWriteTime += (n))
 #define pgstat_count_conn_active_time(n)							\
 	(pgStatActiveTime += (n))
 #define pgstat_count_conn_txn_idle_time(n)							\
-- 
2.37.2

v7-0003-Track-IO-times-in-pg_stat_io.patch (text/x-patch; charset=US-ASCII)
From 12ce42fc1093ac37736e98ccb7d38ff030f435c2 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 21 Mar 2023 19:02:01 -0400
Subject: [PATCH v7 3/4] Track IO times in pg_stat_io

Add IO timing for reads, writes, extends, and fsyncs to pg_stat_io.

Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/flat/CAAKRu_ay5iKmnbXZ3DsauViF3eMxu4m1oNnJXqV_HyqYeg55Ww%40mail.gmail.com
---
 doc/src/sgml/monitoring.sgml           |  59 ++++++++++
 src/backend/catalog/system_views.sql   |   4 +
 src/backend/storage/buffer/bufmgr.c    |  69 ++++--------
 src/backend/storage/buffer/localbuf.c  |  20 +---
 src/backend/storage/smgr/md.c          |  31 ++++--
 src/backend/utils/activity/pgstat_io.c | 142 +++++++++++++++++++++----
 src/backend/utils/adt/pgstatfuncs.c    |  48 ++++++++-
 src/include/catalog/pg_proc.dat        |   6 +-
 src/include/pgstat.h                   |   7 +-
 src/test/regress/expected/rules.out    |   6 +-
 10 files changed, 288 insertions(+), 104 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 6249bb50d0..ad3667f258 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3814,6 +3814,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>read_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in read operations in milliseconds (if <xref
+        linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
@@ -3826,6 +3838,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>write_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in write operations in milliseconds (if <xref
+        linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
@@ -3838,6 +3862,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>extend_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in extend operations in milliseconds (if <xref
+        linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
@@ -3902,6 +3938,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>fsync_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in fsync operations in milliseconds (if <xref
+        linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
@@ -3967,6 +4015,17 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
    </itemizedlist>
   </para>
 
+  <note>
+   <para>
+    Columns tracking I/O time will only be non-zero when <xref
+    linkend="guc-track-io-timing"/> is enabled. The user should be careful when
+    using these columns in combination with their corresponding operations to
+    ensure that <varname>track_io_timing</varname> was enabled for the entire
+    time since the last reset.
+   </para>
+  </note>
+
+
 
  </sect2>
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 34ca0e739f..39391bc2fc 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1123,12 +1123,16 @@ SELECT
        b.io_object,
        b.io_context,
        b.reads,
+       b.read_time,
        b.writes,
+       b.write_time,
        b.extends,
+       b.extend_time,
        b.op_bytes,
        b.evictions,
        b.reuses,
        b.fsyncs,
+       b.fsync_time,
        b.stats_reset
 FROM pg_stat_get_io() b;
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index dea2e8fe40..4467a24853 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1000,12 +1000,18 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	if (isExtend)
 	{
+		instr_time	io_start;
+
 		/* new buffers are zero-filled */
 		MemSet((char *) bufBlock, 0, BLCKSZ);
+
+		io_start = pgstat_prepare_io_time();
+
 		/* don't set checksum for all-zero page */
 		smgrextend(smgr, forkNum, blockNum, bufBlock, false);
 
-		pgstat_count_io_op(io_object, io_context, IOOP_EXTEND);
+		pgstat_count_io_op_time(io_object, io_context,
+								IOOP_EXTEND, io_start);
 
 		/*
 		 * NB: we're *not* doing a ScheduleBufferTagForWriteback here;
@@ -1024,25 +1030,14 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			MemSet((char *) bufBlock, 0, BLCKSZ);
 		else
 		{
-			instr_time	io_start,
-						io_time;
+			instr_time	io_start;
 
-			if (track_io_timing)
-				INSTR_TIME_SET_CURRENT(io_start);
-			else
-				INSTR_TIME_SET_ZERO(io_start);
+			io_start = pgstat_prepare_io_time();
 
 			smgrread(smgr, forkNum, blockNum, bufBlock);
 
-			pgstat_count_io_op(io_object, io_context, IOOP_READ);
-
-			if (track_io_timing)
-			{
-				INSTR_TIME_SET_CURRENT(io_time);
-				INSTR_TIME_SUBTRACT(io_time, io_start);
-				pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
-				INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
-			}
+			pgstat_count_io_op_time(io_object, io_context,
+									IOOP_READ, io_start);
 
 			/* check for garbage data */
 			if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
@@ -2848,6 +2843,7 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
 	*blknum = bufHdr->tag.blockNum;
 }
 
+
 /*
  * FlushBuffer
  *		Physically write out a shared buffer.
@@ -2873,8 +2869,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 {
 	XLogRecPtr	recptr;
 	ErrorContextCallback errcallback;
-	instr_time	io_start,
-				io_time;
+	instr_time	io_start;
 	Block		bufBlock;
 	char	   *bufToWrite;
 	uint32		buf_state;
@@ -2949,10 +2944,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 */
 	bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
 
-	if (track_io_timing)
-		INSTR_TIME_SET_CURRENT(io_start);
-	else
-		INSTR_TIME_SET_ZERO(io_start);
+	io_start = pgstat_prepare_io_time();
 
 	/*
 	 * bufToWrite is either the shared buffer or a copy, as appropriate.
@@ -2981,15 +2973,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * When a strategy is not in use, the write can only be a "regular" write
 	 * of a dirty shared buffer (IOCONTEXT_NORMAL IOOP_WRITE).
 	 */
-	pgstat_count_io_op(IOOBJECT_RELATION, io_context, IOOP_WRITE);
-
-	if (track_io_timing)
-	{
-		INSTR_TIME_SET_CURRENT(io_time);
-		INSTR_TIME_SUBTRACT(io_time, io_start);
-		pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
-		INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
-	}
+	pgstat_count_io_op_time(IOOBJECT_RELATION, io_context,
+							IOOP_WRITE, io_start);
 
 	pgBufferUsage.shared_blks_written++;
 
@@ -3591,14 +3576,13 @@ FlushRelationBuffers(Relation rel)
 {
 	int			i;
 	BufferDesc *bufHdr;
-	instr_time	io_start,
-				io_time;
 
 	if (RelationUsesLocalBuffers(rel))
 	{
 		for (i = 0; i < NLocBuffer; i++)
 		{
 			uint32		buf_state;
+			instr_time	io_start;
 
 			bufHdr = GetLocalBufferDescriptor(i);
 			if (BufTagMatchesRelFileLocator(&bufHdr->tag, &rel->rd_locator) &&
@@ -3618,10 +3602,7 @@ FlushRelationBuffers(Relation rel)
 
 				PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
-				if (track_io_timing)
-					INSTR_TIME_SET_CURRENT(io_start);
-				else
-					INSTR_TIME_SET_ZERO(io_start);
+				io_start = pgstat_prepare_io_time();
 
 				smgrwrite(RelationGetSmgr(rel),
 						  BufTagGetForkNum(&bufHdr->tag),
@@ -3629,19 +3610,12 @@ FlushRelationBuffers(Relation rel)
 						  localpage,
 						  false);
 
-
 				buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
 				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
 
-				pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
-
-				if (track_io_timing)
-				{
-					INSTR_TIME_SET_CURRENT(io_time);
-					INSTR_TIME_SUBTRACT(io_time, io_start);
-					pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
-					INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
-				}
+				pgstat_count_io_op_time(IOOBJECT_TEMP_RELATION,
+										IOCONTEXT_NORMAL, IOOP_WRITE,
+										io_start);
 
 				pgBufferUsage.local_blks_written++;
 
@@ -3650,6 +3624,7 @@ FlushRelationBuffers(Relation rel)
 			}
 		}
 
+
 		return;
 	}
 
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 80510411ae..86beb40a6f 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -114,8 +114,6 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 	LocalBufferLookupEnt *hresult;
 	BufferDesc *bufHdr;
 	int			b;
-	instr_time	io_start,
-				io_time;
 	int			trycounter;
 	bool		found;
 	uint32		buf_state;
@@ -222,6 +220,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 	 */
 	if (buf_state & BM_DIRTY)
 	{
+		instr_time	io_start;
 		SMgrRelation oreln;
 		Page		localpage = (char *) LocalBufHdrGetBlock(bufHdr);
 
@@ -230,10 +229,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 
 		PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
-		if (track_io_timing)
-			INSTR_TIME_SET_CURRENT(io_start);
-		else
-			INSTR_TIME_SET_ZERO(io_start);
+		io_start = pgstat_prepare_io_time();
 
 		/* And write... */
 		smgrwrite(oreln,
@@ -246,16 +242,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 		buf_state &= ~BM_DIRTY;
 		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
 
-		pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
-
-		if (track_io_timing)
-		{
-			INSTR_TIME_SET_CURRENT(io_time);
-			INSTR_TIME_SUBTRACT(io_time, io_start);
-			pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
-			INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
-		}
-
+		pgstat_count_io_op_time(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL,
+								IOOP_WRITE, io_start);
 		pgBufferUsage.local_blks_written++;
 	}
 
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 352958e1fe..51a9134d57 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1030,6 +1030,19 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 
 	if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false /* retryOnError */ ))
 	{
+		instr_time	io_start;
+
+		ereport(DEBUG1,
+				(errmsg_internal("could not forward fsync request because request queue is full")));
+
+		io_start = pgstat_prepare_io_time();
+
+		if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
+			ereport(data_sync_elevel(ERROR),
+					(errcode_for_file_access(),
+					 errmsg("could not fsync file \"%s\": %m",
+							FilePathName(seg->mdfd_vfd))));
+
 		/*
 		 * We have no way of knowing if the current IOContext is
 		 * IOCONTEXT_NORMAL or IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] at this
@@ -1041,16 +1054,8 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 		 * IOCONTEXT_NORMAL is likely clearer when investigating the number of
 		 * backend fsyncs.
 		 */
-		pgstat_count_io_op(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC);
-
-		ereport(DEBUG1,
-				(errmsg_internal("could not forward fsync request because request queue is full")));
-
-		if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
-			ereport(data_sync_elevel(ERROR),
-					(errcode_for_file_access(),
-					 errmsg("could not fsync file \"%s\": %m",
-							FilePathName(seg->mdfd_vfd))));
+		pgstat_count_io_op_time(IOOBJECT_RELATION, IOCONTEXT_NORMAL,
+								IOOP_FSYNC, io_start);
 	}
 }
 
@@ -1399,6 +1404,7 @@ int
 mdsyncfiletag(const FileTag *ftag, char *path)
 {
 	SMgrRelation reln = smgropen(ftag->rlocator, InvalidBackendId);
+	instr_time	io_start;
 	File		file;
 	bool		need_to_close;
 	int			result,
@@ -1425,6 +1431,8 @@ mdsyncfiletag(const FileTag *ftag, char *path)
 		need_to_close = true;
 	}
 
+	io_start = pgstat_prepare_io_time();
+
 	/* Sync the file. */
 	result = FileSync(file, WAIT_EVENT_DATA_FILE_SYNC);
 	save_errno = errno;
@@ -1432,7 +1440,8 @@ mdsyncfiletag(const FileTag *ftag, char *path)
 	if (need_to_close)
 		FileClose(file);
 
-	pgstat_count_io_op(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC);
+	pgstat_count_io_op_time(IOOBJECT_RELATION, IOCONTEXT_NORMAL,
+							IOOP_FSYNC, io_start);
 
 	errno = save_errno;
 	return result;
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index af5d554610..0cf4b78947 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -16,45 +16,65 @@
 
 #include "postgres.h"
 
+#include "executor/instrument.h"
+#include "storage/bufmgr.h"
 #include "utils/pgstat_internal.h"
 
 
-static PgStat_BktypeIO PendingIOStats;
+typedef struct PgStat_PendingIO
+{
+	PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	instr_time	pending_times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+}			PgStat_PendingIO;
+
+
+static PgStat_PendingIO PendingIOStats;
 bool		have_iostats = false;
 
 
 /*
  * Check that stats have not been counted for any combination of IOObject,
- * IOContext, and IOOp which are not tracked for the passed-in BackendType. The
- * passed-in PgStat_BktypeIO must contain stats from the BackendType specified
- * by the second parameter. Caller is responsible for locking the passed-in
- * PgStat_BktypeIO, if needed.
+ * IOContext, and IOOp which are not tracked for the passed-in BackendType. If
+ * the IOOp is not counted for this combination but IO time is otherwise
+ * tracked for this IOOp, check that IO time has not been counted for this
+ * combination. If stats are tracked for this combination and IO times are
+ * non-zero, counts should be non-zero.
+ *
+ * The passed-in PgStat_BktypeIO must contain stats from the BackendType
+ * specified by the second parameter. Caller is responsible for locking the
+ * passed-in PgStat_BktypeIO, if needed.
  */
 bool
 pgstat_bktype_io_stats_valid(PgStat_BktypeIO *backend_io,
 							 BackendType bktype)
 {
-	bool		bktype_tracked = pgstat_tracks_io_bktype(bktype);
-
 	for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
 	{
 		for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
 		{
-			/*
-			 * Don't bother trying to skip to the next loop iteration if
-			 * pgstat_tracks_io_object() would return false here. We still
-			 * need to validate that each counter is zero anyway.
-			 */
 			for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
 			{
-				/* No stats, so nothing to validate */
-				if (backend_io->data[io_object][io_context][io_op] == 0)
+				/* we do track it */
+				if (pgstat_tracks_io_op(bktype, io_object, io_context, io_op))
+				{
+					/* ensure that if IO times are non-zero, counts are > 0 */
+					if (backend_io->times[io_object][io_context][io_op] != 0 &&
+						backend_io->counts[io_object][io_context][io_op] <= 0)
+						return false;
+
 					continue;
+				}
 
-				/* There are stats and there shouldn't be */
-				if (!bktype_tracked ||
-					!pgstat_tracks_io_op(bktype, io_object, io_context, io_op))
+				/* we don't track it, and it is not 0 */
+				if (backend_io->counts[io_object][io_context][io_op] != 0)
 					return false;
+
+				/* we don't track this IOOp, so make sure its IO time is zero */
+				if (pgstat_tracks_io_time(io_op) > -1)
+				{
+					if (backend_io->times[io_object][io_context][io_op] != 0)
+						return false;
+				}
 			}
 		}
 	}
@@ -62,6 +82,19 @@ pgstat_bktype_io_stats_valid(PgStat_BktypeIO *backend_io,
 	return true;
 }
 
+instr_time
+pgstat_prepare_io_time(void)
+{
+	instr_time	io_start;
+
+	if (track_io_timing)
+		INSTR_TIME_SET_CURRENT(io_start);
+	else
+		INSTR_TIME_SET_ZERO(io_start);
+
+	return io_start;
+}
+
 void
 pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op)
 {
@@ -70,11 +103,44 @@ pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op)
 	Assert((unsigned int) io_op < IOOP_NUM_TYPES);
 	Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
 
-	PendingIOStats.data[io_object][io_context][io_op]++;
+	PendingIOStats.counts[io_object][io_context][io_op]++;
 
 	have_iostats = true;
 }
 
+/*
+ * Like pgstat_count_io_op() except it also accumulates time.
+ */
+void
+pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
+						instr_time start_time)
+{
+	if (track_io_timing)
+	{
+		instr_time	io_time;
+
+		INSTR_TIME_SET_CURRENT(io_time);
+		INSTR_TIME_SUBTRACT(io_time, start_time);
+
+		if (io_op == IOOP_WRITE)
+		{
+			pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
+			if (io_object == IOOBJECT_RELATION)
+				INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
+		}
+		else if (io_op == IOOP_READ)
+		{
+			pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
+			if (io_object == IOOBJECT_RELATION)
+				INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
+		}
+
+		INSTR_TIME_ADD(PendingIOStats.pending_times[io_object][io_context][io_op], io_time);
+	}
+
+	pgstat_count_io_op(io_object, io_context, io_op);
+}
+
 PgStat_IO *
 pgstat_fetch_stat_io(void)
 {
@@ -114,8 +180,17 @@ pgstat_flush_io(bool nowait)
 		for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
 		{
 			for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
-				bktype_shstats->data[io_object][io_context][io_op] +=
-					PendingIOStats.data[io_object][io_context][io_op];
+			{
+				instr_time	time;
+
+				bktype_shstats->counts[io_object][io_context][io_op] +=
+					PendingIOStats.counts[io_object][io_context][io_op];
+
+				time = PendingIOStats.pending_times[io_object][io_context][io_op];
+
+				bktype_shstats->times[io_object][io_context][io_op] +=
+					INSTR_TIME_GET_MICROSEC(time);
+			}
 		}
 	}
 
@@ -384,3 +459,30 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 
 	return true;
 }
+
+/*
+ * PgStat_BktypeIO->times contains IO times for IOOps. For simplicity this
+ * array has a spot for every IOOp. pgstat_tracks_io_time() is the source of
+ * truth for which IOOps have corresponding IO times.
+ */
+IOOp
+pgstat_tracks_io_time(IOOp io_op)
+{
+	switch (io_op)
+	{
+		case IOOP_READ:
+			return IOOP_READ;
+		case IOOP_WRITE:
+			return IOOP_WRITE;
+		case IOOP_EXTEND:
+			return IOOP_EXTEND;
+		case IOOP_FSYNC:
+			return IOOP_FSYNC;
+		case IOOP_EVICT:
+		case IOOP_REUSE:
+			return -1;
+	}
+
+	elog(ERROR, "unrecognized IOOp value: %d", io_op);
+	pg_unreachable();
+}
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 35c6d46555..60fdda1a14 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1255,12 +1255,16 @@ typedef enum io_stat_col
 	IO_COL_IO_OBJECT,
 	IO_COL_IO_CONTEXT,
 	IO_COL_READS,
+	IO_COL_READ_TIME,
 	IO_COL_WRITES,
+	IO_COL_WRITE_TIME,
 	IO_COL_EXTENDS,
+	IO_COL_EXTEND_TIME,
 	IO_COL_CONVERSION,
 	IO_COL_EVICTIONS,
 	IO_COL_REUSES,
 	IO_COL_FSYNCS,
+	IO_COL_FSYNC_TIME,
 	IO_COL_RESET_TIME,
 	IO_NUM_COLUMNS,
 } io_stat_col;
@@ -1292,6 +1296,28 @@ pgstat_get_io_op_index(IOOp io_op)
 	pg_unreachable();
 }
 
+/*
+ * Get the number of the column containing IO times for the specified IOOp. If
+ * the specified IOOp is one for which IO time is not tracked, return -1. Note
+ * that this function assumes that IO time for an IOOp is displayed in the view
+ * in the column directly after the IOOp counts.
+ */
+static io_stat_col
+pgstat_get_io_time_index(IOOp io_op)
+{
+	if (pgstat_tracks_io_time(io_op) == -1)
+		return -1;
+
+	return pgstat_get_io_op_index(io_op) + 1;
+}
+
+static inline
+double
+pg_stat_micro_to_millisecs(PgStat_Counter val_microsec)
+{
+	return val_microsec * 0.001;
+}
+
 Datum
 pg_stat_get_io(PG_FUNCTION_ARGS)
 {
@@ -1359,20 +1385,32 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
 
 				for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
 				{
-					int			col_idx = pgstat_get_io_op_index(io_op);
+					PgStat_Counter count = bktype_stats->counts[io_obj][io_context][io_op];
+					int			i = pgstat_get_io_op_index(io_op);
 
 					/*
 					 * Some combinations of BackendType and IOOp, of IOContext
 					 * and IOOp, and of IOObject and IOOp are not tracked. Set
 					 * these cells in the view NULL.
 					 */
-					nulls[col_idx] = !pgstat_tracks_io_op(bktype, io_obj, io_context, io_op);
+					if (pgstat_tracks_io_op(bktype, io_obj, io_context, io_op))
+						values[i] = Int64GetDatum(count);
+					else
+						nulls[i] = true;
+				}
 
-					if (nulls[col_idx])
+				for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+				{
+					PgStat_Counter time = bktype_stats->times[io_obj][io_context][io_op];
+					int			i = pgstat_get_io_time_index(io_op);
+
+					if (i == -1)
 						continue;
 
-					values[col_idx] =
-						Int64GetDatum(bktype_stats->data[io_obj][io_context][io_op]);
+					if (!nulls[pgstat_get_io_op_index(io_op)])
+						values[i] = Float8GetDatum(pg_stat_micro_to_millisecs(time));
+					else
+						nulls[i] = true;
 				}
 
 				tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 5cf87aeb2c..4148ec09ae 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5744,9 +5744,9 @@
   proname => 'pg_stat_get_io', provolatile => 'v',
   prorows => '30', proretset => 't',
   proparallel => 'r', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{text,text,text,int8,int8,int8,int8,int8,int8,int8,timestamptz}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{backend_type,io_object,io_context,reads,writes,extends,op_bytes,evictions,reuses,fsyncs,stats_reset}',
+  proallargtypes => '{text,text,text,int8,float8,int8,float8,int8,float8,int8,int8,int8,int8,float8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{backend_type,io_object,io_context,reads,read_time,writes,write_time,extends,extend_time,op_bytes,evictions,reuses,fsyncs,fsync_time,stats_reset}',
   prosrc => 'pg_stat_get_io' },
 
 { oid => '1136', descr => 'statistics: information about WAL activity',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1e418b682b..4adbf29fdf 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -304,7 +304,8 @@ typedef enum IOOp
 
 typedef struct PgStat_BktypeIO
 {
-	PgStat_Counter data[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	PgStat_Counter times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 } PgStat_BktypeIO;
 
 typedef struct PgStat_IO
@@ -497,7 +498,10 @@ extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
 
 extern bool pgstat_bktype_io_stats_valid(PgStat_BktypeIO *context_ops,
 										 BackendType bktype);
+extern instr_time pgstat_prepare_io_time(void);
 extern void pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op);
+extern void pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op, instr_time time);
+
 extern PgStat_IO *pgstat_fetch_stat_io(void);
 extern const char *pgstat_get_io_context_name(IOContext io_context);
 extern const char *pgstat_get_io_object_name(IOObject io_object);
@@ -507,6 +511,7 @@ extern bool pgstat_tracks_io_object(BackendType bktype,
 									IOObject io_object, IOContext io_context);
 extern bool pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 								IOContext io_context, IOOp io_op);
+extern IOOp pgstat_tracks_io_time(IOOp io_op);
 
 
 /*
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index e953d1f515..5434851314 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1880,14 +1880,18 @@ pg_stat_io| SELECT backend_type,
     io_object,
     io_context,
     reads,
+    read_time,
     writes,
+    write_time,
     extends,
+    extend_time,
     op_bytes,
     evictions,
     reuses,
     fsyncs,
+    fsync_time,
     stats_reset
-   FROM pg_stat_get_io() b(backend_type, io_object, io_context, reads, writes, extends, op_bytes, evictions, reuses, fsyncs, stats_reset);
+   FROM pg_stat_get_io() b(backend_type, io_object, io_context, reads, read_time, writes, write_time, extends, extend_time, op_bytes, evictions, reuses, fsyncs, fsync_time, stats_reset);
 pg_stat_progress_analyze| SELECT s.pid,
     s.datid,
     d.datname,
-- 
2.37.2
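
For readers following the thread, the start/finish timing pattern that the v1 patch factors into pgstat_prepare_io_time() and pgstat_count_io_op_time() can be illustrated with a small standalone C sketch. Nothing below is PostgreSQL code: the names io_timer_start(), io_count_op_time(), track_timing, pending_counts and pending_ns are hypothetical stand-ins, and POSIX clock_gettime() is assumed in place of the instr_time macros. The point it tries to show is that the count is always incremented, while elapsed time is only accumulated when timing is enabled.

```c
#define _POSIX_C_SOURCE 199309L

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-ins for the IOOp values that have a time column. */
enum { OP_READ, OP_WRITE, OP_EXTEND, OP_FSYNC, OP_NUM_TYPES };

static bool track_timing = true;	/* stand-in for track_io_timing */
static int64_t pending_counts[OP_NUM_TYPES];
static int64_t pending_ns[OP_NUM_TYPES];

/* Read a monotonic clock as nanoseconds. */
static int64_t
now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t) ts.tv_sec * 1000000000 + ts.tv_nsec;
}

/* Take a start timestamp only when timing is enabled, else return zero. */
static int64_t
io_timer_start(void)
{
	return track_timing ? now_ns() : 0;
}

/*
 * Count one operation. The elapsed time is added only when timing is
 * enabled, so callers need no conditional of their own.
 */
static void
io_count_op_time(int op, int64_t start_ns)
{
	assert(op >= 0 && op < OP_NUM_TYPES);

	if (track_timing)
		pending_ns[op] += now_ns() - start_ns;

	pending_counts[op]++;
}
```

A caller would bracket the I/O call with the two helpers: take the start value, perform the read or write, then call io_count_op_time() with the operation and the start value. With timing disabled the start value is zero and only the counter moves.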

#17Melanie Plageman
melanieplageman@gmail.com
In reply to: Melanie Plageman (#16)
4 attachment(s)
Re: Track IO times in pg_stat_io

Attached is a version rebased over commit 8aaa04b32d.

- Melanie

Attachments:

v8-0001-Count-IO-time-for-temp-relation-writes.patch (text/x-patch; charset=US-ASCII)
From 789d4bf1fb749a26523dbcd2c69795916b711c68 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 21 Mar 2023 16:00:55 -0400
Subject: [PATCH v8 1/4] Count IO time for temp relation writes

Both pgstat_database and pgBufferUsage write times failed to count
timing for flushes of dirty local buffers when acquiring a new local
buffer for a temporary relation block.
---
 src/backend/storage/buffer/localbuf.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 6f9e7eda57..ecccb6c1a9 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -114,6 +114,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 	LocalBufferLookupEnt *hresult;
 	BufferDesc *bufHdr;
 	int			b;
+	instr_time	io_start,
+				io_time;
 	int			trycounter;
 	bool		found;
 	uint32		buf_state;
@@ -220,6 +222,11 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 
 		PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
+		if (track_io_timing)
+			INSTR_TIME_SET_CURRENT(io_start);
+		else
+			INSTR_TIME_SET_ZERO(io_start);
+
 		/* And write... */
 		smgrwrite(oreln,
 				  BufTagGetForkNum(&bufHdr->tag),
@@ -233,6 +240,15 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 
 		/* Temporary table I/O does not use Buffer Access Strategies */
 		pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
+
+		if (track_io_timing)
+		{
+			INSTR_TIME_SET_CURRENT(io_time);
+			INSTR_TIME_SUBTRACT(io_time, io_start);
+			pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
+			INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
+		}
+
 		pgBufferUsage.local_blks_written++;
 	}
 
-- 
2.37.2

v8-0004-pgstat_database-uses-pgstat_io-time-counters.patch (text/x-patch; charset=US-ASCII)
From 726ab546a11707baa167106d5b6266451dfae445 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 21 Mar 2023 20:36:10 -0400
Subject: [PATCH v8 4/4] pgstat_database uses pgstat_io time counters

Use pgstat_io's pending counters to increment pgStatBlockWriteTime and
pgStatBlockReadTime.
---
 src/backend/utils/activity/pgstat_io.c | 14 ++++++++------
 src/include/pgstat.h                   |  4 ----
 2 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 4e98c4749a..905566decd 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -124,13 +124,11 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 
 		if (io_op == IOOP_WRITE)
 		{
-			pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
 			if (io_object == IOOBJECT_RELATION)
 				INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
 		}
 		else if (io_op == IOOP_READ)
 		{
-			pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
 			if (io_object == IOOBJECT_RELATION)
 				INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
 		}
@@ -181,15 +179,19 @@ pgstat_flush_io(bool nowait)
 		{
 			for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
 			{
-				instr_time	time;
+				PgStat_Counter time;
 
 				bktype_shstats->counts[io_object][io_context][io_op] +=
 					PendingIOStats.counts[io_object][io_context][io_op];
 
-				time = PendingIOStats.pending_times[io_object][io_context][io_op];
+				time = INSTR_TIME_GET_MICROSEC(PendingIOStats.pending_times[io_object][io_context][io_op]);
 
-				bktype_shstats->times[io_object][io_context][io_op] +=
-					INSTR_TIME_GET_MICROSEC(time);
+				bktype_shstats->times[io_object][io_context][io_op] += time;
+
+				if (io_op == IOOP_WRITE)
+					pgStatBlockWriteTime += time;
+				else if (io_op == IOOP_READ)
+					pgStatBlockReadTime += time;
 			}
 		}
 	}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index bf54c6defe..833476a2bb 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -544,10 +544,6 @@ extern void pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount);
 extern void pgstat_report_checksum_failure(void);
 extern void pgstat_report_connect(Oid dboid);
 
-#define pgstat_count_buffer_read_time(n)							\
-	(pgStatBlockReadTime += (n))
-#define pgstat_count_buffer_write_time(n)							\
-	(pgStatBlockWriteTime += (n))
 #define pgstat_count_conn_active_time(n)							\
 	(pgStatActiveTime += (n))
 #define pgstat_count_conn_txn_idle_time(n)							\
-- 
2.37.2
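
The idea in v8-0004, converting each pending time once at flush and feeding the same value to both the per-operation totals and the database-wide read/write totals, can be sketched in standalone C. This is only an illustration under stated assumptions: flush_pending_io(), the array names, and the nanosecond pending representation are hypothetical, db_read_us and db_write_us merely stand in for pgStatBlockReadTime and pgStatBlockWriteTime, and the reset of the pending arrays at the end is an assumption about the surrounding code that this hunk does not show.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the IOOp values that have a time column. */
enum { OP_READ, OP_WRITE, OP_EXTEND, OP_FSYNC, OP_NUM_TYPES };

/* Backend-local pending state; times kept in nanoseconds (assumption). */
static int64_t pending_counts[OP_NUM_TYPES];
static int64_t pending_ns[OP_NUM_TYPES];

/* Shared totals; times kept in microseconds. */
static int64_t shared_counts[OP_NUM_TYPES];
static int64_t shared_us[OP_NUM_TYPES];
static int64_t db_read_us;		/* stand-in for pgStatBlockReadTime */
static int64_t db_write_us;		/* stand-in for pgStatBlockWriteTime */

/*
 * Move pending counts and times into the shared totals. Each pending time
 * is converted to microseconds once and reused for both destinations.
 */
static void
flush_pending_io(void)
{
	for (int op = 0; op < OP_NUM_TYPES; op++)
	{
		int64_t		us = pending_ns[op] / 1000;

		assert(pending_counts[op] >= 0);

		shared_counts[op] += pending_counts[op];
		shared_us[op] += us;

		if (op == OP_WRITE)
			db_write_us += us;
		else if (op == OP_READ)
			db_read_us += us;

		/* Reset pending state so nothing is flushed twice (assumption). */
		pending_counts[op] = 0;
		pending_ns[op] = 0;
	}
}
```

Doing the database-wide accumulation at flush time keeps the per-I/O hot path down to a single addition into the pending array, at the cost of the database totals lagging until the next flush.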

v8-0003-Track-IO-times-in-pg_stat_io.patch (text/x-patch; charset=US-ASCII)
From 2bdad725133395ded199ecc726096e052d6e654b Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Fri, 31 Mar 2023 15:32:36 -0400
Subject: [PATCH v8 3/4] Track IO times in pg_stat_io

Add IO timing for reads, writes, extends, and fsyncs to pg_stat_io.

Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/flat/CAAKRu_ay5iKmnbXZ3DsauViF3eMxu4m1oNnJXqV_HyqYeg55Ww%40mail.gmail.com
---
 doc/src/sgml/monitoring.sgml           |  59 ++++++++++
 src/backend/catalog/system_views.sql   |   4 +
 src/backend/storage/buffer/bufmgr.c    |  67 ++++--------
 src/backend/storage/buffer/localbuf.c  |  21 +---
 src/backend/storage/smgr/md.c          |  31 ++++--
 src/backend/utils/activity/pgstat_io.c | 143 +++++++++++++++++++++----
 src/backend/utils/adt/pgstatfuncs.c    |  48 ++++++++-
 src/include/catalog/pg_proc.dat        |   6 +-
 src/include/pgstat.h                   |   7 +-
 src/test/regress/expected/rules.out    |   6 +-
 10 files changed, 287 insertions(+), 105 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index d5a45f996d..7b17d38887 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3814,6 +3814,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>read_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in read operations in milliseconds (if <xref
+        linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
@@ -3826,6 +3838,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>write_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in write operations in milliseconds (if <xref
+        linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
@@ -3838,6 +3862,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>extend_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in extend operations in milliseconds (if <xref
+        linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
@@ -3913,6 +3949,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>fsync_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in fsync operations in milliseconds (if <xref
+        linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
@@ -3978,6 +4026,17 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
    </itemizedlist>
   </para>
 
+  <note>
+   <para>
+    Columns tracking I/O time will only be non-zero when <xref
+    linkend="guc-track-io-timing"/> is enabled. When combining these columns
+    with their corresponding operation counts, for example to compute an
+    average time per operation, make sure <varname>track_io_timing</varname>
+    was enabled for the entire time since the last reset.
+   </para>
+  </note>
+
+
 
  </sect2>
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 574cbc2e44..ee2973c0a6 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1125,13 +1125,17 @@ SELECT
        b.io_object,
        b.io_context,
        b.reads,
+       b.read_time,
        b.writes,
+       b.write_time,
        b.extends,
+       b.extend_time,
        b.op_bytes,
        b.hits,
        b.evictions,
        b.reuses,
        b.fsyncs,
+       b.fsync_time,
        b.stats_reset
 FROM pg_stat_get_io() b;
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 05e98d5994..ebceff3b11 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -995,12 +995,18 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	if (isExtend)
 	{
+		instr_time	io_start;
+
 		/* new buffers are zero-filled */
 		MemSet((char *) bufBlock, 0, BLCKSZ);
+
+		io_start = pgstat_prepare_io_time();
+
 		/* don't set checksum for all-zero page */
 		smgrextend(smgr, forkNum, blockNum, bufBlock, false);
 
-		pgstat_count_io_op(io_object, io_context, IOOP_EXTEND);
+		pgstat_count_io_op_time(io_object, io_context,
+								IOOP_EXTEND, io_start);
 
 		/*
 		 * NB: we're *not* doing a ScheduleBufferTagForWriteback here;
@@ -1019,25 +1025,14 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			MemSet((char *) bufBlock, 0, BLCKSZ);
 		else
 		{
-			instr_time	io_start,
-						io_time;
+			instr_time	io_start;
 
-			if (track_io_timing)
-				INSTR_TIME_SET_CURRENT(io_start);
-			else
-				INSTR_TIME_SET_ZERO(io_start);
+			io_start = pgstat_prepare_io_time();
 
 			smgrread(smgr, forkNum, blockNum, bufBlock);
 
-			pgstat_count_io_op(io_object, io_context, IOOP_READ);
-
-			if (track_io_timing)
-			{
-				INSTR_TIME_SET_CURRENT(io_time);
-				INSTR_TIME_SUBTRACT(io_time, io_start);
-				pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
-				INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
-			}
+			pgstat_count_io_op_time(io_object, io_context,
+									IOOP_READ, io_start);
 
 			/* check for garbage data */
 			if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
@@ -2853,8 +2848,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 {
 	XLogRecPtr	recptr;
 	ErrorContextCallback errcallback;
-	instr_time	io_start,
-				io_time;
+	instr_time	io_start;
 	Block		bufBlock;
 	char	   *bufToWrite;
 	uint32		buf_state;
@@ -2929,10 +2923,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 */
 	bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
 
-	if (track_io_timing)
-		INSTR_TIME_SET_CURRENT(io_start);
-	else
-		INSTR_TIME_SET_ZERO(io_start);
+	io_start = pgstat_prepare_io_time();
 
 	/*
 	 * bufToWrite is either the shared buffer or a copy, as appropriate.
@@ -2961,15 +2952,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * When a strategy is not in use, the write can only be a "regular" write
 	 * of a dirty shared buffer (IOCONTEXT_NORMAL IOOP_WRITE).
 	 */
-	pgstat_count_io_op(IOOBJECT_RELATION, io_context, IOOP_WRITE);
-
-	if (track_io_timing)
-	{
-		INSTR_TIME_SET_CURRENT(io_time);
-		INSTR_TIME_SUBTRACT(io_time, io_start);
-		pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
-		INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
-	}
+	pgstat_count_io_op_time(IOOBJECT_RELATION, io_context,
+							IOOP_WRITE, io_start);
 
 	pgBufferUsage.shared_blks_written++;
 
@@ -3571,14 +3555,13 @@ FlushRelationBuffers(Relation rel)
 {
 	int			i;
 	BufferDesc *bufHdr;
-	instr_time	io_start,
-				io_time;
 
 	if (RelationUsesLocalBuffers(rel))
 	{
 		for (i = 0; i < NLocBuffer; i++)
 		{
 			uint32		buf_state;
+			instr_time	io_start;
 
 			bufHdr = GetLocalBufferDescriptor(i);
 			if (BufTagMatchesRelFileLocator(&bufHdr->tag, &rel->rd_locator) &&
@@ -3598,10 +3581,7 @@ FlushRelationBuffers(Relation rel)
 
 				PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
-				if (track_io_timing)
-					INSTR_TIME_SET_CURRENT(io_start);
-				else
-					INSTR_TIME_SET_ZERO(io_start);
+				io_start = pgstat_prepare_io_time();
 
 				smgrwrite(RelationGetSmgr(rel),
 						  BufTagGetForkNum(&bufHdr->tag),
@@ -3609,19 +3589,12 @@ FlushRelationBuffers(Relation rel)
 						  localpage,
 						  false);
 
-
 				buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
 				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
 
-				pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
-
-				if (track_io_timing)
-				{
-					INSTR_TIME_SET_CURRENT(io_time);
-					INSTR_TIME_SUBTRACT(io_time, io_start);
-					pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
-					INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
-				}
+				pgstat_count_io_op_time(IOOBJECT_TEMP_RELATION,
+										IOCONTEXT_NORMAL, IOOP_WRITE,
+										io_start);
 
 				pgBufferUsage.local_blks_written++;
 
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index ecccb6c1a9..a70e8dc523 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -114,8 +114,6 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 	LocalBufferLookupEnt *hresult;
 	BufferDesc *bufHdr;
 	int			b;
-	instr_time	io_start,
-				io_time;
 	int			trycounter;
 	bool		found;
 	uint32		buf_state;
@@ -214,6 +212,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 	 */
 	if (buf_state & BM_DIRTY)
 	{
+		instr_time	io_start;
 		SMgrRelation oreln;
 		Page		localpage = (char *) LocalBufHdrGetBlock(bufHdr);
 
@@ -222,10 +221,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 
 		PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
-		if (track_io_timing)
-			INSTR_TIME_SET_CURRENT(io_start);
-		else
-			INSTR_TIME_SET_ZERO(io_start);
+		io_start = pgstat_prepare_io_time();
 
 		/* And write... */
 		smgrwrite(oreln,
@@ -238,17 +234,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 		buf_state &= ~BM_DIRTY;
 		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
 
-		/* Temporary table I/O does not use Buffer Access Strategies */
-		pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
-
-		if (track_io_timing)
-		{
-			INSTR_TIME_SET_CURRENT(io_time);
-			INSTR_TIME_SUBTRACT(io_time, io_start);
-			pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
-			INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
-		}
-
+		pgstat_count_io_op_time(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL,
+								IOOP_WRITE, io_start);
 		pgBufferUsage.local_blks_written++;
 	}
 
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 352958e1fe..51a9134d57 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1030,6 +1030,19 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 
 	if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false /* retryOnError */ ))
 	{
+		instr_time	io_start;
+
+		ereport(DEBUG1,
+				(errmsg_internal("could not forward fsync request because request queue is full")));
+
+		io_start = pgstat_prepare_io_time();
+
+		if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
+			ereport(data_sync_elevel(ERROR),
+					(errcode_for_file_access(),
+					 errmsg("could not fsync file \"%s\": %m",
+							FilePathName(seg->mdfd_vfd))));
+
 		/*
 		 * We have no way of knowing if the current IOContext is
 		 * IOCONTEXT_NORMAL or IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] at this
@@ -1041,16 +1054,8 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 		 * IOCONTEXT_NORMAL is likely clearer when investigating the number of
 		 * backend fsyncs.
 		 */
-		pgstat_count_io_op(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC);
-
-		ereport(DEBUG1,
-				(errmsg_internal("could not forward fsync request because request queue is full")));
-
-		if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
-			ereport(data_sync_elevel(ERROR),
-					(errcode_for_file_access(),
-					 errmsg("could not fsync file \"%s\": %m",
-							FilePathName(seg->mdfd_vfd))));
+		pgstat_count_io_op_time(IOOBJECT_RELATION, IOCONTEXT_NORMAL,
+								IOOP_FSYNC, io_start);
 	}
 }
 
@@ -1399,6 +1404,7 @@ int
 mdsyncfiletag(const FileTag *ftag, char *path)
 {
 	SMgrRelation reln = smgropen(ftag->rlocator, InvalidBackendId);
+	instr_time	io_start;
 	File		file;
 	bool		need_to_close;
 	int			result,
@@ -1425,6 +1431,8 @@ mdsyncfiletag(const FileTag *ftag, char *path)
 		need_to_close = true;
 	}
 
+	io_start = pgstat_prepare_io_time();
+
 	/* Sync the file. */
 	result = FileSync(file, WAIT_EVENT_DATA_FILE_SYNC);
 	save_errno = errno;
@@ -1432,7 +1440,8 @@ mdsyncfiletag(const FileTag *ftag, char *path)
 	if (need_to_close)
 		FileClose(file);
 
-	pgstat_count_io_op(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC);
+	pgstat_count_io_op_time(IOOBJECT_RELATION, IOCONTEXT_NORMAL,
+							IOOP_FSYNC, io_start);
 
 	errno = save_errno;
 	return result;
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index ae8bb34f78..4e98c4749a 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -16,45 +16,65 @@
 
 #include "postgres.h"
 
+#include "executor/instrument.h"
+#include "storage/bufmgr.h"
 #include "utils/pgstat_internal.h"
 
 
-static PgStat_BktypeIO PendingIOStats;
+typedef struct PgStat_PendingIO
+{
+	PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	instr_time	pending_times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+}			PgStat_PendingIO;
+
+
+static PgStat_PendingIO PendingIOStats;
 bool		have_iostats = false;
 
 
 /*
  * Check that stats have not been counted for any combination of IOObject,
- * IOContext, and IOOp which are not tracked for the passed-in BackendType. The
- * passed-in PgStat_BktypeIO must contain stats from the BackendType specified
- * by the second parameter. Caller is responsible for locking the passed-in
- * PgStat_BktypeIO, if needed.
+ * IOContext, and IOOp which are not tracked for the passed-in BackendType. If
+ * the IOOp is not counted for this combination but IO time is otherwise
+ * tracked for this IOOp, check that IO time has not been counted for this
+ * combination. If stats are tracked for this combination and IO times are
+ * non-zero, counts should be non-zero.
+ *
+ * The passed-in PgStat_BktypeIO must contain stats from the BackendType
+ * specified by the second parameter. Caller is responsible for locking the
+ * passed-in PgStat_BktypeIO, if needed.
  */
 bool
 pgstat_bktype_io_stats_valid(PgStat_BktypeIO *backend_io,
 							 BackendType bktype)
 {
-	bool		bktype_tracked = pgstat_tracks_io_bktype(bktype);
-
 	for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
 	{
 		for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
 		{
-			/*
-			 * Don't bother trying to skip to the next loop iteration if
-			 * pgstat_tracks_io_object() would return false here. We still
-			 * need to validate that each counter is zero anyway.
-			 */
 			for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
 			{
-				/* No stats, so nothing to validate */
-				if (backend_io->data[io_object][io_context][io_op] == 0)
+				/* we do track it */
+				if (pgstat_tracks_io_op(bktype, io_object, io_context, io_op))
+				{
+					/* ensure that if IO times are non-zero, counts are > 0 */
+					if (backend_io->times[io_object][io_context][io_op] != 0 &&
+						backend_io->counts[io_object][io_context][io_op] <= 0)
+						return false;
+
 					continue;
+				}
 
-				/* There are stats and there shouldn't be */
-				if (!bktype_tracked ||
-					!pgstat_tracks_io_op(bktype, io_object, io_context, io_op))
+				/* we don't track it, and it is not 0 */
+				if (backend_io->counts[io_object][io_context][io_op] != 0)
 					return false;
+
+				/* we don't track this IOOp, so make sure its IO time is zero */
+				if (pgstat_tracks_io_time(io_op) > -1)
+				{
+					if (backend_io->times[io_object][io_context][io_op] != 0)
+						return false;
+				}
 			}
 		}
 	}
@@ -62,6 +82,19 @@ pgstat_bktype_io_stats_valid(PgStat_BktypeIO *backend_io,
 	return true;
 }
 
+instr_time
+pgstat_prepare_io_time(void)
+{
+	instr_time	io_start;
+
+	if (track_io_timing)
+		INSTR_TIME_SET_CURRENT(io_start);
+	else
+		INSTR_TIME_SET_ZERO(io_start);
+
+	return io_start;
+}
+
 void
 pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op)
 {
@@ -70,11 +103,44 @@ pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op)
 	Assert((unsigned int) io_op < IOOP_NUM_TYPES);
 	Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
 
-	PendingIOStats.data[io_object][io_context][io_op]++;
+	PendingIOStats.counts[io_object][io_context][io_op]++;
 
 	have_iostats = true;
 }
 
+/*
+ * Like pgstat_count_io_op() except it also accumulates time.
+ */
+void
+pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
+						instr_time start_time)
+{
+	if (track_io_timing)
+	{
+		instr_time	io_time;
+
+		INSTR_TIME_SET_CURRENT(io_time);
+		INSTR_TIME_SUBTRACT(io_time, start_time);
+
+		if (io_op == IOOP_WRITE)
+		{
+			pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
+			if (io_object == IOOBJECT_RELATION)
+				INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
+		}
+		else if (io_op == IOOP_READ)
+		{
+			pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
+			if (io_object == IOOBJECT_RELATION)
+				INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
+		}
+
+		INSTR_TIME_ADD(PendingIOStats.pending_times[io_object][io_context][io_op], io_time);
+	}
+
+	pgstat_count_io_op(io_object, io_context, io_op);
+}
+
 PgStat_IO *
 pgstat_fetch_stat_io(void)
 {
@@ -114,8 +180,17 @@ pgstat_flush_io(bool nowait)
 		for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
 		{
 			for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
-				bktype_shstats->data[io_object][io_context][io_op] +=
-					PendingIOStats.data[io_object][io_context][io_op];
+			{
+				instr_time	time;
+
+				bktype_shstats->counts[io_object][io_context][io_op] +=
+					PendingIOStats.counts[io_object][io_context][io_op];
+
+				time = PendingIOStats.pending_times[io_object][io_context][io_op];
+
+				bktype_shstats->times[io_object][io_context][io_op] +=
+					INSTR_TIME_GET_MICROSEC(time);
+			}
 		}
 	}
 
@@ -384,3 +459,31 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 
 	return true;
 }
+
+/*
+ * PgStat_BktypeIO->times contains IO times for IOOps. For simplicity this
+ * array has a spot for every IOOp. pgstat_tracks_io_time() is the source of
+ * truth for which IOOps have corresponding IO times.
+ */
+IOOp
+pgstat_tracks_io_time(IOOp io_op)
+{
+	switch (io_op)
+	{
+		case IOOP_READ:
+			return IOOP_READ;
+		case IOOP_WRITE:
+			return IOOP_WRITE;
+		case IOOP_EXTEND:
+			return IOOP_EXTEND;
+		case IOOP_FSYNC:
+			return IOOP_FSYNC;
+		case IOOP_EVICT:
+		case IOOP_HIT:
+		case IOOP_REUSE:
+			return -1;
+	}
+
+	elog(ERROR, "unrecognized IOOp value: %d", io_op);
+	pg_unreachable();
+}
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index eec9f3cf9b..60382ac76a 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1256,13 +1256,17 @@ typedef enum io_stat_col
 	IO_COL_IO_OBJECT,
 	IO_COL_IO_CONTEXT,
 	IO_COL_READS,
+	IO_COL_READ_TIME,
 	IO_COL_WRITES,
+	IO_COL_WRITE_TIME,
 	IO_COL_EXTENDS,
+	IO_COL_EXTEND_TIME,
 	IO_COL_CONVERSION,
 	IO_COL_HITS,
 	IO_COL_EVICTIONS,
 	IO_COL_REUSES,
 	IO_COL_FSYNCS,
+	IO_COL_FSYNC_TIME,
 	IO_COL_RESET_TIME,
 	IO_NUM_COLUMNS,
 } io_stat_col;
@@ -1296,6 +1300,28 @@ pgstat_get_io_op_index(IOOp io_op)
 	pg_unreachable();
 }
 
+/*
+ * Get the number of the column containing IO times for the specified IOOp. If
+ * the specified IOOp is one for which IO time is not tracked, return -1. Note
+ * that this function assumes that IO time for an IOOp is displayed in the view
+ * in the column directly after the IOOp counts.
+ */
+static io_stat_col
+pgstat_get_io_time_index(IOOp io_op)
+{
+	if (pgstat_tracks_io_time(io_op) == -1)
+		return -1;
+
+	return pgstat_get_io_op_index(io_op) + 1;
+}
+
+static inline
+PgStat_Counter
+pg_stat_micro_to_millisecs(PgStat_Counter val_microsec)
+{
+	return val_microsec * 0.001;
+}
+
 Datum
 pg_stat_get_io(PG_FUNCTION_ARGS)
 {
@@ -1363,20 +1389,32 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
 
 				for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
 				{
-					int			col_idx = pgstat_get_io_op_index(io_op);
+					PgStat_Counter count = bktype_stats->counts[io_obj][io_context][io_op];
+					int			i = pgstat_get_io_op_index(io_op);
 
 					/*
 					 * Some combinations of BackendType and IOOp, of IOContext
 					 * and IOOp, and of IOObject and IOOp are not tracked. Set
 					 * these cells in the view NULL.
 					 */
-					nulls[col_idx] = !pgstat_tracks_io_op(bktype, io_obj, io_context, io_op);
+					if (pgstat_tracks_io_op(bktype, io_obj, io_context, io_op))
+						values[i] = Int64GetDatum(count);
+					else
+						nulls[i] = true;
+				}
 
-					if (nulls[col_idx])
+				for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+				{
+					PgStat_Counter time = bktype_stats->times[io_obj][io_context][io_op];
+					int			i = pgstat_get_io_time_index(io_op);
+
+					if (i == -1)
 						continue;
 
-					values[col_idx] =
-						Int64GetDatum(bktype_stats->data[io_obj][io_context][io_op]);
+					if (!nulls[pgstat_get_io_op_index(io_op)])
+						values[i] = Float8GetDatum(pg_stat_micro_to_millisecs(time));
+					else
+						nulls[i] = true;
 				}
 
 				tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index f9f2642201..664a7ab41d 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5749,9 +5749,9 @@
   proname => 'pg_stat_get_io', provolatile => 'v',
   prorows => '30', proretset => 't',
   proparallel => 'r', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{text,text,text,int8,int8,int8,int8,int8,int8,int8,int8,timestamptz}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{backend_type,io_object,io_context,reads,writes,extends,op_bytes,hits,evictions,reuses,fsyncs,stats_reset}',
+  proallargtypes => '{text,text,text,int8,float8,int8,float8,int8,float8,int8,int8,int8,int8,int8,float8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{backend_type,io_object,io_context,reads,read_time,writes,write_time,extends,extend_time,op_bytes,hits,evictions,reuses,fsyncs,fsync_time,stats_reset}',
   prosrc => 'pg_stat_get_io' },
 
 { oid => '1136', descr => 'statistics: information about WAL activity',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 75d258d921..bf54c6defe 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -306,7 +306,8 @@ typedef enum IOOp
 
 typedef struct PgStat_BktypeIO
 {
-	PgStat_Counter data[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	PgStat_Counter times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 } PgStat_BktypeIO;
 
 typedef struct PgStat_IO
@@ -515,7 +516,10 @@ extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
 
 extern bool pgstat_bktype_io_stats_valid(PgStat_BktypeIO *context_ops,
 										 BackendType bktype);
+extern instr_time pgstat_prepare_io_time(void);
 extern void pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op);
+extern void pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op, instr_time time);
+
 extern PgStat_IO *pgstat_fetch_stat_io(void);
 extern const char *pgstat_get_io_context_name(IOContext io_context);
 extern const char *pgstat_get_io_object_name(IOObject io_object);
@@ -525,6 +529,7 @@ extern bool pgstat_tracks_io_object(BackendType bktype,
 									IOObject io_object, IOContext io_context);
 extern bool pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 								IOContext io_context, IOOp io_op);
+extern IOOp pgstat_tracks_io_time(IOOp io_op);
 
 
 /*
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index ab1aebfde4..d939d8067e 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1881,15 +1881,19 @@ pg_stat_io| SELECT backend_type,
     io_object,
     io_context,
     reads,
+    read_time,
     writes,
+    write_time,
     extends,
+    extend_time,
     op_bytes,
     hits,
     evictions,
     reuses,
     fsyncs,
+    fsync_time,
     stats_reset
-   FROM pg_stat_get_io() b(backend_type, io_object, io_context, reads, writes, extends, op_bytes, hits, evictions, reuses, fsyncs, stats_reset);
+   FROM pg_stat_get_io() b(backend_type, io_object, io_context, reads, read_time, writes, write_time, extends, extend_time, op_bytes, hits, evictions, reuses, fsyncs, fsync_time, stats_reset);
 pg_stat_progress_analyze| SELECT s.pid,
     s.datid,
     d.datname,
-- 
2.37.2

v8-0002-FlushRelationBuffers-counts-temp-relation-IO-timi.patch (text/x-patch; charset=US-ASCII)
From f4e0db5c833f33b30d4c0b4bebec1096a1745d81 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 21 Mar 2023 18:20:44 -0400
Subject: [PATCH v8 2/4] FlushRelationBuffers() counts temp relation IO timing

Add pgstat_database and pgBufferUsage IO timing counting to
FlushRelationBuffers() for writes of temporary relations.
---
 src/backend/storage/buffer/bufmgr.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index b3adbbe7d2..05e98d5994 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3571,6 +3571,8 @@ FlushRelationBuffers(Relation rel)
 {
 	int			i;
 	BufferDesc *bufHdr;
+	instr_time	io_start,
+				io_time;
 
 	if (RelationUsesLocalBuffers(rel))
 	{
@@ -3596,17 +3598,33 @@ FlushRelationBuffers(Relation rel)
 
 				PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
+				if (track_io_timing)
+					INSTR_TIME_SET_CURRENT(io_start);
+				else
+					INSTR_TIME_SET_ZERO(io_start);
+
 				smgrwrite(RelationGetSmgr(rel),
 						  BufTagGetForkNum(&bufHdr->tag),
 						  bufHdr->tag.blockNum,
 						  localpage,
 						  false);
 
+
 				buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
 				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
 
 				pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
 
+				if (track_io_timing)
+				{
+					INSTR_TIME_SET_CURRENT(io_time);
+					INSTR_TIME_SUBTRACT(io_time, io_start);
+					pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
+					INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
+				}
+
+				pgBufferUsage.local_blks_written++;
+
 				/* Pop the error context stack */
 				error_context_stack = errcallback.previous;
 			}
-- 
2.37.2

#18Andres Freund
andres@anarazel.de
In reply to: Melanie Plageman (#17)
Re: Track IO times in pg_stat_io

Hi,

On 2023-03-31 15:44:58 -0400, Melanie Plageman wrote:

From 789d4bf1fb749a26523dbcd2c69795916b711c68 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 21 Mar 2023 16:00:55 -0400
Subject: [PATCH v8 1/4] Count IO time for temp relation writes

Both pgstat_database and pgBufferUsage write times failed to count
timing for flushes of dirty local buffers when acquiring a new local
buffer for a temporary relation block.

I think it'd be worth mentioning here that we do count read time? Otherwise
it'd not be as clear that adding tracking increases consistency...

From f4e0db5c833f33b30d4c0b4bebec1096a1745d81 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 21 Mar 2023 18:20:44 -0400
Subject: [PATCH v8 2/4] FlushRelationBuffers() counts temp relation IO timing

Add pgstat_database and pgBufferUsage IO timing counting to
FlushRelationBuffers() for writes of temporary relations.
---
src/backend/storage/buffer/bufmgr.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index b3adbbe7d2..05e98d5994 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3571,6 +3571,8 @@ FlushRelationBuffers(Relation rel)
{
int			i;
BufferDesc *bufHdr;
+	instr_time	io_start,
+				io_time;

if (RelationUsesLocalBuffers(rel))
{
@@ -3596,17 +3598,33 @@ FlushRelationBuffers(Relation rel)

PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);

+				if (track_io_timing)
+					INSTR_TIME_SET_CURRENT(io_start);
+				else
+					INSTR_TIME_SET_ZERO(io_start);
+
smgrwrite(RelationGetSmgr(rel),
BufTagGetForkNum(&bufHdr->tag),
bufHdr->tag.blockNum,
localpage,
false);

+

Spurious newline.

buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);

pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);

+				if (track_io_timing)
+				{
+					INSTR_TIME_SET_CURRENT(io_time);
+					INSTR_TIME_SUBTRACT(io_time, io_start);
+					pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
+					INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
+				}
+
+				pgBufferUsage.local_blks_written++;
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
}
--
2.37.2

From 2bdad725133395ded199ecc726096e052d6e654b Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Fri, 31 Mar 2023 15:32:36 -0400
Subject: [PATCH v8 3/4] Track IO times in pg_stat_io

Add IO timing for reads, writes, extends, and fsyncs to pg_stat_io.

Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: /messages/by-id/CAAKRu_ay5iKmnbXZ3DsauViF3eMxu4m1oNnJXqV_HyqYeg55Ww@mail.gmail.com
---

-static PgStat_BktypeIO PendingIOStats;
+typedef struct PgStat_PendingIO
+{
+	PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	instr_time	pending_times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+}			PgStat_PendingIO;

Probably will look less awful after adding the typedef to typedefs.list.

+				/* we do track it */
+				if (pgstat_tracks_io_op(bktype, io_object, io_context, io_op))
+				{
+					/* ensure that if IO times are non-zero, counts are > 0 */
+					if (backend_io->times[io_object][io_context][io_op] != 0 &&
+						backend_io->counts[io_object][io_context][io_op] <= 0)
+						return false;
+
continue;
+				}
-				/* There are stats and there shouldn't be */
-				if (!bktype_tracked ||
-					!pgstat_tracks_io_op(bktype, io_object, io_context, io_op))
+				/* we don't track it, and it is not 0 */
+				if (backend_io->counts[io_object][io_context][io_op] != 0)
return false;
+
+				/* we don't track this IOOp, so make sure its IO time is zero */
+				if (pgstat_tracks_io_time(io_op) > -1)
+				{
+					if (backend_io->times[io_object][io_context][io_op] != 0)
+						return false;
+				}

I'm somehow doubtful it's worth having pgstat_tracks_io_time, what kind of
error would be caught by this check?

+/*
+ * Get the number of the column containing IO times for the specified IOOp. If
+ * the specified IOOp is one for which IO time is not tracked, return -1. Note
+ * that this function assumes that IO time for an IOOp is displayed in the view
+ * in the column directly after the IOOp counts.
+ */
+static io_stat_col
+pgstat_get_io_time_index(IOOp io_op)
+{
+	if (pgstat_tracks_io_time(io_op) == -1)
+		return -1;

That seems dangerous - won't it just lead to accessing something from before
the start of the array? Probably should just assert.
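
One way to read the "just assert" suggestion, as a minimal sketch: the time-column lookup asserts that the IOOp has a time column instead of returning -1, and callers are expected to skip untimed IOOps before calling it. The enums and the names sketch_op_has_time(), sketch_count_col() and sketch_time_col() are hypothetical stand-ins for illustration, not the patch's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the patch's IOOp and io_stat_col enums. */
typedef enum
{
	SK_OP_EVICT, SK_OP_EXTEND, SK_OP_FSYNC, SK_OP_HIT,
	SK_OP_READ, SK_OP_REUSE, SK_OP_WRITE
} SketchIOOp;

typedef enum
{
	SK_COL_READS, SK_COL_READ_TIME,
	SK_COL_WRITES, SK_COL_WRITE_TIME,
	SK_COL_EXTENDS, SK_COL_EXTEND_TIME,
	SK_COL_HITS, SK_COL_EVICTIONS, SK_COL_REUSES,
	SK_COL_FSYNCS, SK_COL_FSYNC_TIME,
	SK_NUM_COLS
} SketchCol;

/* True only for the ops whose IO time is tracked in this sketch. */
static bool
sketch_op_has_time(SketchIOOp op)
{
	return op == SK_OP_READ || op == SK_OP_WRITE ||
		op == SK_OP_EXTEND || op == SK_OP_FSYNC;
}

/* Map an op to the column holding its count. */
static SketchCol
sketch_count_col(SketchIOOp op)
{
	switch (op)
	{
		case SK_OP_EVICT: return SK_COL_EVICTIONS;
		case SK_OP_EXTEND: return SK_COL_EXTENDS;
		case SK_OP_FSYNC: return SK_COL_FSYNCS;
		case SK_OP_HIT: return SK_COL_HITS;
		case SK_OP_READ: return SK_COL_READS;
		case SK_OP_REUSE: return SK_COL_REUSES;
		case SK_OP_WRITE: return SK_COL_WRITES;
	}
	assert(false);				/* unrecognized op */
	return SK_NUM_COLS;
}

/*
 * The time column directly follows the count column. Asking for the time
 * column of an untimed op is a programming error, so assert rather than
 * hand back an out-of-range index.
 */
static SketchCol
sketch_time_col(SketchIOOp op)
{
	assert(sketch_op_has_time(op));
	return (SketchCol) (sketch_count_col(op) + 1);
}
```

With this shape a caller would test sketch_op_has_time() first and only then index the values array, so no negative index can reach an array access.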

@@ -1363,20 +1389,32 @@ pg_stat_get_io(PG_FUNCTION_ARGS)

for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
{
-					int			col_idx = pgstat_get_io_op_index(io_op);
+					PgStat_Counter count = bktype_stats->counts[io_obj][io_context][io_op];
+					int			i = pgstat_get_io_op_index(io_op);
/*
* Some combinations of BackendType and IOOp, of IOContext
* and IOOp, and of IOObject and IOOp are not tracked. Set
* these cells in the view NULL.
*/
-					nulls[col_idx] = !pgstat_tracks_io_op(bktype, io_obj, io_context, io_op);
+					if (pgstat_tracks_io_op(bktype, io_obj, io_context, io_op))
+						values[i] = Int64GetDatum(count);
+					else
+						nulls[i] = true;
+				}
-					if (nulls[col_idx])
+				for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+				{
+					PgStat_Counter time = bktype_stats->times[io_obj][io_context][io_op];
+					int			i = pgstat_get_io_time_index(io_op);
+
+					if (i == -1)
continue;
-					values[col_idx] =
-						Int64GetDatum(bktype_stats->data[io_obj][io_context][io_op]);
+					if (!nulls[pgstat_get_io_op_index(io_op)])
+						values[i] = Float8GetDatum(pg_stat_micro_to_millisecs(time));
+					else
+						nulls[i] = true;
}

Why two loops?
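
For illustration, a single-loop version might look roughly like the sketch below. It assumes a simplified layout in which each timed op's time column directly follows its count column, and it uses plain doubles where the real function builds Datums; SketchStats, count_col, op_timed and fill_row() are all hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NUM_OPS	3
#define SKETCH_NUM_COLS 5

typedef struct
{
	int64_t		counts[SKETCH_NUM_OPS];
	int64_t		times_us[SKETCH_NUM_OPS];	/* microseconds */
} SketchStats;

/* Hypothetical ops: 0 = read, 1 = write, 2 = hit (hit has no time column). */
static const int count_col[SKETCH_NUM_OPS] = {0, 2, 4};
static const bool op_timed[SKETCH_NUM_OPS] = {true, true, false};

/* Fill the count and, where applicable, the time cell for each op in one pass. */
static void
fill_row(const SketchStats *stats, const bool tracked[SKETCH_NUM_OPS],
		 double values[SKETCH_NUM_COLS], bool nulls[SKETCH_NUM_COLS])
{
	for (int op = 0; op < SKETCH_NUM_OPS; op++)
	{
		int			col = count_col[op];

		if (tracked[op])
			values[col] = (double) stats->counts[op];
		else
			nulls[col] = true;

		/* ops without a time column need nothing further */
		if (!op_timed[op])
			continue;

		assert(col + 1 < SKETCH_NUM_COLS);

		/* an untracked op gets NULL in both its count and its time cell */
		if (tracked[op])
			values[col + 1] = stats->times_us[op] / 1000.0; /* us -> ms */
		else
			nulls[col + 1] = true;
	}
}
```

Because the tracked/untracked decision is already at hand when the count cell is filled, the time cell can reuse it without re-reading the nulls array.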

Greetings,

Andres Freund

#19Melanie Plageman
melanieplageman@gmail.com
In reply to: Andres Freund (#18)
4 attachment(s)
Re: Track IO times in pg_stat_io

Attached v9 addresses review feedback and resolves merge conflicts with
the recent relation extension patchset.

I've changed pgstat_count_io_op_time() to take a count and call
pgstat_count_io_op_n() so it can be used with smgrzeroextend(). I do
wish that the parameter to pgstat_count_io_op_n() was called "count" and
not "cnt"...
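
The shape of that change, as a rough sketch only: the timing helper takes the number of operations and delegates the counting to an "_n" variant, so one timed call can cover many blocks. Everything here is a simplified stand-in: the sketch_ names, the single global counter (the real code indexes by IOObject, IOContext and IOOp), and timespec_get() in place of INSTR_TIME_SET_CURRENT() are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

static bool sketch_track_io_timing = true;	/* stand-in for track_io_timing */
static uint64_t pending_count = 0;
static int64_t pending_time_ns = 0;

/* Stand-in clock; the real code uses the instr_time macros. */
static int64_t
sketch_now_ns(void)
{
	struct timespec ts;

	timespec_get(&ts, TIME_UTC);
	return (int64_t) ts.tv_sec * 1000000000 + ts.tv_nsec;
}

/* Returns a start time, or zero when timing is disabled. */
static int64_t
sketch_prepare_io_time(void)
{
	return sketch_track_io_timing ? sketch_now_ns() : 0;
}

/* Count cnt operations at once. */
static void
sketch_count_io_op_n(uint32_t cnt)
{
	assert(cnt > 0);			/* sketch-only sanity check */
	pending_count += cnt;
}

/* Accumulate elapsed time (if enabled), then count all cnt operations. */
static void
sketch_count_io_op_time(int64_t start_ns, uint32_t cnt)
{
	if (sketch_track_io_timing)
		pending_time_ns += sketch_now_ns() - start_ns;

	sketch_count_io_op_n(cnt);
}
```

A caller extending a relation by several blocks in one call would take the start time once, perform the extension, and then pass the number of blocks as cnt.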

I've also reordered the call site of pgstat_count_io_op_time() in a few
locations, but I have some questions about this.

Before, I didn't think it mattered much that we didn't finish counting
IO time until after setting BM_VALID or BM_DIRTY and unsetting
BM_IO_IN_PROGRESS. With the relation extension code doing this for many
buffers at once, though, I wondered whether this would make the IO
timing too inaccurate.

As such, I've moved pgstat_count_io_op_time() to before we set those
flags in all locations. I did wonder, though, whether it is bad to
prolong the time the buffer stays pinned without those flags set.

On Tue, Apr 4, 2023 at 8:59 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2023-03-31 15:44:58 -0400, Melanie Plageman wrote:

From 789d4bf1fb749a26523dbcd2c69795916b711c68 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 21 Mar 2023 16:00:55 -0400
Subject: [PATCH v8 1/4] Count IO time for temp relation writes

Both pgstat_database and pgBufferUsage write times failed to count
timing for flushes of dirty local buffers when acquiring a new local
buffer for a temporary relation block.

I think it'd be worth mentioning here that we do count read time? Otherwise
it'd not be as clear that adding tracking increases consistency...

Done

From f4e0db5c833f33b30d4c0b4bebec1096a1745d81 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 21 Mar 2023 18:20:44 -0400
Subject: [PATCH v8 2/4] FlushRelationBuffers() counts temp relation IO timing

Add pgstat_database and pgBufferUsage IO timing counting to
FlushRelationBuffers() for writes of temporary relations.
---
src/backend/storage/buffer/bufmgr.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index b3adbbe7d2..05e98d5994 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3571,6 +3571,8 @@ FlushRelationBuffers(Relation rel)
{
int                     i;
BufferDesc *bufHdr;
+     instr_time      io_start,
+                             io_time;

if (RelationUsesLocalBuffers(rel))
{
@@ -3596,17 +3598,33 @@ FlushRelationBuffers(Relation rel)

PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);

+                             if (track_io_timing)
+                                     INSTR_TIME_SET_CURRENT(io_start);
+                             else
+                                     INSTR_TIME_SET_ZERO(io_start);
+
smgrwrite(RelationGetSmgr(rel),
BufTagGetForkNum(&bufHdr->tag),
bufHdr->tag.blockNum,
localpage,
false);

+

Spurious newline.

Fixed.

From 2bdad725133395ded199ecc726096e052d6e654b Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Fri, 31 Mar 2023 15:32:36 -0400
Subject: [PATCH v8 3/4] Track IO times in pg_stat_io

Add IO timing for reads, writes, extends, and fsyncs to pg_stat_io.

Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: /messages/by-id/CAAKRu_ay5iKmnbXZ3DsauViF3eMxu4m1oNnJXqV_HyqYeg55Ww@mail.gmail.com
---

-static PgStat_BktypeIO PendingIOStats;
+typedef struct PgStat_PendingIO
+{
+     PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+     instr_time      pending_times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+}                    PgStat_PendingIO;

Probably will look less awful after adding the typedef to typedefs.list.

Done.
One day I will remember to add things to typedefs.list.

+                             /* we do track it */
+                             if (pgstat_tracks_io_op(bktype, io_object, io_context, io_op))
+                             {
+                                     /* ensure that if IO times are non-zero, counts are > 0 */
+                                     if (backend_io->times[io_object][io_context][io_op] != 0 &&
+                                             backend_io->counts[io_object][io_context][io_op] <= 0)
+                                             return false;
+
continue;
+                             }
-                             /* There are stats and there shouldn't be */
-                             if (!bktype_tracked ||
-                                     !pgstat_tracks_io_op(bktype, io_object, io_context, io_op))
+                             /* we don't track it, and it is not 0 */
+                             if (backend_io->counts[io_object][io_context][io_op] != 0)
return false;
+
+                             /* we don't track this IOOp, so make sure its IO time is zero */
+                             if (pgstat_tracks_io_time(io_op) > -1)
+                             {
+                                     if (backend_io->times[io_object][io_context][io_op] != 0)
+                                             return false;
+                             }

I'm somehow doubtful it's worth having pgstat_tracks_io_time, what kind of
error would be caught by this check?

Yea, now that the function to count IO timing also increments the count,
I don't think this can happen.
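
To illustrate why a non-zero time now implies a non-zero count, here is a minimal standalone sketch of the "prepare, then count with time" pattern. It is not code from the patch: every name prefixed with sketch_ is hypothetical, the flag stands in for track_io_timing, and timespec_get() (wall clock) is used only to keep the example portable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the track_io_timing setting. */
static bool sketch_track_io_timing = false;

/* Hypothetical pending counters for a single kind of IO operation. */
static int64_t sketch_pending_count = 0;
static int64_t sketch_pending_time_us = 0;

/* Current wall-clock time in microseconds. */
static int64_t
sketch_now_us(void)
{
	struct timespec ts;

	timespec_get(&ts, TIME_UTC);
	return (int64_t) ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
}

/* Take a start timestamp only when timing is enabled, otherwise zero. */
static int64_t
sketch_prepare_io_time(void)
{
	return sketch_track_io_timing ? sketch_now_us() : 0;
}

/*
 * Accumulate elapsed time (if timing is enabled) and always bump the count.
 * Because both happen in one call, time cannot be recorded without a count.
 */
static void
sketch_count_io_op_time(int64_t start_us, uint32_t cnt)
{
	assert(cnt > 0);

	if (sketch_track_io_timing)
		sketch_pending_time_us += sketch_now_us() - start_us;

	sketch_pending_count += cnt;
}
```

A caller would take the timestamp right before the IO call and pass it to the counting function right after, so there is no code path that adds time without also adding to the count.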

However, pgstat_tracks_io_time() is useful in its other call site in
pgstatfuncs which lets us continue in the loop if we don't need to fill
in that IO time. Perhaps it could be replaced with an if (io_op ==
IOOP_EVICT || io_op == IOOP_REUSE ... but I kind of like the function?
But, maybe it is overkill...

For now, I've moved pgstat_tracks_io_time() into pgstatfuncs.c as a
helper.
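
For comparison, the same "source of truth" could be expressed as a plain boolean predicate instead of returning an IOOp or -1. This is only a standalone sketch under assumptions: the enum below is a hypothetical stand-in for IOOp, and sketch_tracks_io_time() is not a function in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the IOOp enum; the ordering is illustrative. */
typedef enum SketchIOOp
{
	SKETCH_IOOP_EVICT,
	SKETCH_IOOP_EXTEND,
	SKETCH_IOOP_FSYNC,
	SKETCH_IOOP_HIT,
	SKETCH_IOOP_READ,
	SKETCH_IOOP_REUSE,
	SKETCH_IOOP_WRITE,
	SKETCH_IOOP_NUM_TYPES
} SketchIOOp;

/* Return true only for operations that have an IO time column. */
static bool
sketch_tracks_io_time(SketchIOOp io_op)
{
	assert((unsigned int) io_op < SKETCH_IOOP_NUM_TYPES);

	switch (io_op)
	{
		case SKETCH_IOOP_READ:
		case SKETCH_IOOP_WRITE:
		case SKETCH_IOOP_EXTEND:
		case SKETCH_IOOP_FSYNC:
			return true;
		case SKETCH_IOOP_EVICT:
		case SKETCH_IOOP_HIT:
		case SKETCH_IOOP_REUSE:
			return false;
		case SKETCH_IOOP_NUM_TYPES:
			break;
	}

	return false;
}
```

Listing every enum member in the switch means a compiler with enum-switch warnings enabled can flag a newly added operation that has not been classified yet.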

+/*
+ * Get the number of the column containing IO times for the specified IOOp. If
+ * the specified IOOp is one for which IO time is not tracked, return -1. Note
+ * that this function assumes that IO time for an IOOp is displayed in the view
+ * in the column directly after the IOOp counts.
+ */
+static io_stat_col
+pgstat_get_io_time_index(IOOp io_op)
+{
+     if (pgstat_tracks_io_time(io_op) == -1)
+             return -1;

That seems dangerous - won't it just lead to accessing something from before
the start of the array? Probably should just assert.

Yea. I've removed it entirely as the passed-in io_op can't be negative
(unless we change the enum values) and we add one to it before
returning.
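
The "time column directly follows the count column" assumption could also be checked at compile time. The following is a standalone sketch, not part of the patch: the SKETCH_COL_ enum is a hypothetical, shortened stand-in for io_stat_col, and it relies on C11 static_assert.

```c
#include <assert.h>

/* Hypothetical, shortened stand-in for the view's column enum. */
typedef enum SketchCol
{
	SKETCH_COL_READS,
	SKETCH_COL_READ_TIME,
	SKETCH_COL_WRITES,
	SKETCH_COL_WRITE_TIME,
	SKETCH_COL_EXTENDS,
	SKETCH_COL_EXTEND_TIME,
	SKETCH_COL_FSYNCS,
	SKETCH_COL_FSYNC_TIME,
	SKETCH_NUM_COLS
} SketchCol;

/* Fail the build if someone reorders the columns and breaks the assumption. */
static_assert(SKETCH_COL_READ_TIME == SKETCH_COL_READS + 1,
			  "read_time must directly follow reads");
static_assert(SKETCH_COL_WRITE_TIME == SKETCH_COL_WRITES + 1,
			  "write_time must directly follow writes");
static_assert(SKETCH_COL_EXTEND_TIME == SKETCH_COL_EXTENDS + 1,
			  "extend_time must directly follow extends");
static_assert(SKETCH_COL_FSYNC_TIME == SKETCH_COL_FSYNCS + 1,
			  "fsync_time must directly follow fsyncs");

/* Map a count column to its time column under the adjacency assumption. */
static SketchCol
sketch_time_col(SketchCol count_col)
{
	return (SketchCol) (count_col + 1);
}
```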

@@ -1363,20 +1389,32 @@ pg_stat_get_io(PG_FUNCTION_ARGS)

for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
{
-                                     int                     col_idx = pgstat_get_io_op_index(io_op);
+                                     PgStat_Counter count = bktype_stats->counts[io_obj][io_context][io_op];
+                                     int                     i = pgstat_get_io_op_index(io_op);
/*
* Some combinations of BackendType and IOOp, of IOContext
* and IOOp, and of IOObject and IOOp are not tracked. Set
* these cells in the view NULL.
*/
-                                     nulls[col_idx] = !pgstat_tracks_io_op(bktype, io_obj, io_context, io_op);
+                                     if (pgstat_tracks_io_op(bktype, io_obj, io_context, io_op))
+                                             values[i] = Int64GetDatum(count);
+                                     else
+                                             nulls[i] = true;
+                             }
-                                     if (nulls[col_idx])
+                             for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+                             {
+                                     PgStat_Counter time = bktype_stats->times[io_obj][io_context][io_op];
+                                     int                     i = pgstat_get_io_time_index(io_op);
+
+                                     if (i == -1)
continue;
-                                     values[col_idx] =
-                                             Int64GetDatum(bktype_stats->data[io_obj][io_context][io_op]);
+                                     if (!nulls[pgstat_get_io_op_index(io_op)])
+                                             values[i] = Float8GetDatum(pg_stat_micro_to_millisecs(time));
+                                     else
+                                             nulls[i] = true;
}

Why two loops?

Well, it was a stylistic choice that I now realize is actually
confusing.
I consolidated them.
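
Roughly, the consolidated shape is one pass that fills both the count cell and the time cell for each operation. The sketch below is standalone and simplified, not the code in pg_stat_get_io(): the SketchRow struct, the three-operation array size, and all sketch_ names are assumptions, and an operation without a time column is modelled here as a NULL time cell rather than a skipped column.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NUM_OPS 3

/* Hypothetical output row: one count cell and one time cell per operation. */
typedef struct SketchRow
{
	int64_t		counts[SKETCH_NUM_OPS];
	double		times_ms[SKETCH_NUM_OPS];
	bool		count_null[SKETCH_NUM_OPS];
	bool		time_null[SKETCH_NUM_OPS];
} SketchRow;

/* Convert microseconds to milliseconds, keeping the fractional part. */
static double
sketch_us_to_ms(int64_t val_us)
{
	return (double) val_us * 0.001;
}

/* Fill count and time cells for every operation in a single loop. */
static void
sketch_fill_row(const int64_t *counts, const int64_t *times_us,
				const bool *op_tracked, const bool *time_tracked,
				SketchRow *row)
{
	assert(row != NULL);

	for (int op = 0; op < SKETCH_NUM_OPS; op++)
	{
		/* Untracked operations show up as NULL counts. */
		row->count_null[op] = !op_tracked[op];
		row->counts[op] = op_tracked[op] ? counts[op] : 0;

		/* No time cell if the op is untracked or has no time column. */
		row->time_null[op] = !op_tracked[op] || !time_tracked[op];
		row->times_ms[op] = row->time_null[op] ?
			0.0 : sketch_us_to_ms(times_us[op]);
	}
}
```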

- Melanie

Attachments:

v9-0004-pgstat_database-uses-pgstat_io-time-counters.patch
From 33abf29c4673175c5a71ce01b3a8d3bf961e268a Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 21 Mar 2023 20:36:10 -0400
Subject: [PATCH v9 4/4] pgstat_database uses pgstat_io time counters

Use pgstat_io's pending counters to increment pgStatBlockWriteTime and
pgStatBlockReadTime.

Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/20230321023451.7rzy4kjj2iktrg2r%40awork3.anarazel.de
---
 src/backend/utils/activity/pgstat_io.c | 14 ++++++++------
 src/include/pgstat.h                   |  4 ----
 2 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index bc6787b3f3..7e7d7795fb 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -121,13 +121,11 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 
 		if (io_op == IOOP_WRITE)
 		{
-			pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
 			if (io_object == IOOBJECT_RELATION)
 				INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
 		}
 		else if (io_op == IOOP_READ)
 		{
-			pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
 			if (io_object == IOOBJECT_RELATION)
 				INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
 		}
@@ -178,15 +176,19 @@ pgstat_flush_io(bool nowait)
 		{
 			for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
 			{
-				instr_time	time;
+				PgStat_Counter time;
 
 				bktype_shstats->counts[io_object][io_context][io_op] +=
 					PendingIOStats.counts[io_object][io_context][io_op];
 
-				time = PendingIOStats.pending_times[io_object][io_context][io_op];
+				time = INSTR_TIME_GET_MICROSEC(PendingIOStats.pending_times[io_object][io_context][io_op]);
 
-				bktype_shstats->times[io_object][io_context][io_op] +=
-					INSTR_TIME_GET_MICROSEC(time);
+				bktype_shstats->times[io_object][io_context][io_op] += time;
+
+				if (io_op == IOOP_WRITE)
+					pgStatBlockWriteTime += time;
+				else if (io_op == IOOP_READ)
+					pgStatBlockReadTime += time;
 			}
 		}
 	}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 6f9ad0cf31..1d5aeb7dea 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -544,10 +544,6 @@ extern void pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount);
 extern void pgstat_report_checksum_failure(void);
 extern void pgstat_report_connect(Oid dboid);
 
-#define pgstat_count_buffer_read_time(n)							\
-	(pgStatBlockReadTime += (n))
-#define pgstat_count_buffer_write_time(n)							\
-	(pgStatBlockWriteTime += (n))
 #define pgstat_count_conn_active_time(n)							\
 	(pgStatActiveTime += (n))
 #define pgstat_count_conn_txn_idle_time(n)							\
-- 
2.37.2

v9-0001-Count-IO-time-for-temp-relation-writes.patch
From d17524cdeec3297eb7d08876fd7acc41e0f0f1b5 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 21 Mar 2023 16:00:55 -0400
Subject: [PATCH v9 1/4] Count IO time for temp relation writes

Both pgstat_database and pgBufferUsage count IO timing for reads of
temporary relation blocks into local buffers. However, both failed to
count write IO timing for flushes of dirty local buffers. Add this to
achieve parity.

Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/20230321023451.7rzy4kjj2iktrg2r%40awork3.anarazel.de
---
 src/backend/storage/buffer/localbuf.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 3846d3eaca..3639296bc1 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -176,6 +176,8 @@ GetLocalVictimBuffer(void)
 	int			trycounter;
 	uint32		buf_state;
 	BufferDesc *bufHdr;
+	instr_time	io_start,
+				io_time;
 
 	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
 
@@ -239,6 +241,11 @@ GetLocalVictimBuffer(void)
 
 		PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
+		if (track_io_timing)
+			INSTR_TIME_SET_CURRENT(io_start);
+		else
+			INSTR_TIME_SET_ZERO(io_start);
+
 		/* And write... */
 		smgrwrite(oreln,
 				  BufTagGetForkNum(&bufHdr->tag),
@@ -252,6 +259,15 @@ GetLocalVictimBuffer(void)
 
 		/* Temporary table I/O does not use Buffer Access Strategies */
 		pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
+
+		if (track_io_timing)
+		{
+			INSTR_TIME_SET_CURRENT(io_time);
+			INSTR_TIME_SUBTRACT(io_time, io_start);
+			pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
+			INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
+		}
+
 		pgBufferUsage.local_blks_written++;
 	}
 
-- 
2.37.2

v9-0003-Track-IO-times-in-pg_stat_io.patch
From 023554df02490e1098008bdb2b512bbd78974589 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Fri, 31 Mar 2023 15:32:36 -0400
Subject: [PATCH v9 3/4] Track IO times in pg_stat_io

a9c70b46dbe and 8aaa04b32S added counting of IO operations to a new
view, pg_stat_io. Now, add IO timing for reads, writes, extends, and
fsyncs to pg_stat_io as well.

Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_ay5iKmnbXZ3DsauViF3eMxu4m1oNnJXqV_HyqYeg55Ww%40mail.gmail.com
---
 doc/src/sgml/monitoring.sgml           |  59 ++++++++++++++
 src/backend/catalog/system_views.sql   |   4 +
 src/backend/storage/buffer/bufmgr.c    |  66 +++++----------
 src/backend/storage/buffer/localbuf.c  |  31 +++-----
 src/backend/storage/smgr/md.c          |  31 +++++---
 src/backend/utils/activity/pgstat_io.c | 106 ++++++++++++++++++++-----
 src/backend/utils/adt/pgstatfuncs.c    |  68 ++++++++++++++--
 src/include/catalog/pg_proc.dat        |   6 +-
 src/include/pgstat.h                   |   8 +-
 src/test/regress/expected/rules.out    |   6 +-
 src/tools/pgindent/typedefs.list       |   1 +
 11 files changed, 278 insertions(+), 108 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index bce9ae4661..7da135d2f2 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3814,6 +3814,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>read_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in read operations in milliseconds (if
+        <xref linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
@@ -3826,6 +3838,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>write_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in write operations in milliseconds (if
+        <xref linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
@@ -3838,6 +3862,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>extend_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in extend operations in milliseconds (if
+        <xref linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
@@ -3913,6 +3949,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>fsync_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in fsync operations in milliseconds (if
+        <xref linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
@@ -3978,6 +4026,17 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
    </itemizedlist>
   </para>
 
+  <note>
+   <para>
+    Columns tracking I/O time will only be non-zero when
+    <xref linkend="guc-track-io-timing"/> is enabled. The user should be
+    careful when referencing these columns in combination with their
+    corresponding IO operations in case <varname>track_io_timing</varname>
+    was not enabled for the entire time since the last stats reset.
+   </para>
+  </note>
+
+
 
  </sect2>
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 6b098234f8..41aafe0b87 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1125,13 +1125,17 @@ SELECT
        b.io_object,
        b.io_context,
        b.reads,
+       b.read_time,
        b.writes,
+       b.write_time,
        b.extends,
+       b.extend_time,
        b.op_bytes,
        b.hits,
        b.evictions,
        b.reuses,
        b.fsyncs,
+       b.fsync_time,
        b.stats_reset
 FROM pg_stat_get_io() b;
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index ef69f21adf..a12d0c6c27 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1112,23 +1112,12 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		MemSet((char *) bufBlock, 0, BLCKSZ);
 	else
 	{
-		instr_time	io_start,
-					io_time;
-
-		if (track_io_timing)
-			INSTR_TIME_SET_CURRENT(io_start);
+		instr_time	io_start = pgstat_prepare_io_time();
 
 		smgrread(smgr, forkNum, blockNum, bufBlock);
 
-		if (track_io_timing)
-		{
-			INSTR_TIME_SET_CURRENT(io_time);
-			INSTR_TIME_SUBTRACT(io_time, io_start);
-			pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
-			INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
-		}
-
-		pgstat_count_io_op(io_object, io_context, IOOP_READ);
+		pgstat_count_io_op_time(io_object, io_context,
+								IOOP_READ, io_start, 1);
 
 		/* check for garbage data */
 		if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
@@ -1837,6 +1826,7 @@ ExtendBufferedRelShared(ExtendBufferedWhat eb,
 {
 	BlockNumber first_block;
 	IOContext	io_context = IOContextForStrategy(strategy);
+	instr_time	io_start;
 
 	LimitAdditionalPins(&extend_by);
 
@@ -2044,6 +2034,8 @@ ExtendBufferedRelShared(ExtendBufferedWhat eb,
 		}
 	}
 
+	io_start = pgstat_prepare_io_time();
+
 	/*
 	 * Note: if smgzerorextend fails, we will end up with buffers that are
 	 * allocated but not marked BM_VALID.  The next relation extension will
@@ -2066,6 +2058,9 @@ ExtendBufferedRelShared(ExtendBufferedWhat eb,
 	if (!(flags & EB_SKIP_EXTENSION_LOCK))
 		UnlockRelationForExtension(eb.rel, ExclusiveLock);
 
+	pgstat_count_io_op_time(IOOBJECT_RELATION, io_context, IOOP_EXTEND,
+							io_start, extend_by);
+
 	/* Set BM_VALID, terminate IO, and wake up any waiters */
 	for (int i = 0; i < extend_by; i++)
 	{
@@ -2089,8 +2084,6 @@ ExtendBufferedRelShared(ExtendBufferedWhat eb,
 	}
 
 	pgBufferUsage.shared_blks_written += extend_by;
-	pgstat_count_io_op_n(IOOBJECT_RELATION, io_context, IOOP_EXTEND,
-						 extend_by);
 
 	*extended_by = extend_by;
 
@@ -3344,8 +3337,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 {
 	XLogRecPtr	recptr;
 	ErrorContextCallback errcallback;
-	instr_time	io_start,
-				io_time;
+	instr_time	io_start;
 	Block		bufBlock;
 	char	   *bufToWrite;
 	uint32		buf_state;
@@ -3420,10 +3412,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 */
 	bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
 
-	if (track_io_timing)
-		INSTR_TIME_SET_CURRENT(io_start);
-	else
-		INSTR_TIME_SET_ZERO(io_start);
+	io_start = pgstat_prepare_io_time();
 
 	/*
 	 * bufToWrite is either the shared buffer or a copy, as appropriate.
@@ -3452,15 +3441,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * When a strategy is not in use, the write can only be a "regular" write
 	 * of a dirty shared buffer (IOCONTEXT_NORMAL IOOP_WRITE).
 	 */
-	pgstat_count_io_op(IOOBJECT_RELATION, io_context, IOOP_WRITE);
-
-	if (track_io_timing)
-	{
-		INSTR_TIME_SET_CURRENT(io_time);
-		INSTR_TIME_SUBTRACT(io_time, io_start);
-		pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
-		INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
-	}
+	pgstat_count_io_op_time(IOOBJECT_RELATION, io_context,
+							IOOP_WRITE, io_start, 1);
 
 	pgBufferUsage.shared_blks_written++;
 
@@ -4062,14 +4044,13 @@ FlushRelationBuffers(Relation rel)
 {
 	int			i;
 	BufferDesc *bufHdr;
-	instr_time	io_start,
-				io_time;
 
 	if (RelationUsesLocalBuffers(rel))
 	{
 		for (i = 0; i < NLocBuffer; i++)
 		{
 			uint32		buf_state;
+			instr_time	io_start;
 
 			bufHdr = GetLocalBufferDescriptor(i);
 			if (BufTagMatchesRelFileLocator(&bufHdr->tag, &rel->rd_locator) &&
@@ -4089,10 +4070,7 @@ FlushRelationBuffers(Relation rel)
 
 				PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
-				if (track_io_timing)
-					INSTR_TIME_SET_CURRENT(io_start);
-				else
-					INSTR_TIME_SET_ZERO(io_start);
+				io_start = pgstat_prepare_io_time();
 
 				smgrwrite(RelationGetSmgr(rel),
 						  BufTagGetForkNum(&bufHdr->tag),
@@ -4100,19 +4078,13 @@ FlushRelationBuffers(Relation rel)
 						  localpage,
 						  false);
 
+				pgstat_count_io_op_time(IOOBJECT_TEMP_RELATION,
+										IOCONTEXT_NORMAL, IOOP_WRITE,
+										io_start, 1);
+
 				buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
 				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
 
-				pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
-
-				if (track_io_timing)
-				{
-					INSTR_TIME_SET_CURRENT(io_time);
-					INSTR_TIME_SUBTRACT(io_time, io_start);
-					pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
-					INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
-				}
-
 				pgBufferUsage.local_blks_written++;
 
 				/* Pop the error context stack */
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 3639296bc1..3d5bc9193d 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -176,8 +176,6 @@ GetLocalVictimBuffer(void)
 	int			trycounter;
 	uint32		buf_state;
 	BufferDesc *bufHdr;
-	instr_time	io_start,
-				io_time;
 
 	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
 
@@ -233,6 +231,7 @@ GetLocalVictimBuffer(void)
 	 */
 	if (buf_state & BM_DIRTY)
 	{
+		instr_time	io_start;
 		SMgrRelation oreln;
 		Page		localpage = (char *) LocalBufHdrGetBlock(bufHdr);
 
@@ -241,10 +240,7 @@ GetLocalVictimBuffer(void)
 
 		PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
-		if (track_io_timing)
-			INSTR_TIME_SET_CURRENT(io_start);
-		else
-			INSTR_TIME_SET_ZERO(io_start);
+		io_start = pgstat_prepare_io_time();
 
 		/* And write... */
 		smgrwrite(oreln,
@@ -253,21 +249,14 @@ GetLocalVictimBuffer(void)
 				  localpage,
 				  false);
 
+		/* Temporary table I/O does not use Buffer Access Strategies */
+		pgstat_count_io_op_time(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL,
+								IOOP_WRITE, io_start, 1);
+
 		/* Mark not-dirty now in case we error out below */
 		buf_state &= ~BM_DIRTY;
 		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
 
-		/* Temporary table I/O does not use Buffer Access Strategies */
-		pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
-
-		if (track_io_timing)
-		{
-			INSTR_TIME_SET_CURRENT(io_time);
-			INSTR_TIME_SUBTRACT(io_time, io_start);
-			pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
-			INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
-		}
-
 		pgBufferUsage.local_blks_written++;
 	}
 
@@ -325,6 +314,7 @@ ExtendBufferedRelLocal(ExtendBufferedWhat eb,
 					   uint32 *extended_by)
 {
 	BlockNumber first_block;
+	instr_time	io_start;
 
 	/* Initialize local buffers if first request in this session */
 	if (LocalBufHash == NULL)
@@ -415,9 +405,14 @@ ExtendBufferedRelLocal(ExtendBufferedWhat eb,
 		}
 	}
 
+	io_start = pgstat_prepare_io_time();
+
 	/* actually extend relation */
 	smgrzeroextend(eb.smgr, fork, first_block, extend_by, false);
 
+	pgstat_count_io_op_time(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_EXTEND,
+							io_start, extend_by);
+
 	for (int i = 0; i < extend_by; i++)
 	{
 		Buffer		buf = buffers[i];
@@ -434,8 +429,6 @@ ExtendBufferedRelLocal(ExtendBufferedWhat eb,
 	*extended_by = extend_by;
 
 	pgBufferUsage.temp_blks_written += extend_by;
-	pgstat_count_io_op_n(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_EXTEND,
-						 extend_by);
 
 	return first_block;
 }
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 1c2d1405f8..d9d0367c89 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1138,6 +1138,19 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 
 	if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false /* retryOnError */ ))
 	{
+		instr_time	io_start;
+
+		ereport(DEBUG1,
+				(errmsg_internal("could not forward fsync request because request queue is full")));
+
+		io_start = pgstat_prepare_io_time();
+
+		if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
+			ereport(data_sync_elevel(ERROR),
+					(errcode_for_file_access(),
+					 errmsg("could not fsync file \"%s\": %m",
+							FilePathName(seg->mdfd_vfd))));
+
 		/*
 		 * We have no way of knowing if the current IOContext is
 		 * IOCONTEXT_NORMAL or IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] at this
@@ -1149,16 +1162,8 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 		 * IOCONTEXT_NORMAL is likely clearer when investigating the number of
 		 * backend fsyncs.
 		 */
-		pgstat_count_io_op(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC);
-
-		ereport(DEBUG1,
-				(errmsg_internal("could not forward fsync request because request queue is full")));
-
-		if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
-			ereport(data_sync_elevel(ERROR),
-					(errcode_for_file_access(),
-					 errmsg("could not fsync file \"%s\": %m",
-							FilePathName(seg->mdfd_vfd))));
+		pgstat_count_io_op_time(IOOBJECT_RELATION, IOCONTEXT_NORMAL,
+								IOOP_FSYNC, io_start, 1);
 	}
 }
 
@@ -1508,6 +1513,7 @@ mdsyncfiletag(const FileTag *ftag, char *path)
 {
 	SMgrRelation reln = smgropen(ftag->rlocator, InvalidBackendId);
 	File		file;
+	instr_time	io_start;
 	bool		need_to_close;
 	int			result,
 				save_errno;
@@ -1533,6 +1539,8 @@ mdsyncfiletag(const FileTag *ftag, char *path)
 		need_to_close = true;
 	}
 
+	io_start = pgstat_prepare_io_time();
+
 	/* Sync the file. */
 	result = FileSync(file, WAIT_EVENT_DATA_FILE_SYNC);
 	save_errno = errno;
@@ -1540,7 +1548,8 @@ mdsyncfiletag(const FileTag *ftag, char *path)
 	if (need_to_close)
 		FileClose(file);
 
-	pgstat_count_io_op(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC);
+	pgstat_count_io_op_time(IOOBJECT_RELATION, IOCONTEXT_NORMAL,
+							IOOP_FSYNC, io_start, 1);
 
 	errno = save_errno;
 	return result;
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 0e4f26427c..bc6787b3f3 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -16,44 +16,55 @@
 
 #include "postgres.h"
 
+#include "executor/instrument.h"
+#include "storage/bufmgr.h"
 #include "utils/pgstat_internal.h"
 
 
-static PgStat_BktypeIO PendingIOStats;
+typedef struct PgStat_PendingIO
+{
+	PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	instr_time	pending_times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+} PgStat_PendingIO;
+
+
+static PgStat_PendingIO PendingIOStats;
 bool		have_iostats = false;
 
 
 /*
  * Check that stats have not been counted for any combination of IOObject,
- * IOContext, and IOOp which are not tracked for the passed-in BackendType. The
- * passed-in PgStat_BktypeIO must contain stats from the BackendType specified
- * by the second parameter. Caller is responsible for locking the passed-in
- * PgStat_BktypeIO, if needed.
+ * IOContext, and IOOp which are not tracked for the passed-in BackendType. If
+ * stats are tracked for this combination and IO times are non-zero, counts
+ * should be non-zero.
+ *
+ * The passed-in PgStat_BktypeIO must contain stats from the BackendType
+ * specified by the second parameter. Caller is responsible for locking the
+ * passed-in PgStat_BktypeIO, if needed.
  */
 bool
 pgstat_bktype_io_stats_valid(PgStat_BktypeIO *backend_io,
 							 BackendType bktype)
 {
-	bool		bktype_tracked = pgstat_tracks_io_bktype(bktype);
-
 	for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
 	{
 		for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
 		{
-			/*
-			 * Don't bother trying to skip to the next loop iteration if
-			 * pgstat_tracks_io_object() would return false here. We still
-			 * need to validate that each counter is zero anyway.
-			 */
 			for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
 			{
-				/* No stats, so nothing to validate */
-				if (backend_io->data[io_object][io_context][io_op] == 0)
+				/* we do track it */
+				if (pgstat_tracks_io_op(bktype, io_object, io_context, io_op))
+				{
+					/* ensure that if IO times are non-zero, counts are > 0 */
+					if (backend_io->times[io_object][io_context][io_op] != 0 &&
+						backend_io->counts[io_object][io_context][io_op] <= 0)
+						return false;
+
 					continue;
+				}
 
-				/* There are stats and there shouldn't be */
-				if (!bktype_tracked ||
-					!pgstat_tracks_io_op(bktype, io_object, io_context, io_op))
+				/* we don't track it, and it is not 0 */
+				if (backend_io->counts[io_object][io_context][io_op] != 0)
 					return false;
 			}
 		}
@@ -76,11 +87,57 @@ pgstat_count_io_op_n(IOObject io_object, IOContext io_context, IOOp io_op, uint3
 	Assert((unsigned int) io_op < IOOP_NUM_TYPES);
 	Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
 
-	PendingIOStats.data[io_object][io_context][io_op] += cnt;
+	PendingIOStats.counts[io_object][io_context][io_op] += cnt;
 
 	have_iostats = true;
 }
 
+instr_time
+pgstat_prepare_io_time(void)
+{
+	instr_time	io_start;
+
+	if (track_io_timing)
+		INSTR_TIME_SET_CURRENT(io_start);
+	else
+		INSTR_TIME_SET_ZERO(io_start);
+
+	return io_start;
+}
+
+/*
+ * Like pgstat_count_io_op_n() except it also accumulates time.
+ */
+void
+pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
+						instr_time start_time, uint32 cnt)
+{
+	if (track_io_timing)
+	{
+		instr_time	io_time;
+
+		INSTR_TIME_SET_CURRENT(io_time);
+		INSTR_TIME_SUBTRACT(io_time, start_time);
+
+		if (io_op == IOOP_WRITE)
+		{
+			pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
+			if (io_object == IOOBJECT_RELATION)
+				INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
+		}
+		else if (io_op == IOOP_READ)
+		{
+			pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
+			if (io_object == IOOBJECT_RELATION)
+				INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
+		}
+
+		INSTR_TIME_ADD(PendingIOStats.pending_times[io_object][io_context][io_op], io_time);
+	}
+
+	pgstat_count_io_op_n(io_object, io_context, io_op, cnt);
+}
+
 PgStat_IO *
 pgstat_fetch_stat_io(void)
 {
@@ -120,8 +177,17 @@ pgstat_flush_io(bool nowait)
 		for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
 		{
 			for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
-				bktype_shstats->data[io_object][io_context][io_op] +=
-					PendingIOStats.data[io_object][io_context][io_op];
+			{
+				instr_time	time;
+
+				bktype_shstats->counts[io_object][io_context][io_op] +=
+					PendingIOStats.counts[io_object][io_context][io_op];
+
+				time = PendingIOStats.pending_times[io_object][io_context][io_op];
+
+				bktype_shstats->times[io_object][io_context][io_op] +=
+					INSTR_TIME_GET_MICROSEC(time);
+			}
 		}
 	}
 
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index eec9f3cf9b..1bef923527 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1256,13 +1256,17 @@ typedef enum io_stat_col
 	IO_COL_IO_OBJECT,
 	IO_COL_IO_CONTEXT,
 	IO_COL_READS,
+	IO_COL_READ_TIME,
 	IO_COL_WRITES,
+	IO_COL_WRITE_TIME,
 	IO_COL_EXTENDS,
+	IO_COL_EXTEND_TIME,
 	IO_COL_CONVERSION,
 	IO_COL_HITS,
 	IO_COL_EVICTIONS,
 	IO_COL_REUSES,
 	IO_COL_FSYNCS,
+	IO_COL_FSYNC_TIME,
 	IO_COL_RESET_TIME,
 	IO_NUM_COLUMNS,
 } io_stat_col;
@@ -1296,6 +1300,52 @@ pgstat_get_io_op_index(IOOp io_op)
 	pg_unreachable();
 }
 
+/*
+ * Get the number of the column containing IO times for the specified IOOp.
+ * This function encodes our assumption that IO time for an IOOp is displayed
+ * in the view in the column directly after the IOOp counts.
+ */
+static io_stat_col
+pgstat_get_io_time_index(IOOp io_op)
+{
+	return pgstat_get_io_op_index(io_op) + 1;
+}
+
+static inline
+double
+pg_stat_micro_to_millisecs(PgStat_Counter val_microsec)
+{
+	return val_microsec * 0.001;
+}
+
+/*
+ * PgStat_BktypeIO->times contains IO times for IOOps. For simplicity this
+ * array has a spot for every IOOp. pgstat_tracks_io_time() is the source of
+ * truth for which IOOps have corresponding IO times.
+ */
+static IOOp
+pgstat_tracks_io_time(IOOp io_op)
+{
+	switch (io_op)
+	{
+		case IOOP_READ:
+			return IOOP_READ;
+		case IOOP_WRITE:
+			return IOOP_WRITE;
+		case IOOP_EXTEND:
+			return IOOP_EXTEND;
+		case IOOP_FSYNC:
+			return IOOP_FSYNC;
+		case IOOP_EVICT:
+		case IOOP_HIT:
+		case IOOP_REUSE:
+			return -1;
+	}
+
+	elog(ERROR, "unrecognized IOOp value: %d", io_op);
+	pg_unreachable();
+}
+
 Datum
 pg_stat_get_io(PG_FUNCTION_ARGS)
 {
@@ -1363,20 +1413,28 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
 
 				for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
 				{
-					int			col_idx = pgstat_get_io_op_index(io_op);
+					PgStat_Counter count = bktype_stats->counts[io_obj][io_context][io_op];
+					PgStat_Counter time = bktype_stats->times[io_obj][io_context][io_op];
+					int			op_idx = pgstat_get_io_op_index(io_op);
+					int			time_idx = pgstat_get_io_time_index(io_op);
 
 					/*
 					 * Some combinations of BackendType and IOOp, of IOContext
 					 * and IOOp, and of IOObject and IOOp are not tracked. Set
 					 * these cells in the view NULL.
 					 */
-					nulls[col_idx] = !pgstat_tracks_io_op(bktype, io_obj, io_context, io_op);
+					if (pgstat_tracks_io_op(bktype, io_obj, io_context, io_op))
+						values[op_idx] = Int64GetDatum(count);
+					else
+						nulls[op_idx] = true;
 
-					if (nulls[col_idx])
+					if (pgstat_tracks_io_time(io_op) == -1)
 						continue;
 
-					values[col_idx] =
-						Int64GetDatum(bktype_stats->data[io_obj][io_context][io_op]);
+					if (!nulls[op_idx])
+						values[time_idx] = Float8GetDatum(pg_stat_micro_to_millisecs(time));
+					else
+						nulls[time_idx] = true;
 				}
 
 				tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index f9f2642201..664a7ab41d 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5749,9 +5749,9 @@
   proname => 'pg_stat_get_io', provolatile => 'v',
   prorows => '30', proretset => 't',
   proparallel => 'r', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{text,text,text,int8,int8,int8,int8,int8,int8,int8,int8,timestamptz}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{backend_type,io_object,io_context,reads,writes,extends,op_bytes,hits,evictions,reuses,fsyncs,stats_reset}',
+  proallargtypes => '{text,text,text,int8,float8,int8,float8,int8,float8,int8,int8,int8,int8,int8,float8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{backend_type,io_object,io_context,reads,read_time,writes,write_time,extends,extend_time,op_bytes,hits,evictions,reuses,fsyncs,fsync_time,stats_reset}',
   prosrc => 'pg_stat_get_io' },
 
 { oid => '1136', descr => 'statistics: information about WAL activity',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index e79b8a34eb..6f9ad0cf31 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -306,7 +306,8 @@ typedef enum IOOp
 
 typedef struct PgStat_BktypeIO
 {
-	PgStat_Counter data[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	PgStat_Counter times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 } PgStat_BktypeIO;
 
 typedef struct PgStat_IO
@@ -517,6 +518,10 @@ extern bool pgstat_bktype_io_stats_valid(PgStat_BktypeIO *context_ops,
 										 BackendType bktype);
 extern void pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op);
 extern void pgstat_count_io_op_n(IOObject io_object, IOContext io_context, IOOp io_op, uint32 cnt);
+extern instr_time pgstat_prepare_io_time(void);
+extern void pgstat_count_io_op_time(IOObject io_object, IOContext io_context,
+									IOOp io_op, instr_time time, uint32 cnt);
+
 extern PgStat_IO *pgstat_fetch_stat_io(void);
 extern const char *pgstat_get_io_context_name(IOContext io_context);
 extern const char *pgstat_get_io_object_name(IOObject io_object);
@@ -527,7 +532,6 @@ extern bool pgstat_tracks_io_object(BackendType bktype,
 extern bool pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 								IOContext io_context, IOOp io_op);
 
-
 /*
  * Functions in pgstat_database.c
  */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index ab1aebfde4..d939d8067e 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1881,15 +1881,19 @@ pg_stat_io| SELECT backend_type,
     io_object,
     io_context,
     reads,
+    read_time,
     writes,
+    write_time,
     extends,
+    extend_time,
     op_bytes,
     hits,
     evictions,
     reuses,
     fsyncs,
+    fsync_time,
     stats_reset
-   FROM pg_stat_get_io() b(backend_type, io_object, io_context, reads, writes, extends, op_bytes, hits, evictions, reuses, fsyncs, stats_reset);
+   FROM pg_stat_get_io() b(backend_type, io_object, io_context, reads, read_time, writes, write_time, extends, extend_time, op_bytes, hits, evictions, reuses, fsyncs, fsync_time, stats_reset);
 pg_stat_progress_analyze| SELECT s.pid,
     s.datid,
     d.datname,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 3219ea5f05..494cc66d5b 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2053,6 +2053,7 @@ PgStat_Kind
 PgStat_KindInfo
 PgStat_LocalState
 PgStat_PendingDroppedStatsItem
+PgStat_PendingIO
 PgStat_PendingWalStats
 PgStat_SLRUStats
 PgStat_ShmemControl
-- 
2.37.2

v9-0002-FlushRelationBuffers-counts-temp-relation-IO-writ.patch (text/x-patch; charset=US-ASCII)
From b7960372c9ea618971dadb585e8dfab2afc425c6 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 21 Mar 2023 18:20:44 -0400
Subject: [PATCH v9 2/4] FlushRelationBuffers() counts temp relation IO writes

FlushRelationBuffers() seems to have omitted counting write IO stats for
temporary relations in both pgstat_database and pgBufferUsage. Add the count
of local blocks written to pgBufferUsage, and add write IO timing to both
pgstat_database and pgBufferUsage.

Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/20230321023451.7rzy4kjj2iktrg2r%40awork3.anarazel.de
---
 src/backend/storage/buffer/bufmgr.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 908a8934bd..ef69f21adf 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -4062,6 +4062,8 @@ FlushRelationBuffers(Relation rel)
 {
 	int			i;
 	BufferDesc *bufHdr;
+	instr_time	io_start,
+				io_time;
 
 	if (RelationUsesLocalBuffers(rel))
 	{
@@ -4087,6 +4089,11 @@ FlushRelationBuffers(Relation rel)
 
 				PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
+				if (track_io_timing)
+					INSTR_TIME_SET_CURRENT(io_start);
+				else
+					INSTR_TIME_SET_ZERO(io_start);
+
 				smgrwrite(RelationGetSmgr(rel),
 						  BufTagGetForkNum(&bufHdr->tag),
 						  bufHdr->tag.blockNum,
@@ -4098,6 +4105,16 @@ FlushRelationBuffers(Relation rel)
 
 				pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
 
+				if (track_io_timing)
+				{
+					INSTR_TIME_SET_CURRENT(io_time);
+					INSTR_TIME_SUBTRACT(io_time, io_start);
+					pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
+					INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
+				}
+
+				pgBufferUsage.local_blks_written++;
+
 				/* Pop the error context stack */
 				error_context_stack = errcallback.previous;
 			}
-- 
2.37.2

#20 Andres Freund
andres@anarazel.de
In reply to: Melanie Plageman (#19)
Re: Track IO times in pg_stat_io

Hi,

On 2023-04-07 12:17:38 -0400, Melanie Plageman wrote:

Attached v9 addresses review feedback and resolves merge conflicts with the
recent relation extension patchset.

I've edited it a bit more:

- removed pgstat_tracks_io_time() and replaced it by having
pgstat_get_io_time_index() return the new IO_COL_INVALID = -1 when there is
no time column for the IOOp

- moved PgStat_Counter count, time into the respective branches. It feels
somewhat wrong to access the time when we then decide there is no time.

- s/io_object/io_obj/ in pgstat_count_io_op_time(), combined with added
linebreaks, got the code to under 80 chars

- renamed pg_stat_microseconds_to_milliseconds to pg_stat_us_to_ms

- removed a spurious newline

- the times reported by pg_stat_io had their fractional part removed, due to
pg_stat_us_to_ms returning an integer
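
For illustration, two of the changes in the list above (the IO_COL_INVALID
sentinel and the microsecond-to-millisecond conversion) might look roughly
like the sketch below. This is not the committed code: the enums are
trimmed-down stand-ins, and time_col_for_op(), us_to_ms_truncating() and
us_to_ms() are hypothetical names chosen for this example.

```c
#include <stdint.h>

/* Trimmed-down stand-ins for the patch's enums (illustrative only). */
typedef enum IOOp
{
	IOOP_EVICT,
	IOOP_EXTEND,
	IOOP_FSYNC,
	IOOP_HIT,
	IOOP_READ,
	IOOP_REUSE,
	IOOP_WRITE
} IOOp;

typedef enum io_stat_col
{
	IO_COL_INVALID = -1,		/* "this IOOp has no time column" */
	IO_COL_READ_TIME,
	IO_COL_WRITE_TIME,
	IO_COL_EXTEND_TIME,
	IO_COL_FSYNC_TIME
} io_stat_col;

/* Hypothetical helper: map an IOOp to its time column, or IO_COL_INVALID. */
static io_stat_col
time_col_for_op(IOOp io_op)
{
	switch (io_op)
	{
		case IOOP_READ:
			return IO_COL_READ_TIME;
		case IOOP_WRITE:
			return IO_COL_WRITE_TIME;
		case IOOP_EXTEND:
			return IO_COL_EXTEND_TIME;
		case IOOP_FSYNC:
			return IO_COL_FSYNC_TIME;
		case IOOP_EVICT:
		case IOOP_HIT:
		case IOOP_REUSE:
			return IO_COL_INVALID;
	}

	return IO_COL_INVALID;		/* not reached for valid IOOp values */
}

/* Integer return type: the fractional milliseconds are silently dropped. */
static int64_t
us_to_ms_truncating(int64_t val_us)
{
	return (int64_t) (val_us * 0.001);
}

/* Floating-point return type: the fraction survives. */
static double
us_to_ms(int64_t val_us)
{
	return val_us * 0.001;
}
```

With these definitions, 1500 microseconds comes back as 1 from the integer
variant and as roughly 1.5 from the floating-point one, which is the kind of
loss described in the last item of the list.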

While verifying this, I saw that the write time visible in pg_stat_io didn't
always match what log_checkpoints reported. Eventually I figured out that
this isn't pg_stat_io's fault: the write time in log_checkpoints includes a
lot of other things, including several other CheckPoint* routines, flushing
WAL, asking the kernel to flush things to disk... The biggest portion in my
case was the smgrwriteback() calls, which pg_stat_io doesn't track. Oops.

Pushed up to and including 0003.

I've changed pgstat_count_io_op_time() to take a count and call
pgstat_count_io_op_n() so it can be used with smgrzeroextend(). I do
wish that the parameter to pgstat_count_io_op_n() was called "count" and
not "cnt"...

Heh.
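
A minimal sketch of the shape described in the quoted paragraph: a timing
helper that takes a count and forwards it to the plain counting function, so
one timed call can account for several blocks. Everything here is an
assumption for illustration. IoStats, io_time_start(), io_count_n(),
io_count_time(), the track_timing flag and the fake_now_us clock are
hypothetical stand-ins, not the PostgreSQL functions or their signatures.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the track_io_timing setting. */
static bool track_timing = true;

/*
 * Hypothetical clock, in microseconds. Real code would read a monotonic
 * clock; a plain variable keeps the sketch deterministic.
 */
static uint64_t fake_now_us = 0;

/* Hypothetical accumulator for one kind of IO operation. */
typedef struct IoStats
{
	uint64_t	count;			/* number of operations */
	uint64_t	time_us;		/* accumulated time, microseconds */
} IoStats;

/* Take a start timestamp only when timing is enabled, else return zero. */
static uint64_t
io_time_start(void)
{
	return track_timing ? fake_now_us : 0;
}

/* Count cnt operations without touching the timing. */
static void
io_count_n(IoStats *stats, uint32_t cnt)
{
	stats->count += cnt;
}

/*
 * Add the elapsed time (if timing is enabled) and then count cnt operations
 * by delegating to io_count_n().
 */
static void
io_count_time(IoStats *stats, uint64_t start_us, uint32_t cnt)
{
	if (track_timing)
		stats->time_us += fake_now_us - start_us;

	io_count_n(stats, cnt);
}
```

A caller would take the start value before issuing the IO and pass it back
afterwards together with the number of blocks handled; with timing disabled
only the count changes.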

I've also reordered the call site of pgstat_count_io_op_time() in a few
locations, but I have some questions about this.

Before, I didn't think it mattered much that we didn't finish counting
IO time until after setting BM_VALID or BM_DIRTY and unsetting
BM_IO_IN_PROGRESS. With the relation extension code doing this for many
buffers at once, though, I wondered if this will make the IO timing too
inaccurate.

As such, I've moved pgstat_count_io_op_time() to before we set those
flags in all locations. I did wonder if it is bad to prolong having the
buffer pinned and not having those flags set, though.

I went back and forth about this before. I think it's ok the way you did it.

I think 0004 needs a bit more work. At the very least we would have to swap
the order of pgstat_flush_pending_entries() and pgstat_flush_io(), which is
entirely doable. Unlike 0003, this doesn't make pg_stat_io more complete or
anything like that, so I'm inclined to leave it for 17. I think there might
be some more opportunities for having counts "flow down", like the patch does.

Greetings,

Andres Freund

#21 Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Andres Freund (#9)
Re: Track IO times in pg_stat_io

Hi,

On Wed, Mar 08, 2023 at 04:34:38PM -0800, Andres Freund wrote:

On 2023-03-08 12:55:34 +0100, Drouvot, Bertrand wrote:

- pg_stat_io is "global" across all sessions. So, even if one session is doing
some "testing" and needs to turn track_io_timing on, it cannot be sure that
the view reflects only its own testing (as other sessions may have turned it
on too).

I think for 17 we should provide access to per-existing-connection pg_stat_io
stats, and also provide a database aggregated version. Neither should be
particularly hard.

FWIW, I think that would be great and plan to have a look at this (unless someone
beats me to it).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#22 Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Bertrand Drouvot (#21)
Re: Track IO times in pg_stat_io

Hi,

On Fri, Aug 23, 2024 at 07:32:16AM +0000, Bertrand Drouvot wrote:

Hi,

On Wed, Mar 08, 2023 at 04:34:38PM -0800, Andres Freund wrote:

On 2023-03-08 12:55:34 +0100, Drouvot, Bertrand wrote:

- pg_stat_io is "global" across all sessions. So, even if one session is doing
some "testing" and needs to turn track_io_timing on, it cannot be sure that
the view reflects only its own testing (as other sessions may have turned it
on too).

I think for 17 we should provide access to per-existing-connection pg_stat_io
stats, and also provide a database aggregated version. Neither should be
particularly hard.

FWIW, I think that would be great and plan to have a look at this (unless someone
beats me to it).

FWIW, here is the patch proposal for per-backend I/O statistics [1].

[1]: /messages/by-id/ZtXR+CtkEVVE/LHF@ip-10-97-1-34.eu-west-3.compute.internal

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#23 Michael Paquier
michael@paquier.xyz
In reply to: Bertrand Drouvot (#22)
Re: Track IO times in pg_stat_io

On Mon, Sep 02, 2024 at 03:00:32PM +0000, Bertrand Drouvot wrote:

On Fri, Aug 23, 2024 at 07:32:16AM +0000, Bertrand Drouvot wrote:

FWIW, I think that would be great and plan to have a look at this (unless someone
beats me to it).

FWIW, here is the patch proposal for per backend I/O statistics [1].

[1]: /messages/by-id/ZtXR+CtkEVVE/LHF@ip-10-97-1-34.eu-west-3.compute.internal

Cool, thanks!
--
Michael