Patch: add timing of buffer I/O requests

Started by Ants Aasma · over 14 years ago · 89 messages · pgsql-hackers
#1 Ants Aasma
ants.aasma@cybertec.at

Hi,

I know I'm late for the ongoing commitfest, but I thought I might as well
be early for the next one.

Attached is a patch against master that implements timing of shared buffer
fills and per relation stats collection of said timings. Buffer flushes are
timed as well but aren't exposed per table because of the difficulty of
correctly attributing them.

Some notes on the implementation:
* The timing is done in bufmgr.c. Under high load some CPU contention will
get attributed to I/O because the process doing the I/O won't get a
timeslice immediately.
* I decided to also count waiting for others doing the I/O as I/O waits.
They aren't double counted in the per relation stats though.
* I added a GUC track_iotiming defaulting to off because timing isn't cheap
on all platforms.
* I used instr_time to keep the counts to be consistent with function
timings, but maybe both should be converted to plain uint64 counts to make
the arithmetic code cleaner.
* Timings are exposed via EXPLAIN (BUFFERS), pg_stat_statements and
pg_statio_* views.
* I noticed there aren't any pg_statio_xact_* views. I don't have any need
for them myself, but thought I'd mention the inconsistency.
* The feature is really useful for me with auto_explain. Even better with
Peter's pg_stat_statements query cleaning applied.

I did some testing on an older AMD Athlon X2 BE-2350 and an Intel i5 M 540
to see the overhead. The AMD CPU doesn't have the necessary features for
fast user mode timing and has an overhead of about 900ns per gettimeofday
call. The Intel processor has an overhead of 22ns per call.

I tried a read only pgbench with scalefactor 50 and shared_buffers=32MB to
induce a lot of IO traffic that hits the OS cache. Seems like it should be
the worst case for this patch.

On the AMD I saw about 3% performance drop with timing enabled. On the
Intel machine I couldn't measure any statistically significant change. The
median was actually higher with timing enabled, but stddevs were large
enough to hide a couple of percent of performance loss. This needs some
further testing.

Preliminary results for the Intel machine with stddev (10 5min runs):
-c | master          | io-stats
 4 | 16521.53 ±4.49% | +1.16% ±3.21%
16 | 13923.49 ±5.98% | +0.56% ±7.11%

This is my first patch, so I hope I haven't missed anything too trivial.

--
Ants Aasma
ants.aasma@eesti.ee

Attachments:

io-stats.v1.patch (text/x-patch; charset=US-ASCII), +317 -72
#2 Greg Smith
gsmith@gregsmith.com
In reply to: Ants Aasma (#1)
Re: Patch: add timing of buffer I/O requests

On 11/27/2011 04:39 PM, Ants Aasma wrote:

On the AMD I saw about 3% performance drop with timing enabled. On the
Intel machine I couldn't measure any statistically significant change.

Oh no, it's party pooper time again. Sorry I have to be the one to do
it this round. The real problem with this whole area is that we know
there are systems floating around where the amount of time taken to grab
timestamps like this is just terrible. I've been annoyed enough by that
problem to spend some time digging into why that is--seems to be a bunch
of trivia around the multiple ways to collect time info on x86
systems--and after this CommitFest is over I was already hoping to dig
through my notes and start quantifying that more. So you can't really
prove the overhead of this approach is acceptable just by showing two
examples; we need to find one of the really terrible clocks and test
there to get a real feel for the worst-case.

I recall a patch similar to this one was submitted by Greg Stark some
time ago. It used the info for different reasons--to try and figure out
whether reads were cached or not--but I believe it withered rather than
being implemented mainly because it ran into the same fundamental
roadblocks here. My memory could be wrong here; there were also
concerns about what the data would be used for.

I've been thinking about a few ways to try and cope with this whole
class of timing problem:

-Document the underlying problem and known workarounds, provide a way to
test how bad the overhead is, and just throw our hands up and say
"sorry, you just can't instrument like this" if someone has a slow system.

-Have one of the PostgreSQL background processes keep track of a time
estimate on its own, only periodically pausing to sync against the real
time. Then most calls to gettimeofday() can use that value instead. I
was thinking of that idea for slightly longer running things though; I
doubt that can be made accurate enough to instrument buffer I/O.

And while I hate to kick off massive bike-shedding in your direction,
I'm also afraid this area--collecting stats about how long individual
operations take--will need a much wider ranging approach than just
looking at the buffer cache ones. If you step back and ask "what do
people expect here?", there's a pretty large number who really want
something like Oracle's v$session_wait and v$system_event interface for
finding the underlying source of slow things. There's enough demand for
that that EnterpriseDB has even done some work in this area too; what
I've been told about it suggests the code isn't a great fit for
contribution to community PostgreSQL though. Like I said, this area is
really messy and hard to get right.

Something more ambitious like the v$ stuff would also take care of what
you're doing here; I'm not sure that what you've done helps build it
though. Please don't take that personally. Part of one of my own
instrumentation patches recently was rejected out of hand for the same
reason, just not being general enough.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us

#3 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Greg Smith (#2)
Re: Patch: add timing of buffer I/O requests

On 28 November 2011, 8:54, Greg Smith wrote:

-Have one of the PostgreSQL background processes keep track of a time
estimate on its own, only periodically pausing to sync against the real
time. Then most calls to gettimeofday() can use that value instead. I
was thinking of that idea for slightly longer running things though; I
doubt that can be made accurate enough to instrument buffer I/O.

What about random sampling, i.e. "measure just 5% of the events" or
something like that? Sure, it's not exact but it significantly reduces the
overhead. And it might be a config parameter, so the user might decide how
precise results are needed, and even consider how fast the clocks are.

Something more ambitious like the v$ stuff would also take care of what
you're doing here; I'm not sure that what you've done helps build it
though. Please don't take that personally. Part of one of my own
instrumentation patches recently was rejected out of hand for the same
reason, just not being general enough.

Yes, that'd be significant improvement. The wait-event stuff is very
useful and changes the tuning significantly.

Tomas

#4 Robert Haas
robertmhaas@gmail.com
In reply to: Greg Smith (#2)
Re: Patch: add timing of buffer I/O requests

On Mon, Nov 28, 2011 at 2:54 AM, Greg Smith <greg@2ndquadrant.com> wrote:

The real problem with this whole area is that we know there are
systems floating around where the amount of time taken to grab timestamps
like this is just terrible.

Assuming the feature is off by default (and I can't imagine we'd
consider anything else), I don't see why that should be cause for
concern. If the instrumentation creates too much system load, then
don't use it: simple as that. A more interesting question is "how
much load does this feature create even when it's turned off?".

The other big problem for a patch of this sort is that it would bloat
the stats file. I think we really need to come up with a more
scalable alternative to the current system, but I haven't looked at the
current system in enough detail to have a clear feeling about what
such an alternative would look like.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#5 Greg Stark
stark@mit.edu
In reply to: Greg Smith (#2)
Re: Patch: add timing of buffer I/O requests

On Nov 28, 2011 8:55 AM, "Greg Smith" <greg@2ndquadrant.com> wrote:

On 11/27/2011 04:39 PM, Ants Aasma wrote:

On the AMD I saw about 3% performance drop with timing enabled. On the
Intel machine I couldn't measure any statistically significant change.

Oh no, it's party pooper time again. Sorry I have to be the one to do it
this round. The real problem with this whole area is that we know there
are systems floating around where the amount of time taken to grab
timestamps like this is just terrible.

I believe on most systems with modern Linux kernels gettimeofday and its
ilk will be a vsyscall and nearly as fast as a regular function call.

I recall a patch similar to this one was submitted by Greg Stark some
time ago. It used the info for different reasons--to try and figure out
whether reads were cached or not--but I believe it withered rather than
being implemented mainly because it ran into the same fundamental
roadblocks here. My memory could be wrong here, there were also concerns
about what the data would be used for.

I speculated about doing that but never did. I had an experimental patch
using mincore to do what you describe but it wasn't intended for production
code I think. The only real patch was to use getrusage which I still intend
to commit but it doesn't tell you the time spent in I/O -- though it does
tell you the sys time which should be similar.

#6 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Greg Stark (#5)
Re: Patch: add timing of buffer I/O requests

On 28 November 2011, 15:40, Greg Stark wrote:

On Nov 28, 2011 8:55 AM, "Greg Smith" <greg@2ndquadrant.com> wrote:

On 11/27/2011 04:39 PM, Ants Aasma wrote:

On the AMD I saw about 3% performance drop with timing enabled. On the
Intel machine I couldn't measure any statistically significant change.

Oh no, it's party pooper time again. Sorry I have to be the one to do it
this round. The real problem with this whole area is that we know there
are systems floating around where the amount of time taken to grab
timestamps like this is just terrible.

I believe on most systems with modern Linux kernels gettimeofday and its
ilk will be a vsyscall and nearly as fast as a regular function call.

AFAIK a vsyscall should be faster than a regular syscall. It does not need
to switch to kernel space at all; it "just" reads the data from a shared
page. The problem is that this is Linux-specific - for example FreeBSD
does not have vsyscall at all (it's actually one of the Linux-isms
mentioned here: http://wiki.freebsd.org/AvoidingLinuxisms).

There's also the vDSO, which (among other things) uses vsyscall if
available, or the best implementation available. So there are platforms
that do not provide vsyscall, and on those it'd be just as slow as a
regular syscall :(

I wouldn't expect a patch that works fine on Linux but not on other
platforms to be accepted, unless there's a compile-time configure switch
(--with-timings) that allows disabling it.

Another option would be to reimplement the vsyscall even on platforms
that don't provide it. The principle is actually quite simple - allocate
a shared memory region, store the current time there, and update it
whenever a clock interrupt happens. This is basically what Greg suggested
in one of the previous posts, where "regularly" means "on every
interrupt". Greg was worried about the precision, but this should be just
fine I guess. It's the precision you get on Linux, anyway ...

I recall a patch similar to this one was submitted by Greg Stark some
time ago. It used the info for different reasons--to try and figure out
whether reads were cached or not--but I believe it withered rather than
being implemented mainly because it ran into the same fundamental
roadblocks here. My memory could be wrong here, there were also concerns
about what the data would be used for.

The difficulty of distinguishing whether the reads were cached or not is
the price we pay for using the filesystem cache instead of managing our own.
Not sure if this can be solved just by measuring the latency - with
spinners it's quite easy, the differences are rather huge (and it's not
difficult to derive that even from pgbench log). But with SSDs, multiple
tablespaces on different storage, etc. it gets much harder.

Tomas

#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#4)
Re: Patch: add timing of buffer I/O requests

Robert Haas <robertmhaas@gmail.com> writes:

On Mon, Nov 28, 2011 at 2:54 AM, Greg Smith <greg@2ndquadrant.com> wrote:

The real problem with this whole area is that we know there are
systems floating around where the amount of time taken to grab timestamps
like this is just terrible.

Assuming the feature is off by default (and I can't imagine we'd
consider anything else), I don't see why that should be cause for
concern. If the instrumentation creates too much system load, then
don't use it: simple as that. A more interesting question is "how
much load does this feature create even when it's turned off?".

Right. I see that the code already has a switch to skip the
gettimeofday calls, so the objection is only problematic if the added
overhead is significant even with the switch off. I would worry mainly
about the added time/space to deal with the extra stats counters.

The other big problem for a patch of this sort is that it would bloat
the stats file.

Yes. Which begs the question of why we need to measure this per-table.
I would think per-tablespace would be sufficient.

regards, tom lane

#8 Martijn van Oosterhout
kleptog@svana.org
In reply to: Greg Smith (#2)
Re: Patch: add timing of buffer I/O requests

On Sun, Nov 27, 2011 at 11:54:38PM -0800, Greg Smith wrote:

On 11/27/2011 04:39 PM, Ants Aasma wrote:

On the AMD I saw about 3% performance drop with timing enabled. On the
Intel machine I couldn't measure any statistically significant change.

Oh no, it's party pooper time again. Sorry I have to be the one to
do it this round. The real problem with this whole area is that we
know there are systems floating around where the amount of time
taken to grab timestamps like this is just terrible. I've been
annoyed enough by that problem to spend some time digging into why
that is--seems to be a bunch of trivia around the multiple ways to
collect time info on x86 systems--and after this CommitFest is over

Something good to know: in Linux the file
/sys/devices/system/clocksource/clocksource0/current_clocksource
lists the current clock source, and
/sys/devices/system/clocksource/clocksource0/available_clocksource
lists the available clock sources. By writing a name into the first file
you can switch them. That way you may be able to quantify the effects on
a single machine.

Learned the hard way while tracking clock-skew on a multicore system.
The hpet may not be the fastest (that would be the cpu timer), but it's
the fastest (IME) that gives guaranteed monotonic time.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

He who writes carelessly confesses thereby at the very outset that he does
not attach much importance to his own thoughts.

-- Arthur Schopenhauer

#9 Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Tomas Vondra (#6)
Re: Patch: add timing of buffer I/O requests

"Tomas Vondra" <tv@fuzzy.cz> writes:

Another option would be to reimplement the vsyscall, even on platforms
that don't provide it. The principle is actually quite simple - allocate a
shared memory, store there a current time and update it whenever a clock
interrupt happens. This is basically what Greg suggested in one of the
previous posts, where "regularly" means "on every interrupt". Greg was
worried about the precision, but this should be just fine I guess. It's
the precision you get on Linux, anyway ...

That sounds good for other interesting things, which entails being able
to have timing information attached to the XID sequence. If we go this
way, how far are we from having a ticker in PostgreSQL?

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

#10 Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Tomas Vondra (#6)
Re: Patch: add timing of buffer I/O requests

On Nov 28, 2011, at 9:29 AM, Tomas Vondra wrote:

I recall a patch similar to this one was submitted by Greg Stark some
time ago. It used the info for different reasons--to try and figure out
whether reads were cached or not--but I believe it withered rather than
being implemented mainly because it ran into the same fundamental
roadblocks here. My memory could be wrong here, there were also concerns
about what the data would be used for.

The difficulty when distinguishing whether the reads were cached or not is
the price we pay for using filesystem cache instead of managing our own.
Not sure if this can be solved just by measuring the latency - with
spinners it's quite easy, the differences are rather huge (and it's not
difficult to derive that even from pgbench log). But with SSDs, multiple
tablespaces on different storage, etc. it gets much harder.

True, but every use case for this information I can think of ultimately only cares about how long it took to perform some kind of IO; it doesn't *really* care about whether it was cached. So in that context, we don't really care if SSDs are fast enough that they look like cache, because that means they're performing (essentially) the same as cache.
--
Jim C. Nasby, Database Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net

#11 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dimitri Fontaine (#9)
Re: Patch: add timing of buffer I/O requests

On 28.11.2011 22:32, Dimitri Fontaine wrote:

"Tomas Vondra" <tv@fuzzy.cz> writes:

Another option would be to reimplement the vsyscall, even on platforms
that don't provide it. The principle is actually quite simple - allocate a
shared memory, store there a current time and update it whenever a clock
interrupt happens. This is basically what Greg suggested in one of the
previous posts, where "regularly" means "on every interrupt". Greg was
worried about the precision, but this should be just fine I guess. It's
the precision you get on Linux, anyway ...

That sounds good for other interesting things, which entails being able
to have timing information attached to the XID sequence. If we go this
way, how far are we from having a ticker in PostgreSQL?

I'm not sure. On Linux/x86 this is already done, but my knowledge of
kernel development is rather limited, especially when it comes to other
OSes and platforms. E.g. I'm not sure why it's not available in FreeBSD
on x86, I guess it's rather "we don't want it" than "it's not possible."

In Linux sources, the most interesting pieces are probably these:

1) arch/x86/include/asm/vgtod.h - that's the shared memory structure

2) arch/x86/kernel/vsyscall_64.c - this is how the memory is read
(do_vgettimeofday)

3) arch/x86/kernel/vsyscall_64.c - this is how the memory is updated
(update_vsyscall)

4) kernel/time/timekeeping.c - do_settimeofday (calls update_vsyscall)

5) drivers/rtc/class.c (and other) RTC drivers call do_settimeofday

Tomas

#12 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jim Nasby (#10)
Re: Patch: add timing of buffer I/O requests

On 29.11.2011 02:14, Jim Nasby wrote:

On Nov 28, 2011, at 9:29 AM, Tomas Vondra wrote:

I recall a patch similar to this one was submitted by Greg Stark some
time ago. It used the info for different reasons--to try and
figure out whether reads were cached or not--but I believe it
withered rather than being implemented mainly because it ran into
the same fundamental roadblocks here. My memory could be wrong
here, there were also concerns about what the data would be used
for.

The difficulty when distinguishing whether the reads were cached or
not is the price we pay for using filesystem cache instead of
managing our own. Not sure if this can be solved just by measuring
the latency - with spinners it's quite easy, the differences are
rather huge (and it's not difficult to derive that even from
pgbench log). But with SSDs, multiple tablespaces on different
storage, etc. it gets much harder.

True, but every use case for this information I can think of
ultimately only cares about how long it took to perform some kind of
IO; it doesn't *really* care about whether it was cached. So in that
context, we don't really care if SSDs are fast enough that they look
like cache, because that means they're performing (essentially) the
same as cache.

Yup, that's right. The wait times are generally much more interesting
than the cached/not cached ratio.

Tomas

#13 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dimitri Fontaine (#9)
Re: Patch: add timing of buffer I/O requests

Dimitri Fontaine <dimitri@2ndQuadrant.fr> writes:

That sounds good for other interesting things, which entails being able
to have timing information attached to the XID sequence. If we go this
way, how far are we from having a ticker in PostgreSQL?

Those of us who are trying to get rid of idle-state process wakeups will
protest any such thing with vigor.

regards, tom lane

#14 Ants Aasma
ants.aasma@cybertec.at
In reply to: Tom Lane (#13)
Re: Patch: add timing of buffer I/O requests

Sorry for taking so long to respond, had a pretty busy day at work. Anyway..

On Mon, Nov 28, 2011 at 9:54 AM, Greg Smith <greg@2ndquadrant.com> wrote:

Oh no, it's party pooper time again.  Sorry I have to be the one to do it
this round.  The real problem with this whole area is that we know there are
systems floating around where the amount of time taken to grab timestamps
like this is just terrible.  I've been annoyed enough by that problem to
spend some time digging into why that is--seems to be a bunch of trivia
around the multiple ways to collect time info on x86 systems--and after this
CommitFest is over I was already hoping to dig through my notes and start
quantifying that more.  So you can't really prove the overhead of this
approach is acceptable just by showing two examples; we need to find one of
the really terrible clocks and test there to get a real feel for the
worst-case.

Sure, I know that the timing calls might be awfully slow. That's why I turned
it off by default. I saw that track_functions was already using this, so I
figured it was ok to have it potentially run very slowly.

-Document the underlying problem and known workarounds, provide a way to
test how bad the overhead is, and just throw our hands up and say "sorry,
you just can't instrument like this" if someone has a slow system.

Some documentation about potential problems would definitely be good.
Same goes for a test tool. ISTM that fast accurate timing is just not
possible on all supported platforms. That doesn't seem like a good enough
justification to refuse implementing something useful for the majority of
platforms that do support it, as long as it doesn't cause regressions for
those that don't, or add significant code complexity.

-Have one of the PostgreSQL background processes keep track of a time
estimate on its own, only periodically pausing to sync against the real
time.  Then most calls to gettimeofday() can use that value instead.  I was
thinking of that idea for slightly longer running things though; I doubt
that can be made accurate enough to instrument buffer I/O.

This would limit it to cases where hundreds of milliseconds of jitter or
more don't matter all that much.

And while I hate to kick off massive bike-shedding in your direction, I'm
also afraid this area--collecting stats about how long individual operations
take--will need a much wider ranging approach than just looking at the
buffer cache ones.  If you step back and ask "what do people expect here?",
there's a pretty large number who really want something like Oracle's
v$session_wait  and v$system_event interface for finding the underlying
source of slow things.  There's enough demand for that that EnterpriseDB has
even done some work in this area too; what I've been told about it suggests
the code isn't a great fit for contribution to community PostgreSQL though.
 Like I said, this area is really messy and hard to get right.

Yeah, something like that should probably be something to strive for. I'll
ponder a bit more about resource and latency tracking in general. Maybe the
question here should be about the cost/benefit ratio of having some utility
now vs. maintaining/deprecating the user-visible interface when a more
general framework turns up.

Something more ambitious like the v$ stuff would also take care of what
you're doing here; I'm not sure that what you've done helps build it though.
 Please don't take that personally.  Part of one of my own instrumentation
patches recently was rejected out of hand for the same reason, just not
being general enough.

No problem, I understand that half-way solutions can be more trouble than
they're worth. I actually built this to help with performance testing an
application and thought it would be an interesting experience to try to
give the community back something.

On Mon, Nov 28, 2011 at 4:40 PM, Greg Stark <stark@mit.edu> wrote:

I believe on most systems with modern Linux kernels gettimeofday and its
ilk will be a vsyscall and nearly as fast as a regular function call.

clock_gettime() has been implemented in the vDSO since 2.6.23. gettimeofday()
has been user context callable since before git shows any history (2.6.12).

On Mon, Nov 28, 2011 at 5:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

The other big problem for a patch of this sort is that it would bloat
the stats file.

Yes.  Which begs the question of why we need to measure this per-table.
I would think per-tablespace would be sufficient.

Yeah, I figured that this is something that should be discussed. I
implemented per-table collection because I thought it might be useful for
tools to pick up and show a quick overview of which tables are causing
the most I/O overhead for queries.

On Mon, Nov 28, 2011 at 8:10 PM, Martijn van Oosterhout
<kleptog@svana.org> wrote:

Something good to know: in Linux the file
/sys/devices/system/clocksource/clocksource0/current_clocksource
lists the current clock source, and
/sys/devices/system/clocksource/clocksource0/available_clocksource
lists the available clock sources. By writing a name into the first file
you can switch them. That way you may be able to quantify the effects on
a single machine.

Learned the hard way while tracking clock-skew on a multicore system.
The hpet may not be the fastest (that would be the cpu timer), but it's
the fastest (IME) that gives guaranteed monotonic time.

The Linux kernel seems to go pretty far out of its way to ensure that the
TSC (CPU timestamp counter) based clocksource returns monotonic values,
including actually testing that it does [1]. If the hardware doesn't
support stable and consistent TSC values, the TSC isn't used as a clock
source.

Of course trying to keep it monotonic doesn't mean succeeding. I thought
about inserting a sanity check. But as the current instrumentation doesn't
use one and it would catch errors only in one direction, biasing the long
term average, I decided against it.

Because this is non-essential instrumentation, I don't see an issue with
it returning bogus information when the system clock is broken. At least
it seems that no one has complained about the same issue in
track_functions. The only complaint I found is that it's off by default.

On Mon, Nov 28, 2011 at 5:29 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

Another option would be to reimplement the vsyscall, even on platforms
that don't provide it. The principle is actually quite simple - allocate a
shared memory, store there a current time and update it whenever a clock
interrupt happens. This is basically what Greg suggested in one of the
previous posts, where "regularly" means "on every interrupt". Greg was
worried about the precision, but this should be just fine I guess. It's
the precision you get on Linux, anyway ...

On modern platforms you really do get microsecond precision.
Even more, if you use clock_gettime(CLOCK_MONOTONIC), you get nanosecond
precision and avoid issues with someone changing the system time while
you're timing. This precision does require OS and hardware cooperation,
because of CPU offsets, TSC's changing frequencies, stopping, etc.

--
Ants Aasma

[1]: https://github.com/torvalds/linux/blob/master/arch/x86/kernel/tsc_sync.c#L143

#15 Greg Smith
gsmith@gregsmith.com
In reply to: Robert Haas (#4)
Re: Patch: add timing of buffer I/O requests

On 11/28/2011 05:51 AM, Robert Haas wrote:

On Mon, Nov 28, 2011 at 2:54 AM, Greg Smith<greg@2ndquadrant.com> wrote:

The real problem with this whole area is that we know there are
systems floating around where the amount of time taken to grab timestamps
like this is just terrible.

Assuming the feature is off by default (and I can't imagine we'd
consider anything else), I don't see why that should be cause for
concern. If the instrumentation creates too much system load, then
don't use it: simple as that.

It's not quite that simple though. Releasing a performance measurement
feature that itself can perform terribly under undocumented conditions
has a wider downside than that.

Consider that people aren't going to turn it on until they are already
overloaded. If that has the potential to completely tank performance,
we better make sure that area is at least explored usefully first; the
minimum diligence should be to document that fact and make suggestions
for avoiding or testing it.

Instrumentation that can itself become a performance problem is an
advocacy problem waiting to happen. As I write this I'm picturing such
an encounter resulting in an angry blog post, about how this proves
PostgreSQL isn't usable for serious systems because someone sees massive
overhead turning this on. Right now the primary exposure to this class
of issue is EXPLAIN ANALYZE. When I was working on my book, I went out
of my way to find a worst case for that [1], and that turned out to be a
query that went from 7.994ms to 69.837ms when instrumented. I've been
meaning to investigate what was up there since finding that one. The
fact that we already have one such problem exposed worries me; I'd
really prefer not to have two.

[1]: (Dell Store 2 schema, query was "SELECT count(*) FROM customers;")

#16 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Smith (#15)
Re: Patch: add timing of buffer I/O requests

Greg Smith <greg@2ndQuadrant.com> writes:

On 11/28/2011 05:51 AM, Robert Haas wrote:

Assuming the feature is off by default (and I can't imagine we'd
consider anything else), I don't see why that should be cause for
concern. If the instrumentation creates too much system load, then
don't use it: simple as that.

It's not quite that simple though. Releasing a performance measurement
feature that itself can perform terribly under undocumented conditions
has a wider downside than that.

Yeah, that's a good point, and the machines on which this would suck
are exactly the ones where EXPLAIN ANALYZE creates very large overhead.
We don't seem to see a lot of complaints about that anymore, but we do
still see some ... and yes, it's documented that EXPLAIN ANALYZE can add
significant overhead, but that doesn't stop the questions.

Instrumentation that can itself become a performance problem is an
advocacy problem waiting to happen. As I write this I'm picturing such
an encounter resulting in an angry blog post, about how this proves
PostgreSQL isn't usable for serious systems because someone sees massive
overhead turning this on.

Of course, the rejoinder could be that if you see that, you're not
testing on serious hardware. But still, I take your point.

Right now the primary exposure to this class
of issue is EXPLAIN ANALYZE. When I was working on my book, I went out
of my way to find a worst case for that[1],
[1] (Dell Store 2 schema, query was "SELECT count(*) FROM customers;")

That's pretty meaningless without saying what sort of clock hardware
was on the machine...

regards, tom lane

#17Ants Aasma
ants.aasma@cybertec.at
In reply to: Ants Aasma (#1)
Re: Patch: add timing of buffer I/O requests

Here's the second version of the I/O timings patch. Changes from the
previous version:

* Rebased against master.
* Added the missing pg_stat_statements upgrade script that I
accidentally left out from the previous version.
* Added a tool to test timing overhead under contrib/pg_test_timing

I hope that having a tool to measure the overhead and check the sanity
of clock sources is enough to answer the worries about the potential
performance hit. We could also check that the clock source is fast
enough on start-up/when the guc is changed, but that seems a bit too
much and leaves open the question about what is fast enough.

About issues with stats file bloat - if it really is a blocker, I can
easily rip out the per-table or even per-database stats fields. The
patch is plenty useful without them. It seemed like a useful tool for
overworked DBAs with limited amount of SSD space available to easily
figure out which tables and indexes would benefit most from fast
storage.

--
Ants Aasma

Attachments:

io-stats.v2.patch (text/x-patch, +571 -72)
#18Greg Smith
gsmith@gregsmith.com
In reply to: Ants Aasma (#17)
Re: Patch: add timing of buffer I/O requests

On 01/15/2012 05:14 PM, Ants Aasma wrote:

I hope that having a tool to measure the overhead and check the sanity
of clock sources is enough to answer the worries about the potential
performance hit. We could also check that the clock source is fast
enough on start-up/when the guc is changed, but that seems a bit too
much and leaves open the question about what is fast enough.

I've been thinking along those same lines--check at startup, provide
some guidance on the general range of what's considered fast vs. slow in
both the code and documentation. What I'm hoping to do here is split
your patch in half and work on the pg_test_timing contrib utility
first. That part answers some overdue questions about when EXPLAIN
ANALYZE can be expected to add a lot of overhead, which means it's
useful all on its own. I'd like to see that utility go into 9.2, along
with a new documentation section covering that topic. I'll write the
new documentation bit.

As far as the buffer timing goes, there is a lot of low-level timing
information I'd like to see the database collect. To pick a second
example with very similar mechanics, I'd like to know which queries
spend a lot of their time waiting on locks. The subset of time a
statement spends waiting just for commit related things is a third. The
industry standard term for these is wait events, as seen in Oracle,
MySQL, MS SQL Server, etc. That's so standard I don't see an
intellectual property issue with PostgreSQL using the same term. Talk
with a random person who is converting from Oracle to PostgreSQL and ask
them about their performance concerns; at least 3/4 of those
conversations mention being nervous about not having wait event data.

Right now, I feel the biggest hurdle to performance tuning PostgreSQL is
not having good enough built-in query log analysis tools. If the
pg_stat_statements normalization upgrade in the CF queue is committed,
that's enough to make me bump that to "solved well enough". After
clearing that hurdle, figuring out how to log, analyze, and manage
storage of wait events is the next biggest missing piece. One of my top
goals for 9.3 was to make sure that happened.

I don't think the long-term answer for how to manage wait event data is
to collect it as part of pg_stat_statements though. But I don't have a
good alternate proposal, while you've submitted a patch that actually
does something useful right now. I'm going to think some more about how
to reconcile all that. There is an intermediate point to consider as
well, which is just committing something that adjusts the core code to
make the buffer wait event data available. pg_stat_statements is easy
enough to continue work on outside of core. I could see a path where
that happens in parallel with adding a better core wait event
infrastructure, just to get the initial buffer wait info into people's
hands earlier.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

#19Greg Smith
gsmith@gregsmith.com
In reply to: Greg Smith (#18)
Re: Patch: add timing of buffer I/O requests

Attached is the pg_test_timing utility portion of this submission,
broken out into its own patch. It's a contrib module modeled on
pg_test_fsync.

The documentation is still a bit rough; I'm not done with that yet. I
have included an example of good timing results, switching to a bad
clock source, and the resulting bad results. Code review found some
formatting nitpicks that I've already fixed: non-standard brace
locations and missing spaces in expressions were the main two.

This is now referenced by the existing cryptic documentation comment
around EXPLAIN ANALYZE, which says that overhead can be high because
gettimeofday is slow on some systems. Since this utility measures that
directly, I think it's a clear win to include it just for that purpose.
The fact that there are more places coming where timing overhead matters
is also true. But this existing one is already bad enough to justify
shipping something to help measure/manage it in my mind.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

Attachments:

pg_test_timing-v2.patch (text/x-patch, +297 -1)
#20Greg Smith
gsmith@gregsmith.com
In reply to: Ants Aasma (#1)
Re: Patch: add timing of buffer I/O requests

Attached are updated versions of this feature without the pg_test_timing
tool part, since I broke that out into another discussion thread. I've
split the part that updates pg_stat_statistics out from the main feature
too; a separate patch is attached here (but I'm not reviewing that yet).
Lots of bitrot since this was submitted, and yes I noticed that I've
almost recreated earlier versions of this patch--by splitting off the
parts that were developed later.

Earlier discussion of this got sidetracked on a few things, partly my
fault. It's worth taking a look at what this provides before judging it
too much. It can demo well.

The stated purpose is helping figure out what relations are gobbling up
the most access time, presumably to optimize them and/or the storage
they are on. "What do I put onto SSD" is surely a popular request
nowadays. To check suitability for that, I decided to run the standard
pgbench test and see what it reported as consuming the most time.
Any answer other than "pgbench_accounts and to a lesser extent its
index" is a failing grade. I started with a clean database and OS cache
so I'd get real read timings:

$ psql -d pgbench -x -c "select
relname,heap_blks_read,heap_blks_hit,heap_blks_time,
idx_blks_read ,idx_blks_hit,idx_blks_time
from pg_statio_user_tables where idx_blks_read > 0
order by heap_blks_time desc"

relname | pgbench_accounts
heap_blks_read | 7879
heap_blks_hit | 43837
heap_blks_time | 151770
idx_blks_read | 7503
idx_blks_hit | 60484
idx_blks_time | 70968

relname | pgbench_tellers
heap_blks_read | 19
heap_blks_hit | 15856
heap_blks_time | 105
idx_blks_read | 11
idx_blks_hit | 15745
idx_blks_time | 62

relname | pgbench_branches
heap_blks_read | 11
heap_blks_hit | 32333
heap_blks_time | 77
idx_blks_read | 2
idx_blks_hit | 0
idx_blks_time | 9

Now, the first critical question to ask is "what additional information
is this providing above the existing counters?" After all, it's
possible to tell pgbench_accounts is the hotspot just from comparing
heap_blks_read, right? To really be useful, this would need to make it
obvious that reads from pgbench_accounts are slower than the other two,
because it's bigger and requires more seeking around to populate. That
should show up if we compute time per read numbers:

$ psql -d pgbench -x -c "select relname,
1.0 * heap_blks_time / heap_blks_read as time_per_read,
1.0 * idx_blks_time / idx_blks_read as time_per_idx_read
from pg_statio_user_tables where idx_blks_read > 0
order by heap_blks_time desc"

relname | pgbench_accounts
time_per_read | 19.2625967762406397
time_per_idx_read | 9.4586165533786485

relname | pgbench_tellers
time_per_read | 5.5263157894736842
time_per_idx_read | 5.6363636363636364

relname | pgbench_branches
time_per_read | 7.0000000000000000
time_per_idx_read | 4.5000000000000000

This run looks useful at providing the data wished for--that read times
are slower per capita from the accounts table. The first time I tried
this I got a bizarrely high number for pgbench_branches.heap_blks_time;
I'm not sure how reliable this is yet. One problem that might be easy
to fix is that the write timing info doesn't show in any of these system
views, only in EXPLAIN and statement level ones.

I still think a full wait timing interface is the right long-term
direction here. It's hard to reject this idea when it seems to be
working right now though, while more comprehensive wait storage is still
at least a release off. Opinions welcome, I'm still juggling this
around now that I have it working again.

Some implementation notes. This currently fails regression test
create_function_3; I haven't looked into why yet. I've confirmed that on
a system where timing is cheap, this feature is cheap too. On something
touching real data, not just a synthetic test moving memory around, Ants
couldn't measure any overhead on a server similar to mine; I can't
either. Yes, this is going to gobble up more room for statistics.

The track_iotiming GUC seems to work as expected. Off by default, can
turn it on in a session and see that session's work get timed, and it
toggles on a config reload. That's everything needed to turn it on only
selectively; the main penalty you pay all the time is the stats bloat.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

Attachments:

io-stats.v3.patch (text/x-patch, +242 -24)
io-stats-statment.v3.patch (text/x-patch, +105 -48)
#21Ants Aasma
ants.aasma@cybertec.at
In reply to: Greg Smith (#20)
#22Ants Aasma
ants.aasma@cybertec.at
In reply to: Ants Aasma (#21)
#23Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Ants Aasma (#21)
#24Robert Haas
robertmhaas@gmail.com
In reply to: Ants Aasma (#22)
#25Ants Aasma
ants.aasma@cybertec.at
In reply to: Robert Haas (#24)
#26Robert Haas
robertmhaas@gmail.com
In reply to: Ants Aasma (#25)
#27Ants Aasma
ants.aasma@cybertec.at
In reply to: Robert Haas (#26)
#28Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#26)
#29Robert Haas
robertmhaas@gmail.com
In reply to: Ants Aasma (#27)
#30Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#26)
#31Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#30)
#32Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#29)
#33Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#32)
#34Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#33)
#36Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#34)
#37Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#36)
#38Magnus Hagander
magnus@hagander.net
In reply to: Robert Haas (#36)
#39Tom Lane
tgl@sss.pgh.pa.us
In reply to: Magnus Hagander (#38)
#40Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#39)
#41Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#37)
#42Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#41)
#43Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#42)
#44Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#43)
#45Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#44)
In reply to: Robert Haas (#45)
#47Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#45)
#48Greg Smith
gsmith@gregsmith.com
In reply to: Tom Lane (#39)
#49Peter Geoghegan
In reply to: Greg Smith (#48)
#50Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#49)
#51Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#49)
#52Peter Geoghegan
In reply to: Robert Haas (#51)
#53Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#52)
#54Peter Geoghegan
In reply to: Tom Lane (#53)
#55Kenneth Marshall
In reply to: Peter Geoghegan (#54)
#56Robert Haas
robertmhaas@gmail.com
In reply to: Kenneth Marshall (#55)
#57Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#56)
#58Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#57)
In reply to: Robert Haas (#56)
#60Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Greg Smith (#48)
#61Peter Eisentraut
peter_e@gmx.net
In reply to: Robert Haas (#33)
#62Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim Nasby (#60)
#63Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#56)
#64Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#63)
In reply to: Tom Lane (#64)
#66Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#64)
#67Greg Smith
gsmith@gregsmith.com
In reply to: Tom Lane (#62)
#68Robert Haas
robertmhaas@gmail.com
In reply to: Greg Smith (#67)
#69Peter Geoghegan
In reply to: Tom Lane (#47)
In reply to: Bruce Momjian (#63)
#71Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#69)
#72Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#71)
#73Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#72)
#74Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#73)
#75Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#74)
#76Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#75)
#77Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#76)
#78Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#74)
#79Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#78)
#80Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#78)
#81Peter Eisentraut
peter_e@gmx.net
In reply to: Robert Haas (#73)
#82Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#81)
#83Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#82)
#84Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#77)
#85Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#84)
#86Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#84)
#87Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#86)
#88Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#86)
#89Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#88)