Showing I/O timings spent reading/writing temp buffers in EXPLAIN

Started by Masahiko Sawada over 4 years ago · 25 messages
#1 Masahiko Sawada
sawada.mshk@gmail.com
1 attachment(s)

Hi all,

It's reported on pgsql-bugs[1] that the I/O timings in EXPLAIN don't show
the time spent on temp file I/O. I think it's not a bug but could be an item
for PG15. As mentioned on that thread, this would be useful for users in
cases where temp buffer I/O takes up most of the time. So I've written a
patch for that. Please note that the patch only adds temp buffer I/O timing
to the EXPLAIN output; other possibly related changes such as
pg_stat_statements improvements are not included yet.

Before (w/o patch):
postgres(1:14101)=# explain (analyze, buffers) select count(*) from generate_series(1,100000);
                                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1250.00..1250.01 rows=1 width=8) (actual time=59.025..59.026 rows=1 loops=1)
   Buffers: temp read=171 written=171
   ->  Function Scan on generate_series  (cost=0.00..1000.00 rows=100000 width=0) (actual time=21.695..45.524 rows=100000 loops=1)
         Buffers: temp read=171 written=171
 Planning Time: 0.041 ms
 Execution Time: 70.867 ms
(6 rows)

After (w/ patch):
postgres(1:28754)=# explain (analyze, buffers) select count(*) from generate_series(1,100000);
                                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1250.00..1250.01 rows=1 width=8) (actual time=56.189..56.190 rows=1 loops=1)
   Buffers: temp read=171 written=171
   I/O Timings: temp read=0.487 write=2.073
   ->  Function Scan on generate_series  (cost=0.00..1000.00 rows=100000 width=0) (actual time=21.072..42.886 rows=100000 loops=1)
         Buffers: temp read=171 written=171
         I/O Timings: temp read=0.487 write=2.073
 Planning Time: 0.041 ms
 Execution Time: 59.928 ms
(8 rows)

Feedback is very welcome.

Regards,

[1]: /messages/by-id/20210709084355.GA6251@depesz.com

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

Attachments:

v1-0001-Track-I-O-timing-for-temp-buffers.patch (application/x-patch)
From 5f1680e6326358d2f272dd39067c65d819c67164 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 18 Aug 2021 14:55:17 +0900
Subject: [PATCH v1] Track I/O timing for temp buffers

Previously, the output of EXPLAIN with BUFFERS option shows only I/O
timing spent reading and writing shared and local buffers. This
commit shows I/O timing for temp buffers in the output of EXPLAIN. It
is helpful for users in a case where it is the temp buffers I/O that
is using most of the time.

Same as shared/local buffer I/O timings, temp buffers I/O timing is
shown only when track_io_timing is enabled.
---
 doc/src/sgml/ref/explain.sgml      |  8 +++----
 src/backend/commands/explain.c     | 35 +++++++++++++++++++++++-------
 src/backend/executor/instrument.c  |  6 +++++
 src/backend/storage/file/buffile.c | 29 ++++++++++++++++++++++++-
 src/include/executor/instrument.h  |  2 ++
 5 files changed, 67 insertions(+), 13 deletions(-)

diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index 4d758fb237..70c65c4ba1 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -174,10 +174,10 @@ ROLLBACK;
       Include information on buffer usage. Specifically, include the number of
       shared blocks hit, read, dirtied, and written, the number of local blocks
       hit, read, dirtied, and written, the number of temp blocks read and
-      written, and the time spent reading and writing data file blocks
-      (in milliseconds) if <xref linkend="guc-track-io-timing"/> is enabled.
-      A <emphasis>hit</emphasis> means that a read was avoided because the block was
-      found already in cache when needed.
+      written, and the time spent reading and writing data file blocks and
+      temp file blocks (in milliseconds) if <xref linkend="guc-track-io-timing"/>
+      is enabled.  A <emphasis>hit</emphasis> means that a read was avoided because
+      the block was found already in cache when needed.
       Shared blocks contain data from regular tables and indexes;
       local blocks contain data from temporary tables and indexes;
       while temp blocks contain short-term working data used in sorts, hashes,
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 10644dfac4..50764519d0 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3501,8 +3501,11 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
 								usage->temp_blks_written > 0);
 		bool		has_timing = (!INSTR_TIME_IS_ZERO(usage->blk_read_time) ||
 								  !INSTR_TIME_IS_ZERO(usage->blk_write_time));
+		bool		has_temp_timing = (!INSTR_TIME_IS_ZERO(usage->temp_blk_read_time) ||
+									   !INSTR_TIME_IS_ZERO(usage->temp_blk_write_time));
 		bool		show_planning = (planning && (has_shared ||
-												  has_local || has_temp || has_timing));
+												  has_local || has_temp || has_timing ||
+												  has_temp_timing));
 
 		if (show_planning)
 		{
@@ -3567,16 +3570,32 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
 		}
 
 		/* As above, show only positive counter values. */
-		if (has_timing)
+		if (has_timing || has_temp_timing)
 		{
 			ExplainIndentText(es);
 			appendStringInfoString(es->str, "I/O Timings:");
-			if (!INSTR_TIME_IS_ZERO(usage->blk_read_time))
-				appendStringInfo(es->str, " read=%0.3f",
-								 INSTR_TIME_GET_MILLISEC(usage->blk_read_time));
-			if (!INSTR_TIME_IS_ZERO(usage->blk_write_time))
-				appendStringInfo(es->str, " write=%0.3f",
-								 INSTR_TIME_GET_MILLISEC(usage->blk_write_time));
+
+			if (has_timing)
+			{
+				if (!INSTR_TIME_IS_ZERO(usage->blk_read_time))
+					appendStringInfo(es->str, " read=%0.3f",
+									 INSTR_TIME_GET_MILLISEC(usage->blk_read_time));
+				if (!INSTR_TIME_IS_ZERO(usage->blk_write_time))
+					appendStringInfo(es->str, " write=%0.3f",
+									 INSTR_TIME_GET_MILLISEC(usage->blk_write_time));
+				if (has_temp_timing)
+					appendStringInfoChar(es->str, ',');
+			}
+			if (has_temp_timing)
+			{
+				appendStringInfoString(es->str, " temp");
+				if (!INSTR_TIME_IS_ZERO(usage->temp_blk_read_time))
+					appendStringInfo(es->str, " read=%0.3f",
+									 INSTR_TIME_GET_MILLISEC(usage->temp_blk_read_time));
+				if (!INSTR_TIME_IS_ZERO(usage->temp_blk_write_time))
+					appendStringInfo(es->str, " write=%0.3f",
+									 INSTR_TIME_GET_MILLISEC(usage->temp_blk_write_time));
+			}
 			appendStringInfoChar(es->str, '\n');
 		}
 
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 2b106d8473..5c3cc4b1cc 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -237,6 +237,8 @@ BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
 	dst->temp_blks_written += add->temp_blks_written;
 	INSTR_TIME_ADD(dst->blk_read_time, add->blk_read_time);
 	INSTR_TIME_ADD(dst->blk_write_time, add->blk_write_time);
+	INSTR_TIME_ADD(dst->temp_blk_read_time, add->temp_blk_read_time);
+	INSTR_TIME_ADD(dst->temp_blk_write_time, add->temp_blk_write_time);
 }
 
 /* dst += add - sub */
@@ -259,6 +261,10 @@ BufferUsageAccumDiff(BufferUsage *dst,
 						  add->blk_read_time, sub->blk_read_time);
 	INSTR_TIME_ACCUM_DIFF(dst->blk_write_time,
 						  add->blk_write_time, sub->blk_write_time);
+	INSTR_TIME_ACCUM_DIFF(dst->temp_blk_read_time,
+						  add->temp_blk_read_time, sub->temp_blk_read_time);
+	INSTR_TIME_ACCUM_DIFF(dst->temp_blk_write_time,
+						  add->temp_blk_write_time, sub->temp_blk_write_time);
 }
 
 /* helper functions for WAL usage accumulation */
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index a4be5fe513..0188fbc1b6 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -417,6 +417,8 @@ static void
 BufFileLoadBuffer(BufFile *file)
 {
 	File		thisfile;
+	instr_time	io_start;
+	instr_time	io_time;
 
 	/*
 	 * Advance to next component file if necessary and possible.
@@ -428,10 +430,14 @@ BufFileLoadBuffer(BufFile *file)
 		file->curOffset = 0L;
 	}
 
+	thisfile = file->files[file->curFile];
+
+	if (track_io_timing)
+		INSTR_TIME_SET_CURRENT(io_start);
+
 	/*
 	 * Read whatever we can get, up to a full bufferload.
 	 */
-	thisfile = file->files[file->curFile];
 	file->nbytes = FileRead(thisfile,
 							file->buffer.data,
 							sizeof(file->buffer),
@@ -446,6 +452,13 @@ BufFileLoadBuffer(BufFile *file)
 						FilePathName(thisfile))));
 	}
 
+	if (track_io_timing)
+	{
+		INSTR_TIME_SET_CURRENT(io_time);
+		INSTR_TIME_SUBTRACT(io_time, io_start);
+		INSTR_TIME_ADD(pgBufferUsage.temp_blk_read_time, io_time);
+	}
+
 	/* we choose not to advance curOffset here */
 
 	if (file->nbytes > 0)
@@ -473,6 +486,8 @@ BufFileDumpBuffer(BufFile *file)
 	while (wpos < file->nbytes)
 	{
 		off_t		availbytes;
+		instr_time	io_start;
+		instr_time	io_time;
 
 		/*
 		 * Advance to next component file if necessary and possible.
@@ -495,6 +510,10 @@ BufFileDumpBuffer(BufFile *file)
 			bytestowrite = (int) availbytes;
 
 		thisfile = file->files[file->curFile];
+
+		if (track_io_timing)
+			INSTR_TIME_SET_CURRENT(io_start);
+
 		bytestowrite = FileWrite(thisfile,
 								 file->buffer.data + wpos,
 								 bytestowrite,
@@ -505,6 +524,14 @@ BufFileDumpBuffer(BufFile *file)
 					(errcode_for_file_access(),
 					 errmsg("could not write to file \"%s\": %m",
 							FilePathName(thisfile))));
+
+		if (track_io_timing)
+		{
+			INSTR_TIME_SET_CURRENT(io_time);
+			INSTR_TIME_SUBTRACT(io_time, io_start);
+			INSTR_TIME_ADD(pgBufferUsage.temp_blk_write_time, io_time);
+		}
+
 		file->curOffset += bytestowrite;
 		wpos += bytestowrite;
 
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 2f9905b7c8..6f3cbcabef 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -35,6 +35,8 @@ typedef struct BufferUsage
 	int64		temp_blks_written;	/* # of temp blocks written */
 	instr_time	blk_read_time;	/* time spent reading */
 	instr_time	blk_write_time; /* time spent writing */
+	instr_time	temp_blk_read_time;	/* time spent reading tmp blocks */
+	instr_time	temp_blk_write_time; /* time spent writing tmp blocks */
 } BufferUsage;
 
 /*
-- 
2.24.3 (Apple Git-128)

#2 Ranier Vilela
ranier.vf@gmail.com
In reply to: Masahiko Sawada (#1)
Re: Showing I/O timings spent reading/writing temp buffers in EXPLAIN

On Thu, Aug 19, 2021 at 09:21, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Hi all ,

It's reported on pgsql-bugs[1] that I/O timings in EXPLAIN don't show
the one for temp files. I think it's not a bug but could be an item
for PG15. As mentioned on that thread, this would be useful for users
in a case where temp buffers I/O used most of the time. So I've
written the patch for that. Please note that the patch includes only
to show temp buffer I/O timing to EXPLAIN but not other possibly
related changes such as pg_stat_statement improvements yet.

Before (w/o patch):
postgres(1:14101)=# explain (analyze, buffers) select count(*) from
generate_series(1,100000);
QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=1250.00..1250.01 rows=1 width=8) (actual
time=59.025..59.026 rows=1 loops=1)
Buffers: temp read=171 written=171
-> Function Scan on generate_series (cost=0.00..1000.00
rows=100000 width=0) (actual time=21.695..45.524 rows=100000 loops=1)
Buffers: temp read=171 written=171
Planning Time: 0.041 ms
Execution Time: 70.867 ms
(6 rows)

After (w/ patch):
postgres(1:28754)=# explain (analyze, buffers) select count(*) from
generate_series(1,100000);
QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=1250.00..1250.01 rows=1 width=8) (actual
time=56.189..56.190 rows=1 loops=1)
Buffers: temp read=171 written=171
I/O Timings: temp read=0.487 write=2.073
-> Function Scan on generate_series (cost=0.00..1000.00
rows=100000 width=0) (actual time=21.072..42.886 rows=100000 loops=1)
Buffers: temp read=171 written=171
I/O Timings: temp read=0.487 write=2.073
Planning Time: 0.041 ms
Execution Time: 59.928 ms
(8 rows)

Feedback is very welcome.

The presentation seems a little confusing; wouldn't it be better?

I/O Timings: shared/local read=xxxx write=xxx temp read=0.487 write=2.073

I think you can remove these lines:
+ if (has_temp_timing)
+ appendStringInfoChar(es->str, ',');

regards,
Ranier Vilela

#3 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Ranier Vilela (#2)
Re: Showing I/O timings spent reading/writing temp buffers in EXPLAIN

On Thu, Aug 19, 2021 at 10:52 PM Ranier Vilela <ranier.vf@gmail.com> wrote:

On Thu, Aug 19, 2021 at 09:21, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Hi all ,

It's reported on pgsql-bugs[1] that I/O timings in EXPLAIN don't show
the one for temp files. I think it's not a bug but could be an item
for PG15. As mentioned on that thread, this would be useful for users
in a case where temp buffers I/O used most of the time. So I've
written the patch for that. Please note that the patch includes only
to show temp buffer I/O timing to EXPLAIN but not other possibly
related changes such as pg_stat_statement improvements yet.

Before (w/o patch):
postgres(1:14101)=# explain (analyze, buffers) select count(*) from
generate_series(1,100000);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=1250.00..1250.01 rows=1 width=8) (actual
time=59.025..59.026 rows=1 loops=1)
Buffers: temp read=171 written=171
-> Function Scan on generate_series (cost=0.00..1000.00
rows=100000 width=0) (actual time=21.695..45.524 rows=100000 loops=1)
Buffers: temp read=171 written=171
Planning Time: 0.041 ms
Execution Time: 70.867 ms
(6 rows)

After (w/ patch):
postgres(1:28754)=# explain (analyze, buffers) select count(*) from
generate_series(1,100000);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=1250.00..1250.01 rows=1 width=8) (actual
time=56.189..56.190 rows=1 loops=1)
Buffers: temp read=171 written=171
I/O Timings: temp read=0.487 write=2.073
-> Function Scan on generate_series (cost=0.00..1000.00
rows=100000 width=0) (actual time=21.072..42.886 rows=100000 loops=1)
Buffers: temp read=171 written=171
I/O Timings: temp read=0.487 write=2.073
Planning Time: 0.041 ms
Execution Time: 59.928 ms
(8 rows)

Feedback is very welcome.

Thank you for the comments!

The presentation seems a little confusing, wouldn't it be better?

I/O Timings: shared/local read=xxxx write=xxx temp read=0.487 write=2.073

Yeah, it looks better to add "shared/local".

I think can remove this lines:
+ if (has_temp_timing)
+ appendStringInfoChar(es->str, ',');

But I think that it's consistent with the buffer statistics in the EXPLAIN
output. For example, the "Buffers" line in the EXPLAIN output can look
like:

Buffers: shared hit=1398, temp read=526 written=526

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#4 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiko Sawada (#3)
1 attachment(s)
Re: Showing I/O timings spent reading/writing temp buffers in EXPLAIN

On Mon, Aug 23, 2021 at 10:46 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Aug 19, 2021 at 10:52 PM Ranier Vilela <ranier.vf@gmail.com> wrote:

On Thu, Aug 19, 2021 at 09:21, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

<snip>

Thank you for the comments!

The presentation seems a little confusing, wouldn't it be better?

I/O Timings: shared/local read=xxxx write=xxx temp read=0.487 write=2.073

Yeah, it looks better to add "shared/local".

I've attached the updated patch that incorporates the above comment.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

Attachments:

v2-0001-Track-I-O-timing-for-temp-buffers.patch (application/octet-stream)
From dc76ac40030ceafb54a11a8126b1702fb807a64a Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 18 Aug 2021 14:55:17 +0900
Subject: [PATCH v2] Track I/O timing for temp buffers

Previously, the output of EXPLAIN with BUFFERS option shows only I/O
timing spent reading and writing shared and local buffers. This
commit shows I/O timing for temp buffers in the output of EXPLAIN. It
is helpful for users in a case where it is the temp buffers I/O that
is using most of the time.

Same as shared/local buffer I/O timings, temp buffers I/O timing is
shown only when track_io_timing is enabled.
---
 doc/src/sgml/ref/explain.sgml      |  8 +++----
 src/backend/commands/explain.c     | 36 +++++++++++++++++++++++-------
 src/backend/executor/instrument.c  |  6 +++++
 src/backend/storage/file/buffile.c | 29 +++++++++++++++++++++++-
 src/include/executor/instrument.h  |  2 ++
 5 files changed, 68 insertions(+), 13 deletions(-)

diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index 4d758fb237..70c65c4ba1 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -174,10 +174,10 @@ ROLLBACK;
       Include information on buffer usage. Specifically, include the number of
       shared blocks hit, read, dirtied, and written, the number of local blocks
       hit, read, dirtied, and written, the number of temp blocks read and
-      written, and the time spent reading and writing data file blocks
-      (in milliseconds) if <xref linkend="guc-track-io-timing"/> is enabled.
-      A <emphasis>hit</emphasis> means that a read was avoided because the block was
-      found already in cache when needed.
+      written, and the time spent reading and writing data file blocks and
+      temp file blocks (in milliseconds) if <xref linkend="guc-track-io-timing"/>
+      is enabled.  A <emphasis>hit</emphasis> means that a read was avoided because
+      the block was found already in cache when needed.
       Shared blocks contain data from regular tables and indexes;
       local blocks contain data from temporary tables and indexes;
       while temp blocks contain short-term working data used in sorts, hashes,
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 10644dfac4..c5b643cd31 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3501,8 +3501,11 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
 								usage->temp_blks_written > 0);
 		bool		has_timing = (!INSTR_TIME_IS_ZERO(usage->blk_read_time) ||
 								  !INSTR_TIME_IS_ZERO(usage->blk_write_time));
+		bool		has_temp_timing = (!INSTR_TIME_IS_ZERO(usage->temp_blk_read_time) ||
+									   !INSTR_TIME_IS_ZERO(usage->temp_blk_write_time));
 		bool		show_planning = (planning && (has_shared ||
-												  has_local || has_temp || has_timing));
+												  has_local || has_temp || has_timing ||
+												  has_temp_timing));
 
 		if (show_planning)
 		{
@@ -3567,16 +3570,33 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
 		}
 
 		/* As above, show only positive counter values. */
-		if (has_timing)
+		if (has_timing || has_temp_timing)
 		{
 			ExplainIndentText(es);
 			appendStringInfoString(es->str, "I/O Timings:");
-			if (!INSTR_TIME_IS_ZERO(usage->blk_read_time))
-				appendStringInfo(es->str, " read=%0.3f",
-								 INSTR_TIME_GET_MILLISEC(usage->blk_read_time));
-			if (!INSTR_TIME_IS_ZERO(usage->blk_write_time))
-				appendStringInfo(es->str, " write=%0.3f",
-								 INSTR_TIME_GET_MILLISEC(usage->blk_write_time));
+
+			if (has_timing)
+			{
+				appendStringInfoString(es->str, " shared/local");
+				if (!INSTR_TIME_IS_ZERO(usage->blk_read_time))
+					appendStringInfo(es->str, " read=%0.3f",
+									 INSTR_TIME_GET_MILLISEC(usage->blk_read_time));
+				if (!INSTR_TIME_IS_ZERO(usage->blk_write_time))
+					appendStringInfo(es->str, " write=%0.3f",
+									 INSTR_TIME_GET_MILLISEC(usage->blk_write_time));
+				if (has_temp_timing)
+					appendStringInfoChar(es->str, ',');
+			}
+			if (has_temp_timing)
+			{
+				appendStringInfoString(es->str, " temp");
+				if (!INSTR_TIME_IS_ZERO(usage->temp_blk_read_time))
+					appendStringInfo(es->str, " read=%0.3f",
+									 INSTR_TIME_GET_MILLISEC(usage->temp_blk_read_time));
+				if (!INSTR_TIME_IS_ZERO(usage->temp_blk_write_time))
+					appendStringInfo(es->str, " write=%0.3f",
+									 INSTR_TIME_GET_MILLISEC(usage->temp_blk_write_time));
+			}
 			appendStringInfoChar(es->str, '\n');
 		}
 
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 2b106d8473..5c3cc4b1cc 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -237,6 +237,8 @@ BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
 	dst->temp_blks_written += add->temp_blks_written;
 	INSTR_TIME_ADD(dst->blk_read_time, add->blk_read_time);
 	INSTR_TIME_ADD(dst->blk_write_time, add->blk_write_time);
+	INSTR_TIME_ADD(dst->temp_blk_read_time, add->temp_blk_read_time);
+	INSTR_TIME_ADD(dst->temp_blk_write_time, add->temp_blk_write_time);
 }
 
 /* dst += add - sub */
@@ -259,6 +261,10 @@ BufferUsageAccumDiff(BufferUsage *dst,
 						  add->blk_read_time, sub->blk_read_time);
 	INSTR_TIME_ACCUM_DIFF(dst->blk_write_time,
 						  add->blk_write_time, sub->blk_write_time);
+	INSTR_TIME_ACCUM_DIFF(dst->temp_blk_read_time,
+						  add->temp_blk_read_time, sub->temp_blk_read_time);
+	INSTR_TIME_ACCUM_DIFF(dst->temp_blk_write_time,
+						  add->temp_blk_write_time, sub->temp_blk_write_time);
 }
 
 /* helper functions for WAL usage accumulation */
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index a4be5fe513..0188fbc1b6 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -417,6 +417,8 @@ static void
 BufFileLoadBuffer(BufFile *file)
 {
 	File		thisfile;
+	instr_time	io_start;
+	instr_time	io_time;
 
 	/*
 	 * Advance to next component file if necessary and possible.
@@ -428,10 +430,14 @@ BufFileLoadBuffer(BufFile *file)
 		file->curOffset = 0L;
 	}
 
+	thisfile = file->files[file->curFile];
+
+	if (track_io_timing)
+		INSTR_TIME_SET_CURRENT(io_start);
+
 	/*
 	 * Read whatever we can get, up to a full bufferload.
 	 */
-	thisfile = file->files[file->curFile];
 	file->nbytes = FileRead(thisfile,
 							file->buffer.data,
 							sizeof(file->buffer),
@@ -446,6 +452,13 @@ BufFileLoadBuffer(BufFile *file)
 						FilePathName(thisfile))));
 	}
 
+	if (track_io_timing)
+	{
+		INSTR_TIME_SET_CURRENT(io_time);
+		INSTR_TIME_SUBTRACT(io_time, io_start);
+		INSTR_TIME_ADD(pgBufferUsage.temp_blk_read_time, io_time);
+	}
+
 	/* we choose not to advance curOffset here */
 
 	if (file->nbytes > 0)
@@ -473,6 +486,8 @@ BufFileDumpBuffer(BufFile *file)
 	while (wpos < file->nbytes)
 	{
 		off_t		availbytes;
+		instr_time	io_start;
+		instr_time	io_time;
 
 		/*
 		 * Advance to next component file if necessary and possible.
@@ -495,6 +510,10 @@ BufFileDumpBuffer(BufFile *file)
 			bytestowrite = (int) availbytes;
 
 		thisfile = file->files[file->curFile];
+
+		if (track_io_timing)
+			INSTR_TIME_SET_CURRENT(io_start);
+
 		bytestowrite = FileWrite(thisfile,
 								 file->buffer.data + wpos,
 								 bytestowrite,
@@ -505,6 +524,14 @@ BufFileDumpBuffer(BufFile *file)
 					(errcode_for_file_access(),
 					 errmsg("could not write to file \"%s\": %m",
 							FilePathName(thisfile))));
+
+		if (track_io_timing)
+		{
+			INSTR_TIME_SET_CURRENT(io_time);
+			INSTR_TIME_SUBTRACT(io_time, io_start);
+			INSTR_TIME_ADD(pgBufferUsage.temp_blk_write_time, io_time);
+		}
+
 		file->curOffset += bytestowrite;
 		wpos += bytestowrite;
 
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 2f9905b7c8..6f3cbcabef 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -35,6 +35,8 @@ typedef struct BufferUsage
 	int64		temp_blks_written;	/* # of temp blocks written */
 	instr_time	blk_read_time;	/* time spent reading */
 	instr_time	blk_write_time; /* time spent writing */
+	instr_time	temp_blk_read_time;	/* time spent reading tmp blocks */
+	instr_time	temp_blk_write_time; /* time spent writing tmp blocks */
 } BufferUsage;
 
 /*
-- 
2.24.3 (Apple Git-128)

#5 Ranier Vilela
ranier.vf@gmail.com
In reply to: Masahiko Sawada (#4)
Re: Showing I/O timings spent reading/writing temp buffers in EXPLAIN

On Tue, Aug 24, 2021 at 03:11, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Aug 23, 2021 at 10:46 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Aug 19, 2021 at 10:52 PM Ranier Vilela <ranier.vf@gmail.com> wrote:

On Thu, Aug 19, 2021 at 09:21, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

<snip>

Thank you for the comments!

The presentation seems a little confusing, wouldn't it be better?

I/O Timings: shared/local read=xxxx write=xxx temp read=0.487 write=2.073

Yeah, it looks better to add "shared/local".

I've attached the updated patch that incorporates the above comment.

The patch looks fine to me.

regards,
Ranier Vilela

#6 Noname
gkokolatos@pm.me
In reply to: Ranier Vilela (#5)
Re: Showing I/O timings spent reading/writing temp buffers in EXPLAIN

Hi,

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Tuesday, August 24th, 2021 at 13:20, Ranier Vilela <ranier.vf@gmail.com> wrote:

On Tue, Aug 24, 2021 at 03:11, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Aug 23, 2021 at 10:46 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Aug 19, 2021 at 10:52 PM Ranier Vilela <ranier.vf@gmail.com> wrote:

On Thu, Aug 19, 2021 at 09:21, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Hi all ,

It's reported on pgsql-bugs[1] that I/O timings in EXPLAIN don't show

<snip>

I've attached the updated patch that incorporates the above comment.

The patch looks fine to me.

The patch looks good to me too. However I do wonder why the timing is added
only in the

if (es->format == EXPLAIN_FORMAT_TEXT)

block and is not added when, for example, the format is json. The instrumentation has
clearly recorded the timings regardless of the output format.

Also, it might be worthwhile to consider adding some regression tests. To my
understanding, explain.sql provides a function, explain_filter, which helps create
a stable result. For example, such a test case can be:

set track_io_timing = 'on';
select explain_filter('explain (analyze, buffers) select count(*) from generate_series(1,100000)');

then it would be enough to verify that the line:

I/O Timings: temp read=N.N write=N.N

is present. The above would apply to the json output via `explain_filter_to_json`,
of course.
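
For instance, a rough sketch of the json side could look like the following
(explain_filter_to_json is the helper mentioned above; the "Temp I/O Read Time"
key name is only illustrative here, since the actual field names would come from
the patch):

set track_io_timing = on;
select jsonb_path_exists(
         explain_filter_to_json('explain (analyze, buffers, format json)
                                 select count(*) from generate_series(1, 100000)'),
         '$[0]."Plan"."Temp I/O Read Time"');
reset track_io_timing;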

Thoughts?

Cheers,
//Georgios

#7 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Noname (#6)
1 attachment(s)
Re: Showing I/O timings spent reading/writing temp buffers in EXPLAIN

On Tue, Sep 7, 2021 at 12:02 AM <gkokolatos@pm.me> wrote:

Hi,

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Tuesday, August 24th, 2021 at 13:20, Ranier Vilela <ranier.vf@gmail.com> wrote:

On Tue, Aug 24, 2021 at 03:11, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Aug 23, 2021 at 10:46 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Aug 19, 2021 at 10:52 PM Ranier Vilela <ranier.vf@gmail.com> wrote:

On Thu, Aug 19, 2021 at 09:21, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Hi all ,

It's reported on pgsql-bugs[1] that I/O timings in EXPLAIN don't show

<snip>

I've attached the updated patch that incorporates the above comment.

The patch looks fine to me.

Thank you for the comments!

The patch looks good to me too. However I do wonder why the timing is added only on
the

if (es->format == EXPLAIN_FORMAT_TEXT)

block and is not added when, for example, the format is json. The instrumentation has
clearly recorded the timings regardless of the output format.

Good point. Fixed.

Also, it might be worth while to consider adding some regression tests. To my
understanding, explain.sql provides a function, explain_filter, which helps create
a stable result. For example, such a test case can be:

set track_io_timing = 'on';
select explain_filter('explain (analyze, buffers) select count(*) from generate_series(1,100000)');

then it would be enough to verify that the line:

I/O Timings: temp read=N.N write=N.N

is present. The above would apply on the json output via `explain_filter_to_json`
of course.

Agreed. I've added regression tests.

I've attached an updated patch. Please review it.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

Attachments:

v3-0001-Track-I-O-timing-for-temp-buffers.patch (application/octet-stream)
From bf524d3e3b2c3868b241395fc7bb34024643fd64 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 18 Aug 2021 14:55:17 +0900
Subject: [PATCH v3] Track I/O timing for temp buffers

Previously, the output of EXPLAIN with BUFFERS option shows only I/O
timing spent reading and writing shared and local buffers. This
commit shows I/O timing for temp buffers in the output of EXPLAIN. It
is helpful for users in a case where it is the temp buffers I/O that
is using most of the time.

Same as shared/local buffer I/O timings, temp buffers I/O timing is
shown only when track_io_timing is enabled.
---
 doc/src/sgml/ref/explain.sgml         |   8 +-
 src/backend/commands/explain.c        |  42 ++++++++--
 src/backend/executor/instrument.c     |   6 ++
 src/backend/storage/file/buffile.c    |  26 +++++++
 src/include/executor/instrument.h     |   2 +
 src/test/regress/expected/explain.out | 106 ++++++++++++++++++++++++++
 src/test/regress/sql/explain.sql      |   8 ++
 7 files changed, 186 insertions(+), 12 deletions(-)

diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index 4d758fb237..70c65c4ba1 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -174,10 +174,10 @@ ROLLBACK;
       Include information on buffer usage. Specifically, include the number of
       shared blocks hit, read, dirtied, and written, the number of local blocks
       hit, read, dirtied, and written, the number of temp blocks read and
-      written, and the time spent reading and writing data file blocks
-      (in milliseconds) if <xref linkend="guc-track-io-timing"/> is enabled.
-      A <emphasis>hit</emphasis> means that a read was avoided because the block was
-      found already in cache when needed.
+      written, and the time spent reading and writing data file blocks and
+      temp file blocks (in milliseconds) if <xref linkend="guc-track-io-timing"/>
+      is enabled.  A <emphasis>hit</emphasis> means that a read was avoided because
+      the block was found already in cache when needed.
       Shared blocks contain data from regular tables and indexes;
       local blocks contain data from temporary tables and indexes;
       while temp blocks contain short-term working data used in sorts, hashes,
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 10644dfac4..02052af55c 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3501,8 +3501,11 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
 								usage->temp_blks_written > 0);
 		bool		has_timing = (!INSTR_TIME_IS_ZERO(usage->blk_read_time) ||
 								  !INSTR_TIME_IS_ZERO(usage->blk_write_time));
+		bool		has_temp_timing = (!INSTR_TIME_IS_ZERO(usage->temp_blk_read_time) ||
+									   !INSTR_TIME_IS_ZERO(usage->temp_blk_write_time));
 		bool		show_planning = (planning && (has_shared ||
-												  has_local || has_temp || has_timing));
+												  has_local || has_temp || has_timing ||
+												  has_temp_timing));
 
 		if (show_planning)
 		{
@@ -3567,16 +3570,33 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
 		}
 
 		/* As above, show only positive counter values. */
-		if (has_timing)
+		if (has_timing || has_temp_timing)
 		{
 			ExplainIndentText(es);
 			appendStringInfoString(es->str, "I/O Timings:");
-			if (!INSTR_TIME_IS_ZERO(usage->blk_read_time))
-				appendStringInfo(es->str, " read=%0.3f",
-								 INSTR_TIME_GET_MILLISEC(usage->blk_read_time));
-			if (!INSTR_TIME_IS_ZERO(usage->blk_write_time))
-				appendStringInfo(es->str, " write=%0.3f",
-								 INSTR_TIME_GET_MILLISEC(usage->blk_write_time));
+
+			if (has_timing)
+			{
+				appendStringInfoString(es->str, " shared/local");
+				if (!INSTR_TIME_IS_ZERO(usage->blk_read_time))
+					appendStringInfo(es->str, " read=%0.3f",
+									 INSTR_TIME_GET_MILLISEC(usage->blk_read_time));
+				if (!INSTR_TIME_IS_ZERO(usage->blk_write_time))
+					appendStringInfo(es->str, " write=%0.3f",
+									 INSTR_TIME_GET_MILLISEC(usage->blk_write_time));
+				if (has_temp_timing)
+					appendStringInfoChar(es->str, ',');
+			}
+			if (has_temp_timing)
+			{
+				appendStringInfoString(es->str, " temp");
+				if (!INSTR_TIME_IS_ZERO(usage->temp_blk_read_time))
+					appendStringInfo(es->str, " read=%0.3f",
+									 INSTR_TIME_GET_MILLISEC(usage->temp_blk_read_time));
+				if (!INSTR_TIME_IS_ZERO(usage->temp_blk_write_time))
+					appendStringInfo(es->str, " write=%0.3f",
+									 INSTR_TIME_GET_MILLISEC(usage->temp_blk_write_time));
+			}
 			appendStringInfoChar(es->str, '\n');
 		}
 
@@ -3613,6 +3633,12 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
 			ExplainPropertyFloat("I/O Write Time", "ms",
 								 INSTR_TIME_GET_MILLISEC(usage->blk_write_time),
 								 3, es);
+			ExplainPropertyFloat("Temp I/O Read Time", "ms",
+								 INSTR_TIME_GET_MILLISEC(usage->temp_blk_read_time),
+								 3, es);
+			ExplainPropertyFloat("Temp I/O Write Time", "ms",
+								 INSTR_TIME_GET_MILLISEC(usage->temp_blk_write_time),
+								 3, es);
 		}
 	}
 }
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 2b106d8473..5c3cc4b1cc 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -237,6 +237,8 @@ BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
 	dst->temp_blks_written += add->temp_blks_written;
 	INSTR_TIME_ADD(dst->blk_read_time, add->blk_read_time);
 	INSTR_TIME_ADD(dst->blk_write_time, add->blk_write_time);
+	INSTR_TIME_ADD(dst->temp_blk_read_time, add->temp_blk_read_time);
+	INSTR_TIME_ADD(dst->temp_blk_write_time, add->temp_blk_write_time);
 }
 
 /* dst += add - sub */
@@ -259,6 +261,10 @@ BufferUsageAccumDiff(BufferUsage *dst,
 						  add->blk_read_time, sub->blk_read_time);
 	INSTR_TIME_ACCUM_DIFF(dst->blk_write_time,
 						  add->blk_write_time, sub->blk_write_time);
+	INSTR_TIME_ACCUM_DIFF(dst->temp_blk_read_time,
+						  add->temp_blk_read_time, sub->temp_blk_read_time);
+	INSTR_TIME_ACCUM_DIFF(dst->temp_blk_write_time,
+						  add->temp_blk_write_time, sub->temp_blk_write_time);
 }
 
 /* helper functions for WAL usage accumulation */
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index b673150dbe..998c7bffe1 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -429,6 +429,8 @@ static void
 BufFileLoadBuffer(BufFile *file)
 {
 	File		thisfile;
+	instr_time	io_start;
+	instr_time	io_time;
 
 	/*
 	 * Advance to next component file if necessary and possible.
@@ -440,6 +442,9 @@ BufFileLoadBuffer(BufFile *file)
 		file->curOffset = 0L;
 	}
 
+	if (track_io_timing)
+		INSTR_TIME_SET_CURRENT(io_start);
+
 	/*
 	 * Read whatever we can get, up to a full bufferload.
 	 */
@@ -458,6 +463,13 @@ BufFileLoadBuffer(BufFile *file)
 						FilePathName(thisfile))));
 	}
 
+	if (track_io_timing)
+	{
+		INSTR_TIME_SET_CURRENT(io_time);
+		INSTR_TIME_SUBTRACT(io_time, io_start);
+		INSTR_TIME_ADD(pgBufferUsage.temp_blk_read_time, io_time);
+	}
+
 	/* we choose not to advance curOffset here */
 
 	if (file->nbytes > 0)
@@ -485,6 +497,8 @@ BufFileDumpBuffer(BufFile *file)
 	while (wpos < file->nbytes)
 	{
 		off_t		availbytes;
+		instr_time	io_start;
+		instr_time	io_time;
 
 		/*
 		 * Advance to next component file if necessary and possible.
@@ -507,6 +521,10 @@ BufFileDumpBuffer(BufFile *file)
 			bytestowrite = (int) availbytes;
 
 		thisfile = file->files[file->curFile];
+
+		if (track_io_timing)
+			INSTR_TIME_SET_CURRENT(io_start);
+
 		bytestowrite = FileWrite(thisfile,
 								 file->buffer.data + wpos,
 								 bytestowrite,
@@ -517,6 +535,14 @@ BufFileDumpBuffer(BufFile *file)
 					(errcode_for_file_access(),
 					 errmsg("could not write to file \"%s\": %m",
 							FilePathName(thisfile))));
+
+		if (track_io_timing)
+		{
+			INSTR_TIME_SET_CURRENT(io_time);
+			INSTR_TIME_SUBTRACT(io_time, io_start);
+			INSTR_TIME_ADD(pgBufferUsage.temp_blk_write_time, io_time);
+		}
+
 		file->curOffset += bytestowrite;
 		wpos += bytestowrite;
 
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 2f9905b7c8..6f3cbcabef 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -35,6 +35,8 @@ typedef struct BufferUsage
 	int64		temp_blks_written;	/* # of temp blocks written */
 	instr_time	blk_read_time;	/* time spent reading */
 	instr_time	blk_write_time; /* time spent writing */
+	instr_time	temp_blk_read_time;	/* time spent reading tmp blocks */
+	instr_time	temp_blk_write_time; /* time spent writing tmp blocks */
 } BufferUsage;
 
 /*
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index 1734dfee8c..2534f92c72 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -502,3 +502,109 @@ select explain_filter('explain (verbose) select * from int8_tbl i8');
  Query Identifier: N
 (3 rows)
 
+-- Test I/O timing for temp buffers
+set track_io_timing = on;
+set work_mem to '4MB';
+select explain_filter('explain (analyze, buffers) select count(*) from generate_series(1, 100000)');
+                                                explain_filter                                                
+--------------------------------------------------------------------------------------------------------------
+ Aggregate  (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N loops=N)
+   I/O Timings: temp read=N.N write=N.N
+   ->  Function Scan on generate_series  (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N loops=N)
+         I/O Timings: temp read=N.N write=N.N
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(6 rows)
+
+select explain_filter('explain (analyze, buffers, format json) select count(*) from generate_series(1, 100000)');
+                explain_filter                 
+-----------------------------------------------
+ [                                            +
+   {                                          +
+     "Plan": {                                +
+       "Node Type": "Aggregate",              +
+       "Strategy": "Plain",                   +
+       "Partial Mode": "Simple",              +
+       "Parallel Aware": false,               +
+       "Async Capable": false,                +
+       "Startup Cost": N.N,                   +
+       "Total Cost": N.N,                     +
+       "Plan Rows": N,                        +
+       "Plan Width": N,                       +
+       "Actual Startup Time": N.N,            +
+       "Actual Total Time": N.N,              +
+       "Actual Rows": N,                      +
+       "Actual Loops": N,                     +
+       "Shared Hit Blocks": N,                +
+       "Shared Read Blocks": N,               +
+       "Shared Dirtied Blocks": N,            +
+       "Shared Written Blocks": N,            +
+       "Local Hit Blocks": N,                 +
+       "Local Read Blocks": N,                +
+       "Local Dirtied Blocks": N,             +
+       "Local Written Blocks": N,             +
+       "Temp Read Blocks": N,                 +
+       "Temp Written Blocks": N,              +
+       "I/O Read Time": N.N,                  +
+       "I/O Write Time": N.N,                 +
+       "Temp I/O Read Time": N.N,             +
+       "Temp I/O Write Time": N.N,            +
+       "Plans": [                             +
+         {                                    +
+           "Node Type": "Function Scan",      +
+           "Parent Relationship": "Outer",    +
+           "Parallel Aware": false,           +
+           "Async Capable": false,            +
+           "Function Name": "generate_series",+
+           "Alias": "generate_series",        +
+           "Startup Cost": N.N,               +
+           "Total Cost": N.N,                 +
+           "Plan Rows": N,                    +
+           "Plan Width": N,                   +
+           "Actual Startup Time": N.N,        +
+           "Actual Total Time": N.N,          +
+           "Actual Rows": N,                  +
+           "Actual Loops": N,                 +
+           "Shared Hit Blocks": N,            +
+           "Shared Read Blocks": N,           +
+           "Shared Dirtied Blocks": N,        +
+           "Shared Written Blocks": N,        +
+           "Local Hit Blocks": N,             +
+           "Local Read Blocks": N,            +
+           "Local Dirtied Blocks": N,         +
+           "Local Written Blocks": N,         +
+           "Temp Read Blocks": N,             +
+           "Temp Written Blocks": N,          +
+           "I/O Read Time": N.N,              +
+           "I/O Write Time": N.N,             +
+           "Temp I/O Read Time": N.N,         +
+           "Temp I/O Write Time": N.N         +
+         }                                    +
+       ]                                      +
+     },                                       +
+     "Planning": {                            +
+       "Shared Hit Blocks": N,                +
+       "Shared Read Blocks": N,               +
+       "Shared Dirtied Blocks": N,            +
+       "Shared Written Blocks": N,            +
+       "Local Hit Blocks": N,                 +
+       "Local Read Blocks": N,                +
+       "Local Dirtied Blocks": N,             +
+       "Local Written Blocks": N,             +
+       "Temp Read Blocks": N,                 +
+       "Temp Written Blocks": N,              +
+       "I/O Read Time": N.N,                  +
+       "I/O Write Time": N.N,                 +
+       "Temp I/O Read Time": N.N,             +
+       "Temp I/O Write Time": N.N             +
+     },                                       +
+     "Planning Time": N.N,                    +
+     "Triggers": [                            +
+     ],                                       +
+     "Execution Time": N.N                    +
+   }                                          +
+ ]
+(1 row)
+
+reset track_io_timing;
+reset work_mem;
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index 84549c78fa..7cf9f4734c 100644
--- a/src/test/regress/sql/explain.sql
+++ b/src/test/regress/sql/explain.sql
@@ -120,3 +120,11 @@ select explain_filter('explain (verbose) select * from t1 where pg_temp.mysin(f1
 -- Test compute_query_id
 set compute_query_id = on;
 select explain_filter('explain (verbose) select * from int8_tbl i8');
+
+-- Test I/O timing for temp buffers
+set track_io_timing = on;
+set work_mem to '4MB';
+select explain_filter('explain (analyze, buffers) select count(*) from generate_series(1, 100000)');
+select explain_filter('explain (analyze, buffers, format json) select count(*) from generate_series(1, 100000)');
+reset track_io_timing;
+reset work_mem;
-- 
2.24.3 (Apple Git-128)

#8 Melanie Plageman
melanieplageman@gmail.com
In reply to: Masahiko Sawada (#3)
Re: Showing I/O timings spent reading/writing temp buffers in EXPLAIN

On Sun, Aug 22, 2021 at 9:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Aug 19, 2021 at 10:52 PM Ranier Vilela <ranier.vf@gmail.com> wrote:

On Thu, Aug 19, 2021 at 09:21, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

The presentation seems a little confusing, wouldn't it be better?

I/O Timings: shared/local read=xxxx write=xxx temp read=0.487 write=2.073

Yeah, it looks better to add "shared/local".

Using the patch, I do feel like the EXPLAIN format of

shared/local xxx, temp xxx

is a bit confusing. If temp is going to be its own EXPLAIN IO timing
category (as opposed to being summed with relation data block IO from
local and shared buffers), then it seems like local and shared should
be separated as well.

shared xxx, local xxx, temp xxx

With your patch applied, below is the top of the EXPLAIN output for a
query joining a temporary table (so using local buffers) to a regular
table (using shared buffers) and spilling the hash join (temp files).

Aggregate (actual rows=1 loops=1)
  Buffers: shared read=4425, local read=4425 dirtied=4425 written=4423, temp read=5963 written=5963
  I/O Timings: shared/local read=23.546, temp read=13.309 write=74.198

I found that using the same terminology as the "EXPLAIN BUFFERS" output
but not using the same categories was kind of confusing.

If it is only meaningful to distinguish between relation data IO and
query temp file IO, then maybe the words used in I/O Timings in EXPLAIN
should be "rel data" and "temp" or something like that.

- Melanie

#9 Julien Rouhaud
rjuju123@gmail.com
In reply to: Masahiko Sawada (#7)
Re: Showing I/O timings spent reading/writing temp buffers in EXPLAIN

Hi,

On Tue, Nov 16, 2021 at 04:37:44PM +0900, Masahiko Sawada wrote:

I've attached an updated patch. Please review it.

It seems that the regression tests aren't entirely stable, per cfbot:
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/36/3298.

The failures look like:

diff -U3 /tmp/cirrus-ci-build/src/test/recovery/../regress/expected/explain.out /tmp/cirrus-ci-build/src/test/recovery/tmp_check/results/explain.out
--- /tmp/cirrus-ci-build/src/test/recovery/../regress/expected/explain.out	2022-01-19 03:50:37.087931908 +0000
+++ /tmp/cirrus-ci-build/src/test/recovery/tmp_check/results/explain.out	2022-01-19 03:57:41.013616137 +0000
@@ -512,9 +512,10 @@
    I/O Timings: temp read=N.N write=N.N
    ->  Function Scan on generate_series  (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N loops=N)
          I/O Timings: temp read=N.N write=N.N
+   I/O Timings: shared/local read=N.N
  Planning Time: N.N ms
  Execution Time: N.N ms
-(6 rows)
+(7 rows)

select explain_filter('explain (analyze, buffers, format json) select count(*) from generate_series(1, 100000)');
explain_filter

I don't see any obvious error in the code, so I'm wondering if it's simply
the result of populating the cache with generate_series() info.
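
A rough way to check that (just a sketch, not something taken from the failing
run itself) would be to execute the statement twice in the same session, so the
second EXPLAIN plans with warm catalog caches:

set track_io_timing = on;
-- first run in a fresh session: planning may have to read catalog blocks
select explain_filter('explain (analyze, buffers) select count(*) from generate_series(1, 100000)');
-- second run: caches are warm, so the extra shared/local read timing line
-- should disappear if cache population is indeed the cause
select explain_filter('explain (analyze, buffers) select count(*) from generate_series(1, 100000)');
reset track_io_timing;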

#10 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Melanie Plageman (#8)
Re: Showing I/O timings spent reading/writing temp buffers in EXPLAIN

Hi,

Sorry for the late reply.

On Fri, Nov 19, 2021 at 7:24 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:

On Sun, Aug 22, 2021 at 9:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Aug 19, 2021 at 10:52 PM Ranier Vilela <ranier.vf@gmail.com> wrote:

On Thu, Aug 19, 2021 at 09:21, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

The presentation seems a little confusing, wouldn't it be better?

I/O Timings: shared/local read=xxxx write=xxx temp read=0.487 write=2.073

Yeah, it looks better to add "shared/local".

Using the patch, I do feel like the EXPLAIN format of

shared/local xxx, temp xxx

is a bit confusing. If temp is going to be its own EXPLAIN IO timing
category (as opposed to being summed with relation data block IO from
local and shared buffers), then it seems like local and shared should
be separated as well.

shared xxx, local xxx, temp xxx

I think the I/O timing shown as shared/local is the time spent on disk
I/O, so it doesn't matter whether that disk I/O is for shared buffers or
local buffers.

With your patch applied, below is the top of the EXPLAIN output for a
query joining a temporary table (so using local buffers) to a regular
table (using shared buffers) and spilling the hash join (temp files).

Aggregate (actual rows=1 loops=1)
Buffers: shared read=4425, local read=4425 dirtied=4425
written=4423, temp read=5963 written=5963
I/O Timings: shared/local read=23.546, temp read=13.309 write=74.198

I found that using the same terminology as the "EXPLAIN BUFFERS" output
but not using the same categories was kind of confusing.

If it is only meaningful to distinguish between relation data IO and
query temp file IO, then maybe the words used in I/O Timings in EXPLAIN
should be "rel data" and "temp" or something like that.

But if we do that, we end up using different terminology in "I/O
Timings" and "Buffers". I think it's better to keep the wording
consistent between them.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#11 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Julien Rouhaud (#9)
2 attachment(s)
Re: Showing I/O timings spent reading/writing temp buffers in EXPLAIN

On Wed, Jan 19, 2022 at 5:52 PM Julien Rouhaud <rjuju123@gmail.com> wrote:

Hi,

On Tue, Nov 16, 2021 at 04:37:44PM +0900, Masahiko Sawada wrote:

I've attached an updated patch. Please review it.

It seems that the regression tests aren't entirely stable, per cfbot:
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/36/3298.

The failures look like:

diff -U3 /tmp/cirrus-ci-build/src/test/recovery/../regress/expected/explain.out /tmp/cirrus-ci-build/src/test/recovery/tmp_check/results/explain.out
--- /tmp/cirrus-ci-build/src/test/recovery/../regress/expected/explain.out      2022-01-19 03:50:37.087931908 +0000
+++ /tmp/cirrus-ci-build/src/test/recovery/tmp_check/results/explain.out        2022-01-19 03:57:41.013616137 +0000
@@ -512,9 +512,10 @@
I/O Timings: temp read=N.N write=N.N
->  Function Scan on generate_series  (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N loops=N)
I/O Timings: temp read=N.N write=N.N
+   I/O Timings: shared/local read=N.N
Planning Time: N.N ms
Execution Time: N.N ms
-(6 rows)
+(7 rows)

select explain_filter('explain (analyze, buffers, format json) select count(*) from generate_series(1, 100000)');
explain_filter

I don't see any obvious error in the code, so I'm wondering if it's simply
the result of populating the cache with generate_series() info.

Thank you for reporting.

You're right, the regression test with track_io_timing = on is not
entirely stable because we may read catalog data during planning time,
resulting in an additional line in the EXPLAIN output. I've removed
the regression tests.

I've attached updated patches. I've also included a pg_stat_statements
improvement to track temp I/O timing.
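
With the 0002 patch applied, the new counters can be inspected with a query
along these lines (just a sketch; temp_blk_read_time and temp_blk_write_time
are the columns the patch adds):

select query, temp_blks_read, temp_blks_written,
       temp_blk_read_time, temp_blk_write_time
from pg_stat_statements
order by temp_blk_read_time + temp_blk_write_time desc
limit 5;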

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

Attachments:

v3-0002-pg_stat_statements-Track-I-O-timing-for-temp-bloc.patch (application/octet-stream)
From 592cd40ddc00ad5723e9061ab547b2e51692bc25 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 1 Mar 2022 10:42:59 +0900
Subject: [PATCH v3 2/2] pg_stat_statements: Track I/O timing for temp blocks.

This commit adds new columns temp_blk_read_time and
temp_blk_write_time to pg_stat_statements function, showing the time
spent on disk I/O for temp blocks.

Bumped pg_stat_statements version to 1.10 for newly added columns.

XXX: need to bump PGSS_FILE_HEADER.
---
 contrib/pg_stat_statements/Makefile           |  3 +-
 .../expected/oldextversions.out               | 94 +++++++++++++++++++
 .../pg_stat_statements--1.9--1.10.sql         | 59 ++++++++++++
 .../pg_stat_statements/pg_stat_statements.c   | 35 ++++++-
 .../pg_stat_statements.control                |  2 +-
 .../pg_stat_statements/sql/oldextversions.sql | 10 ++
 doc/src/sgml/pgstatstatements.sgml            | 20 ++++
 7 files changed, 217 insertions(+), 6 deletions(-)
 create mode 100644 contrib/pg_stat_statements/pg_stat_statements--1.9--1.10.sql

diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile
index 7fabd96f38..edc40c8bbf 100644
--- a/contrib/pg_stat_statements/Makefile
+++ b/contrib/pg_stat_statements/Makefile
@@ -6,7 +6,8 @@ OBJS = \
 	pg_stat_statements.o
 
 EXTENSION = pg_stat_statements
-DATA = pg_stat_statements--1.4.sql pg_stat_statements--1.8--1.9.sql \
+DATA = pg_stat_statements--1.4.sql \
+	pg_stat_statements--1.9--1.10.sql pg_stat_statements--1.8--1.9.sql \
 	pg_stat_statements--1.7--1.8.sql pg_stat_statements--1.6--1.7.sql \
 	pg_stat_statements--1.5--1.6.sql pg_stat_statements--1.4--1.5.sql \
 	pg_stat_statements--1.3--1.4.sql pg_stat_statements--1.2--1.3.sql \
diff --git a/contrib/pg_stat_statements/expected/oldextversions.out b/contrib/pg_stat_statements/expected/oldextversions.out
index f18c08838f..fbd60380ee 100644
--- a/contrib/pg_stat_statements/expected/oldextversions.out
+++ b/contrib/pg_stat_statements/expected/oldextversions.out
@@ -136,4 +136,98 @@ SELECT pg_get_functiondef('pg_stat_statements_reset'::regproc);
  
 (1 row)
 
+-- New function pg_stat_statement_info, and new function
+-- and view for pg_stat_statements introduced in 1.9
+AlTER EXTENSION pg_stat_statements UPDATE TO '1.9';
+SELECT pg_get_functiondef('pg_stat_statements_info'::regproc);
+                                                   pg_get_functiondef                                                    
+-------------------------------------------------------------------------------------------------------------------------
+ CREATE OR REPLACE FUNCTION public.pg_stat_statements_info(OUT dealloc bigint, OUT stats_reset timestamp with time zone)+
+  RETURNS record                                                                                                        +
+  LANGUAGE c                                                                                                            +
+  PARALLEL SAFE STRICT                                                                                                  +
+ AS '$libdir/pg_stat_statements', $function$pg_stat_statements_info$function$                                           +
+ 
+(1 row)
+
+\d pg_stat_statements
+                    View "public.pg_stat_statements"
+       Column        |       Type       | Collation | Nullable | Default 
+---------------------+------------------+-----------+----------+---------
+ userid              | oid              |           |          | 
+ dbid                | oid              |           |          | 
+ toplevel            | boolean          |           |          | 
+ queryid             | bigint           |           |          | 
+ query               | text             |           |          | 
+ plans               | bigint           |           |          | 
+ total_plan_time     | double precision |           |          | 
+ min_plan_time       | double precision |           |          | 
+ max_plan_time       | double precision |           |          | 
+ mean_plan_time      | double precision |           |          | 
+ stddev_plan_time    | double precision |           |          | 
+ calls               | bigint           |           |          | 
+ total_exec_time     | double precision |           |          | 
+ min_exec_time       | double precision |           |          | 
+ max_exec_time       | double precision |           |          | 
+ mean_exec_time      | double precision |           |          | 
+ stddev_exec_time    | double precision |           |          | 
+ rows                | bigint           |           |          | 
+ shared_blks_hit     | bigint           |           |          | 
+ shared_blks_read    | bigint           |           |          | 
+ shared_blks_dirtied | bigint           |           |          | 
+ shared_blks_written | bigint           |           |          | 
+ local_blks_hit      | bigint           |           |          | 
+ local_blks_read     | bigint           |           |          | 
+ local_blks_dirtied  | bigint           |           |          | 
+ local_blks_written  | bigint           |           |          | 
+ temp_blks_read      | bigint           |           |          | 
+ temp_blks_written   | bigint           |           |          | 
+ blk_read_time       | double precision |           |          | 
+ blk_write_time      | double precision |           |          | 
+ wal_records         | bigint           |           |          | 
+ wal_fpi             | bigint           |           |          | 
+ wal_bytes           | numeric          |           |          | 
+
+-- New functions and views for pg_stat_statements in 1.10
+AlTER EXTENSION pg_stat_statements UPDATE TO '1.10';
+\d pg_stat_statements
+                    View "public.pg_stat_statements"
+       Column        |       Type       | Collation | Nullable | Default 
+---------------------+------------------+-----------+----------+---------
+ userid              | oid              |           |          | 
+ dbid                | oid              |           |          | 
+ toplevel            | boolean          |           |          | 
+ queryid             | bigint           |           |          | 
+ query               | text             |           |          | 
+ plans               | bigint           |           |          | 
+ total_plan_time     | double precision |           |          | 
+ min_plan_time       | double precision |           |          | 
+ max_plan_time       | double precision |           |          | 
+ mean_plan_time      | double precision |           |          | 
+ stddev_plan_time    | double precision |           |          | 
+ calls               | bigint           |           |          | 
+ total_exec_time     | double precision |           |          | 
+ min_exec_time       | double precision |           |          | 
+ max_exec_time       | double precision |           |          | 
+ mean_exec_time      | double precision |           |          | 
+ stddev_exec_time    | double precision |           |          | 
+ rows                | bigint           |           |          | 
+ shared_blks_hit     | bigint           |           |          | 
+ shared_blks_read    | bigint           |           |          | 
+ shared_blks_dirtied | bigint           |           |          | 
+ shared_blks_written | bigint           |           |          | 
+ local_blks_hit      | bigint           |           |          | 
+ local_blks_read     | bigint           |           |          | 
+ local_blks_dirtied  | bigint           |           |          | 
+ local_blks_written  | bigint           |           |          | 
+ temp_blks_read      | bigint           |           |          | 
+ temp_blks_written   | bigint           |           |          | 
+ blk_read_time       | double precision |           |          | 
+ blk_write_time      | double precision |           |          | 
+ temp_blk_read_time  | double precision |           |          | 
+ temp_blk_write_time | double precision |           |          | 
+ wal_records         | bigint           |           |          | 
+ wal_fpi             | bigint           |           |          | 
+ wal_bytes           | numeric          |           |          | 
+
 DROP EXTENSION pg_stat_statements;
diff --git a/contrib/pg_stat_statements/pg_stat_statements--1.9--1.10.sql b/contrib/pg_stat_statements/pg_stat_statements--1.9--1.10.sql
new file mode 100644
index 0000000000..ca777f14e8
--- /dev/null
+++ b/contrib/pg_stat_statements/pg_stat_statements--1.9--1.10.sql
@@ -0,0 +1,59 @@
+/* contrib/pg_stat_statements/pg_stat_statements--1.9--1.10.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION pg_stat_statements UPDATE TO '1.10'" to load this file. \quit
+
+/* First we have to remove them from the extension */
+ALTER EXTENSION pg_stat_statements DROP VIEW pg_stat_statements;
+ALTER EXTENSION pg_stat_statements DROP FUNCTION pg_stat_statements(boolean);
+
+/* Then we can drop them */
+DROP VIEW pg_stat_statements;
+DROP FUNCTION pg_stat_statements(boolean);
+
+/* Now redefine */
+CREATE FUNCTION pg_stat_statements(IN showtext boolean,
+    OUT userid oid,
+    OUT dbid oid,
+    OUT toplevel bool,
+    OUT queryid bigint,
+    OUT query text,
+    OUT plans int8,
+    OUT total_plan_time float8,
+    OUT min_plan_time float8,
+    OUT max_plan_time float8,
+    OUT mean_plan_time float8,
+    OUT stddev_plan_time float8,
+    OUT calls int8,
+    OUT total_exec_time float8,
+    OUT min_exec_time float8,
+    OUT max_exec_time float8,
+    OUT mean_exec_time float8,
+    OUT stddev_exec_time float8,
+    OUT rows int8,
+    OUT shared_blks_hit int8,
+    OUT shared_blks_read int8,
+    OUT shared_blks_dirtied int8,
+    OUT shared_blks_written int8,
+    OUT local_blks_hit int8,
+    OUT local_blks_read int8,
+    OUT local_blks_dirtied int8,
+    OUT local_blks_written int8,
+    OUT temp_blks_read int8,
+    OUT temp_blks_written int8,
+    OUT blk_read_time float8,
+    OUT blk_write_time float8,
+    OUT temp_blk_read_time float8,
+    OUT temp_blk_write_time float8,
+    OUT wal_records int8,
+    OUT wal_fpi int8,
+    OUT wal_bytes numeric
+)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'pg_stat_statements_1_10'
+LANGUAGE C STRICT VOLATILE PARALLEL SAFE;
+
+CREATE VIEW pg_stat_statements AS
+  SELECT * FROM pg_stat_statements(true);
+
+GRANT SELECT ON pg_stat_statements TO PUBLIC;
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index d803253cea..88748b529e 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -121,7 +121,8 @@ typedef enum pgssVersion
 	PGSS_V1_2,
 	PGSS_V1_3,
 	PGSS_V1_8,
-	PGSS_V1_9
+	PGSS_V1_9,
+	PGSS_V1_10
 } pgssVersion;
 
 typedef enum pgssStoreKind
@@ -183,8 +184,10 @@ typedef struct Counters
 	int64		local_blks_written; /* # of local disk blocks written */
 	int64		temp_blks_read; /* # of temp blocks read */
 	int64		temp_blks_written;	/* # of temp blocks written */
-	double		blk_read_time;	/* time spent reading, in msec */
-	double		blk_write_time; /* time spent writing, in msec */
+	double		blk_read_time;	/* time spent reading blocks, in msec */
+	double		blk_write_time; /* time spent writing blocks, in msec */
+	double		temp_blk_read_time;	/* time spent reading temp blocks, in msec */
+	double		temp_blk_write_time; /* time spent writing temp blocks, in msec */
 	double		usage;			/* usage factor */
 	int64		wal_records;	/* # of WAL records generated */
 	int64		wal_fpi;		/* # of WAL full page images generated */
@@ -302,6 +305,7 @@ PG_FUNCTION_INFO_V1(pg_stat_statements_1_2);
 PG_FUNCTION_INFO_V1(pg_stat_statements_1_3);
 PG_FUNCTION_INFO_V1(pg_stat_statements_1_8);
 PG_FUNCTION_INFO_V1(pg_stat_statements_1_9);
+PG_FUNCTION_INFO_V1(pg_stat_statements_1_10);
 PG_FUNCTION_INFO_V1(pg_stat_statements);
 PG_FUNCTION_INFO_V1(pg_stat_statements_info);
 
@@ -1369,6 +1373,8 @@ pgss_store(const char *query, uint64 queryId,
 		e->counters.temp_blks_written += bufusage->temp_blks_written;
 		e->counters.blk_read_time += INSTR_TIME_GET_MILLISEC(bufusage->blk_read_time);
 		e->counters.blk_write_time += INSTR_TIME_GET_MILLISEC(bufusage->blk_write_time);
+		e->counters.temp_blk_read_time += INSTR_TIME_GET_MILLISEC(bufusage->temp_blk_read_time);
+		e->counters.temp_blk_write_time += INSTR_TIME_GET_MILLISEC(bufusage->temp_blk_write_time);
 		e->counters.usage += USAGE_EXEC(total_time);
 		e->counters.wal_records += walusage->wal_records;
 		e->counters.wal_fpi += walusage->wal_fpi;
@@ -1422,7 +1428,8 @@ pg_stat_statements_reset(PG_FUNCTION_ARGS)
 #define PG_STAT_STATEMENTS_COLS_V1_3	23
 #define PG_STAT_STATEMENTS_COLS_V1_8	32
 #define PG_STAT_STATEMENTS_COLS_V1_9	33
-#define PG_STAT_STATEMENTS_COLS			33	/* maximum of above */
+#define PG_STAT_STATEMENTS_COLS_V1_10	35
+#define PG_STAT_STATEMENTS_COLS			35	/* maximum of above */
 
 /*
  * Retrieve statement statistics.
@@ -1434,6 +1441,16 @@ pg_stat_statements_reset(PG_FUNCTION_ARGS)
  * expected API version is identified by embedding it in the C name of the
  * function.  Unfortunately we weren't bright enough to do that for 1.1.
  */
+Datum
+pg_stat_statements_1_10(PG_FUNCTION_ARGS)
+{
+	bool		showtext = PG_GETARG_BOOL(0);
+
+	pg_stat_statements_internal(fcinfo, PGSS_V1_10, showtext);
+
+	return (Datum) 0;
+}
+
 Datum
 pg_stat_statements_1_9(PG_FUNCTION_ARGS)
 {
@@ -1567,6 +1584,10 @@ pg_stat_statements_internal(FunctionCallInfo fcinfo,
 			if (api_version != PGSS_V1_9)
 				elog(ERROR, "incorrect number of output arguments");
 			break;
+		case PG_STAT_STATEMENTS_COLS_V1_10:
+			if (api_version != PGSS_V1_10)
+				elog(ERROR, "incorrect number of output arguments");
+			break;
 		default:
 			elog(ERROR, "incorrect number of output arguments");
 	}
@@ -1774,6 +1795,11 @@ pg_stat_statements_internal(FunctionCallInfo fcinfo,
 			values[i++] = Float8GetDatumFast(tmp.blk_read_time);
 			values[i++] = Float8GetDatumFast(tmp.blk_write_time);
 		}
+		if (api_version >= PGSS_V1_10)
+		{
+			values[i++] = Float8GetDatumFast(tmp.temp_blk_read_time);
+			values[i++] = Float8GetDatumFast(tmp.temp_blk_write_time);
+		}
 		if (api_version >= PGSS_V1_8)
 		{
 			char		buf[256];
@@ -1798,6 +1824,7 @@ pg_stat_statements_internal(FunctionCallInfo fcinfo,
 					 api_version == PGSS_V1_3 ? PG_STAT_STATEMENTS_COLS_V1_3 :
 					 api_version == PGSS_V1_8 ? PG_STAT_STATEMENTS_COLS_V1_8 :
 					 api_version == PGSS_V1_9 ? PG_STAT_STATEMENTS_COLS_V1_9 :
+					 api_version == PGSS_V1_10 ? PG_STAT_STATEMENTS_COLS_V1_10 :
 					 -1 /* fail if you forget to update this assert */ ));
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
diff --git a/contrib/pg_stat_statements/pg_stat_statements.control b/contrib/pg_stat_statements/pg_stat_statements.control
index 2f1ce6ed50..0747e48138 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.control
+++ b/contrib/pg_stat_statements/pg_stat_statements.control
@@ -1,5 +1,5 @@
 # pg_stat_statements extension
 comment = 'track planning and execution statistics of all SQL statements executed'
-default_version = '1.9'
+default_version = '1.10'
 module_pathname = '$libdir/pg_stat_statements'
 relocatable = true
diff --git a/contrib/pg_stat_statements/sql/oldextversions.sql b/contrib/pg_stat_statements/sql/oldextversions.sql
index f2e822acd3..ea0f7c143d 100644
--- a/contrib/pg_stat_statements/sql/oldextversions.sql
+++ b/contrib/pg_stat_statements/sql/oldextversions.sql
@@ -36,4 +36,14 @@ AlTER EXTENSION pg_stat_statements UPDATE TO '1.8';
 \d pg_stat_statements
 SELECT pg_get_functiondef('pg_stat_statements_reset'::regproc);
 
+-- New function pg_stat_statement_info, and new function
+-- and view for pg_stat_statements introduced in 1.9
+AlTER EXTENSION pg_stat_statements UPDATE TO '1.9';
+SELECT pg_get_functiondef('pg_stat_statements_info'::regproc);
+\d pg_stat_statements
+
+-- New functions and views for pg_stat_statements in 1.10
+AlTER EXTENSION pg_stat_statements UPDATE TO '1.10';
+\d pg_stat_statements
+
 DROP EXTENSION pg_stat_statements;
diff --git a/doc/src/sgml/pgstatstatements.sgml b/doc/src/sgml/pgstatstatements.sgml
index bc9d5bdbe3..6df09f1691 100644
--- a/doc/src/sgml/pgstatstatements.sgml
+++ b/doc/src/sgml/pgstatstatements.sgml
@@ -353,6 +353,26 @@
       </para></entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>temp_blk_read_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total time the statement spent reading temp blocks, in milliseconds
+       (if <xref linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>temp_blk_write_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total time the statement spent writing temp blocks, in milliseconds
+       (if <xref linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+      </para></entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>wal_records</structfield> <type>bigint</type>
-- 
2.24.3 (Apple Git-128)

v3-0001-Track-I-O-timing-for-temp-buffers.patch (application/octet-stream)
From a7d10344d24425944e935124f9a41887bf70a557 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 18 Aug 2021 14:55:17 +0900
Subject: [PATCH v3 1/2] Track I/O timing for temp buffers

Previously, the output of EXPLAIN with BUFFERS option shows only I/O
timing spent reading and writing shared and local buffers. This
commit shows I/O timing for temp buffers in the output of EXPLAIN. It
is helpful for users in a case where it is the temp buffers I/O that
is using most of the time.

Same as shared/local buffer I/O timings, temp buffers I/O timing is
shown only when track_io_timing is enabled.
---
 doc/src/sgml/ref/explain.sgml      |  8 +++---
 src/backend/commands/explain.c     | 42 ++++++++++++++++++++++++------
 src/backend/executor/instrument.c  |  6 +++++
 src/backend/storage/file/buffile.c | 29 ++++++++++++++++++++-
 src/include/executor/instrument.h  |  6 +++--
 5 files changed, 76 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index 4d758fb237..70c65c4ba1 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -174,10 +174,10 @@ ROLLBACK;
       Include information on buffer usage. Specifically, include the number of
       shared blocks hit, read, dirtied, and written, the number of local blocks
       hit, read, dirtied, and written, the number of temp blocks read and
-      written, and the time spent reading and writing data file blocks
-      (in milliseconds) if <xref linkend="guc-track-io-timing"/> is enabled.
-      A <emphasis>hit</emphasis> means that a read was avoided because the block was
-      found already in cache when needed.
+      written, and the time spent reading and writing data file blocks and
+      temp file blocks (in milliseconds) if <xref linkend="guc-track-io-timing"/>
+      is enabled.  A <emphasis>hit</emphasis> means that a read was avoided because
+      the block was found already in cache when needed.
       Shared blocks contain data from regular tables and indexes;
       local blocks contain data from temporary tables and indexes;
       while temp blocks contain short-term working data used in sorts, hashes,
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index de81379da3..f4aaf6d1f3 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3510,8 +3510,11 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
 								usage->temp_blks_written > 0);
 		bool		has_timing = (!INSTR_TIME_IS_ZERO(usage->blk_read_time) ||
 								  !INSTR_TIME_IS_ZERO(usage->blk_write_time));
+		bool		has_temp_timing = (!INSTR_TIME_IS_ZERO(usage->temp_blk_read_time) ||
+									   !INSTR_TIME_IS_ZERO(usage->temp_blk_write_time));
 		bool		show_planning = (planning && (has_shared ||
-												  has_local || has_temp || has_timing));
+												  has_local || has_temp || has_timing ||
+												  has_temp_timing));
 
 		if (show_planning)
 		{
@@ -3576,16 +3579,33 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
 		}
 
 		/* As above, show only positive counter values. */
-		if (has_timing)
+		if (has_timing || has_temp_timing)
 		{
 			ExplainIndentText(es);
 			appendStringInfoString(es->str, "I/O Timings:");
-			if (!INSTR_TIME_IS_ZERO(usage->blk_read_time))
-				appendStringInfo(es->str, " read=%0.3f",
-								 INSTR_TIME_GET_MILLISEC(usage->blk_read_time));
-			if (!INSTR_TIME_IS_ZERO(usage->blk_write_time))
-				appendStringInfo(es->str, " write=%0.3f",
-								 INSTR_TIME_GET_MILLISEC(usage->blk_write_time));
+
+			if (has_timing)
+			{
+				appendStringInfoString(es->str, " shared/local");
+				if (!INSTR_TIME_IS_ZERO(usage->blk_read_time))
+					appendStringInfo(es->str, " read=%0.3f",
+									 INSTR_TIME_GET_MILLISEC(usage->blk_read_time));
+				if (!INSTR_TIME_IS_ZERO(usage->blk_write_time))
+					appendStringInfo(es->str, " write=%0.3f",
+									 INSTR_TIME_GET_MILLISEC(usage->blk_write_time));
+				if (has_temp_timing)
+					appendStringInfoChar(es->str, ',');
+			}
+			if (has_temp_timing)
+			{
+				appendStringInfoString(es->str, " temp");
+				if (!INSTR_TIME_IS_ZERO(usage->temp_blk_read_time))
+					appendStringInfo(es->str, " read=%0.3f",
+									 INSTR_TIME_GET_MILLISEC(usage->temp_blk_read_time));
+				if (!INSTR_TIME_IS_ZERO(usage->temp_blk_write_time))
+					appendStringInfo(es->str, " write=%0.3f",
+									 INSTR_TIME_GET_MILLISEC(usage->temp_blk_write_time));
+			}
 			appendStringInfoChar(es->str, '\n');
 		}
 
@@ -3622,6 +3642,12 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
 			ExplainPropertyFloat("I/O Write Time", "ms",
 								 INSTR_TIME_GET_MILLISEC(usage->blk_write_time),
 								 3, es);
+			ExplainPropertyFloat("Temp I/O Read Time", "ms",
+								 INSTR_TIME_GET_MILLISEC(usage->temp_blk_read_time),
+								 3, es);
+			ExplainPropertyFloat("Temp I/O Write Time", "ms",
+								 INSTR_TIME_GET_MILLISEC(usage->temp_blk_write_time),
+								 3, es);
 		}
 	}
 }
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index c5ff02a842..ceff4727d4 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -237,6 +237,8 @@ BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
 	dst->temp_blks_written += add->temp_blks_written;
 	INSTR_TIME_ADD(dst->blk_read_time, add->blk_read_time);
 	INSTR_TIME_ADD(dst->blk_write_time, add->blk_write_time);
+	INSTR_TIME_ADD(dst->temp_blk_read_time, add->temp_blk_read_time);
+	INSTR_TIME_ADD(dst->temp_blk_write_time, add->temp_blk_write_time);
 }
 
 /* dst += add - sub */
@@ -259,6 +261,10 @@ BufferUsageAccumDiff(BufferUsage *dst,
 						  add->blk_read_time, sub->blk_read_time);
 	INSTR_TIME_ACCUM_DIFF(dst->blk_write_time,
 						  add->blk_write_time, sub->blk_write_time);
+	INSTR_TIME_ACCUM_DIFF(dst->temp_blk_read_time,
+						  add->temp_blk_read_time, sub->temp_blk_read_time);
+	INSTR_TIME_ACCUM_DIFF(dst->temp_blk_write_time,
+						  add->temp_blk_write_time, sub->temp_blk_write_time);
 }
 
 /* helper functions for WAL usage accumulation */
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 3108e96dc9..56b88594cc 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -429,6 +429,8 @@ static void
 BufFileLoadBuffer(BufFile *file)
 {
 	File		thisfile;
+	instr_time	io_start;
+	instr_time	io_time;
 
 	/*
 	 * Advance to next component file if necessary and possible.
@@ -440,10 +442,14 @@ BufFileLoadBuffer(BufFile *file)
 		file->curOffset = 0L;
 	}
 
+	thisfile = file->files[file->curFile];
+
+	if (track_io_timing)
+		INSTR_TIME_SET_CURRENT(io_start);
+
 	/*
 	 * Read whatever we can get, up to a full bufferload.
 	 */
-	thisfile = file->files[file->curFile];
 	file->nbytes = FileRead(thisfile,
 							file->buffer.data,
 							sizeof(file->buffer),
@@ -458,6 +464,13 @@ BufFileLoadBuffer(BufFile *file)
 						FilePathName(thisfile))));
 	}
 
+	if (track_io_timing)
+	{
+		INSTR_TIME_SET_CURRENT(io_time);
+		INSTR_TIME_SUBTRACT(io_time, io_start);
+		INSTR_TIME_ADD(pgBufferUsage.temp_blk_read_time, io_time);
+	}
+
 	/* we choose not to advance curOffset here */
 
 	if (file->nbytes > 0)
@@ -485,6 +498,8 @@ BufFileDumpBuffer(BufFile *file)
 	while (wpos < file->nbytes)
 	{
 		off_t		availbytes;
+		instr_time	io_start;
+		instr_time	io_time;
 
 		/*
 		 * Advance to next component file if necessary and possible.
@@ -507,6 +522,10 @@ BufFileDumpBuffer(BufFile *file)
 			bytestowrite = (int) availbytes;
 
 		thisfile = file->files[file->curFile];
+
+		if (track_io_timing)
+			INSTR_TIME_SET_CURRENT(io_start);
+
 		bytestowrite = FileWrite(thisfile,
 								 file->buffer.data + wpos,
 								 bytestowrite,
@@ -517,6 +536,14 @@ BufFileDumpBuffer(BufFile *file)
 					(errcode_for_file_access(),
 					 errmsg("could not write to file \"%s\": %m",
 							FilePathName(thisfile))));
+
+		if (track_io_timing)
+		{
+			INSTR_TIME_SET_CURRENT(io_time);
+			INSTR_TIME_SUBTRACT(io_time, io_start);
+			INSTR_TIME_ADD(pgBufferUsage.temp_blk_write_time, io_time);
+		}
+
 		file->curOffset += bytestowrite;
 		wpos += bytestowrite;
 
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 1b7157bdd1..6c080beb81 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -33,8 +33,10 @@ typedef struct BufferUsage
 	int64		local_blks_written; /* # of local disk blocks written */
 	int64		temp_blks_read; /* # of temp blocks read */
 	int64		temp_blks_written;	/* # of temp blocks written */
-	instr_time	blk_read_time;	/* time spent reading */
-	instr_time	blk_write_time; /* time spent writing */
+	instr_time	blk_read_time;	/* time spent reading blocks */
+	instr_time	blk_write_time; /* time spent writing blocks */
+	instr_time	temp_blk_read_time;	/* time spent reading temp blocks */
+	instr_time	temp_blk_write_time; /* time spent writing temp blocks */
 } BufferUsage;
 
 /*
-- 
2.24.3 (Apple Git-128)

#12Julien Rouhaud
rjuju123@gmail.com
In reply to: Masahiko Sawada (#11)
Re: Showing I/O timings spent reading/writing temp buffers in EXPLAIN

Hi,

On Tue, Mar 01, 2022 at 11:35:32AM +0900, Masahiko Sawada wrote:

On Wed, Jan 19, 2022 at 5:52 PM Julien Rouhaud <rjuju123@gmail.com> wrote:

It seems that the regression tests aren't entirely stable, per cfbot:
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/36/3298.

The failures look like:

diff -U3 /tmp/cirrus-ci-build/src/test/recovery/../regress/expected/explain.out /tmp/cirrus-ci-build/src/test/recovery/tmp_check/results/explain.out
--- /tmp/cirrus-ci-build/src/test/recovery/../regress/expected/explain.out      2022-01-19 03:50:37.087931908 +0000
+++ /tmp/cirrus-ci-build/src/test/recovery/tmp_check/results/explain.out        2022-01-19 03:57:41.013616137 +0000
@@ -512,9 +512,10 @@
I/O Timings: temp read=N.N write=N.N
->  Function Scan on generate_series  (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N loops=N)
I/O Timings: temp read=N.N write=N.N
+   I/O Timings: shared/local read=N.N
Planning Time: N.N ms
Execution Time: N.N ms
-(6 rows)
+(7 rows)

select explain_filter('explain (analyze, buffers, format json) select count(*) from generate_series(1, 100000)');
explain_filter

I don't see any obvious error in the code, so I'm wondering if it's simply
the result of populating the cache with generate_series() info.

Thank you for reporting.

You're right, the regression test with track_io_timing = on is not
entirely stable because we may read catalog data during planning time,
resulting in an additional line in the EXPLAIN output. I've removed
the regression tests.

Hmm, but AFAICS the json format would be stable as the counters are always
shown even if zero. So just doing the json format first and then the text
format should also work. Although if you're really unlucky there could be a
cache invalidation in between so we could just ignore the text format. But I
think we should at least keep a regression test with the json format, with a
comment explaining why only this one is tested.

I've attached updated patches. I've also included pg_stat_statements
support for tracking temp I/O timing.

Great!

Some other considerations:

- should we update pg_stat_statements documentation (and comments) for
blk_(read|write)_time to mention that it's for *data file* blocks rather than
simply blocks? It seems obvious now that we'd have
temp_blk_(read|write)_time, but still. This part should probably be
backpatched.

- not really your patch's fault I guess, but I see that extendBufFile() isn't
handled. There shouldn't be much activity there so maybe it's ok.
This is likely because smgr_extend is also not handled, but this one seems
much more likely to take quite some time, and therefore should bump the
timing counters.

#13Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Julien Rouhaud (#12)
Re: Showing I/O timings spent reading/writing temp buffers in EXPLAIN

On Mon, Apr 4, 2022 at 1:30 PM Julien Rouhaud <rjuju123@gmail.com> wrote:

Hi,

On Tue, Mar 01, 2022 at 11:35:32AM +0900, Masahiko Sawada wrote:

On Wed, Jan 19, 2022 at 5:52 PM Julien Rouhaud <rjuju123@gmail.com> wrote:

It seems that the regression tests aren't entirely stable, per cfbot:
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/36/3298.

The failures look like:

diff -U3 /tmp/cirrus-ci-build/src/test/recovery/../regress/expected/explain.out /tmp/cirrus-ci-build/src/test/recovery/tmp_check/results/explain.out
--- /tmp/cirrus-ci-build/src/test/recovery/../regress/expected/explain.out      2022-01-19 03:50:37.087931908 +0000
+++ /tmp/cirrus-ci-build/src/test/recovery/tmp_check/results/explain.out        2022-01-19 03:57:41.013616137 +0000
@@ -512,9 +512,10 @@
I/O Timings: temp read=N.N write=N.N
->  Function Scan on generate_series  (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N loops=N)
I/O Timings: temp read=N.N write=N.N
+   I/O Timings: shared/local read=N.N
Planning Time: N.N ms
Execution Time: N.N ms
-(6 rows)
+(7 rows)

select explain_filter('explain (analyze, buffers, format json) select count(*) from generate_series(1, 100000)');
explain_filter

I don't see any obvious error in the code, so I'm wondering if it's simply
the result of populating the cache with generate_series() info.

Thank you for reporting.

You're right, the regression test with track_io_timing = on is not
entirely stable because we may read catalog data during planning time,
resulting in an additional line in the EXPLAIN output. I've removed
the regression tests.

Hmm, but AFAICS the json format would be stable as the counters are always
shown even if zero. So just doing the json format first and then the text
format should also work. Although if you're really unlucky there could be a
cache invalidation in between so we could just ignore the text format. But I
think we should at least keep a regression test with the json format, with a
comment explaining why only this one is tested.

Fair point. By commit 7e12256b478 we disabled track_io_timing, but
probably we can temporarily enable it and run one query with "buffers"
and "format json" options.

I've attached updated patches. I've also included pg_stat_statements
support for tracking temp I/O timing.

Great!

Some other considerations:

- should we update pg_stat_statements documentation (and comments) for
blk_(read|write)_time to mention that it's for *data file* blocks rather than
simply blocks? It seems obvious now that we'd have
temp_blk_(read|write)_time, but still. This part should probably be
backpatched.

Agreed.

- not really your patch's fault I guess, but I see that extendBufFile() isn't
handled. There shouldn't be much activity there so maybe it's ok.
This is likely because smgr_extend is also not handled, but this one seems
much more likely to take quite some time, and therefore should bump the
timing counters.

You mean we should include the time for opening files as write time?
IIUC smgrextend() writes data while extending the file, whereas
extendBufFile() doesn't do that but just opens a new file.
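
(Just so we're talking about the same thing, a rough, uncompiled sketch of what
that would look like in extendBufFile(), reusing the same pattern as in
BufFileDumpBuffer() and charging the elapsed time to the temp write counter;
the actual open/create call is elided and this is not part of the attached
patches:)

	instr_time	io_start;
	instr_time	io_time;

	if (track_io_timing)
		INSTR_TIME_SET_CURRENT(io_start);

	/* ... the existing call that opens/creates the next temp segment ... */

	if (track_io_timing)
	{
		INSTR_TIME_SET_CURRENT(io_time);
		INSTR_TIME_SUBTRACT(io_time, io_start);
		INSTR_TIME_ADD(pgBufferUsage.temp_blk_write_time, io_time);
	}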

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#14Julien Rouhaud
rjuju123@gmail.com
In reply to: Masahiko Sawada (#13)
Re: Showing I/O timings spent reading/writing temp buffers in EXPLAIN

On Tue, Apr 05, 2022 at 12:51:12AM +0900, Masahiko Sawada wrote:

On Mon, Apr 4, 2022 at 1:30 PM Julien Rouhaud <rjuju123@gmail.com> wrote:

Hmm, but AFAICS the json format would be stable as the counters are always
shown even if zero. So just doing the json format first and then the text
format should also work. Although if you're really unlucky there could be a
cache invalidation in between so we could just ignore the text format. But I
think we should at least keep a regression test with the json format, with a
comment explaining why only this one is tested.

Fair point. By commit 7e12256b478 we disabled track_io_timing, but
probably we can temporarily enable it and run one query with "buffers"
and "format json" options.

Yes, enabling it for just this query. It can't really catch any problem with
the values themselves, but at least the new code path would be partially
exercised.

- not really your patch fault I guess, but I see that extendBufFile() isn't
handled. There shouldn't be much activity there so maybe it's ok.
This is likely because smgr_extend is also not handled, but this one seems
much more likely to take quite some time, and therefore should bump the
timing counters.

You mean we should include the time for opening files as write time?

Yes. In normal circumstances it shouldn't need a lot of time to do that, but
I'm not so sure with e.g. network filesystems. I'm not strongly in favor of
counting it, especially since smgrextend doesn't either.

IIUC smgrextend() writes data while extending the file, whereas
extendBufFile() doesn't do that but just opens a new file.

Note that smgrextend can also call register_dirty_segment(), which can also
take some time, so we can't just assume that all the time there is IO time.

#15Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Julien Rouhaud (#14)
2 attachment(s)
Re: Showing I/O timings spent reading/writing temp buffers in EXPLAIN

On Tue, Apr 5, 2022 at 1:31 AM Julien Rouhaud <rjuju123@gmail.com> wrote:

On Tue, Apr 05, 2022 at 12:51:12AM +0900, Masahiko Sawada wrote:

On Mon, Apr 4, 2022 at 1:30 PM Julien Rouhaud <rjuju123@gmail.com> wrote:

Hmm, but AFAICS the json format would be stable as the counters are always
shown even if zero. So just doing the json format first and then the text
format should also work. Although if you're really unlucky there could be a
cache invalidation in between so we could just ignore the text format. But I
think we should at least keep a regression test with the json format, with a
comment explaining why only this one is tested.

Fair point. By commit 7e12256b478 we disabled track_io_timing, but
probably we can temporarily enable it and run one query with "buffers"
and "format json" options.

Yes, enabling it for just this query. It can't really catch any problem with
the values themselves, but at least the new code path would be partially
exercised.

- not really your patch's fault I guess, but I see that extendBufFile() isn't
handled. There shouldn't be much activity there so maybe it's ok.
This is likely because smgr_extend is also not handled, but this one seems
much more likely to take quite some time, and therefore should bump the
timing counters.

You mean we should include the time for opening files as write time?

Yes. In normal circumstances it shouldn't need a lot of time to do that, but
I'm not so sure with e.g. network filesystems. I'm not strongly in favor of
counting it, especially since smgrextend doesn't either.

Good point. I think that adding a new place to track I/O timing can be
a separate patch, so we can probably work on it for PG16 or later.

I've attached updated patches; please review them.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

Attachments:

v4-0001-Track-I-O-timing-for-temp-buffers.patch (application/octet-stream)
From b900ab6fe2b9fc22ebd2304ffc5aa51adcfd7898 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 18 Aug 2021 14:55:17 +0900
Subject: [PATCH v4 1/2] Track I/O timing for temp buffers

Previously, the output of EXPLAIN with BUFFERS option shows only I/O
timing spent reading and writing shared and local buffers. This
commit shows I/O timing for temp buffers in the output of EXPLAIN. It
is helpful for users in a case where it is the temp buffers I/O that
is using most of the time.

Same as shared/local buffer I/O timings, temp buffers I/O timing is
shown only when track_io_timing is enabled.
---
 doc/src/sgml/ref/explain.sgml         |   8 +-
 src/backend/commands/explain.c        |  42 ++++++++--
 src/backend/executor/instrument.c     |   6 ++
 src/backend/storage/file/buffile.c    |  29 ++++++-
 src/include/executor/instrument.h     |   6 +-
 src/test/regress/expected/explain.out | 112 ++++++++++++++------------
 src/test/regress/sql/explain.sql      |   8 +-
 7 files changed, 145 insertions(+), 66 deletions(-)

diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index 4d758fb237..70c65c4ba1 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -174,10 +174,10 @@ ROLLBACK;
       Include information on buffer usage. Specifically, include the number of
       shared blocks hit, read, dirtied, and written, the number of local blocks
       hit, read, dirtied, and written, the number of temp blocks read and
-      written, and the time spent reading and writing data file blocks
-      (in milliseconds) if <xref linkend="guc-track-io-timing"/> is enabled.
-      A <emphasis>hit</emphasis> means that a read was avoided because the block was
-      found already in cache when needed.
+      written, and the time spent reading and writing data file blocks and
+      temp file blocks (in milliseconds) if <xref linkend="guc-track-io-timing"/>
+      is enabled.  A <emphasis>hit</emphasis> means that a read was avoided because
+      the block was found already in cache when needed.
       Shared blocks contain data from regular tables and indexes;
       local blocks contain data from temporary tables and indexes;
       while temp blocks contain short-term working data used in sorts, hashes,
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index cb13227db1..197a11d84b 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3513,8 +3513,11 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
 								usage->temp_blks_written > 0);
 		bool		has_timing = (!INSTR_TIME_IS_ZERO(usage->blk_read_time) ||
 								  !INSTR_TIME_IS_ZERO(usage->blk_write_time));
+		bool		has_temp_timing = (!INSTR_TIME_IS_ZERO(usage->temp_blk_read_time) ||
+									   !INSTR_TIME_IS_ZERO(usage->temp_blk_write_time));
 		bool		show_planning = (planning && (has_shared ||
-												  has_local || has_temp || has_timing));
+												  has_local || has_temp || has_timing ||
+												  has_temp_timing));
 
 		if (show_planning)
 		{
@@ -3579,16 +3582,33 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
 		}
 
 		/* As above, show only positive counter values. */
-		if (has_timing)
+		if (has_timing || has_temp_timing)
 		{
 			ExplainIndentText(es);
 			appendStringInfoString(es->str, "I/O Timings:");
-			if (!INSTR_TIME_IS_ZERO(usage->blk_read_time))
-				appendStringInfo(es->str, " read=%0.3f",
-								 INSTR_TIME_GET_MILLISEC(usage->blk_read_time));
-			if (!INSTR_TIME_IS_ZERO(usage->blk_write_time))
-				appendStringInfo(es->str, " write=%0.3f",
-								 INSTR_TIME_GET_MILLISEC(usage->blk_write_time));
+
+			if (has_timing)
+			{
+				appendStringInfoString(es->str, " shared/local");
+				if (!INSTR_TIME_IS_ZERO(usage->blk_read_time))
+					appendStringInfo(es->str, " read=%0.3f",
+									 INSTR_TIME_GET_MILLISEC(usage->blk_read_time));
+				if (!INSTR_TIME_IS_ZERO(usage->blk_write_time))
+					appendStringInfo(es->str, " write=%0.3f",
+									 INSTR_TIME_GET_MILLISEC(usage->blk_write_time));
+				if (has_temp_timing)
+					appendStringInfoChar(es->str, ',');
+			}
+			if (has_temp_timing)
+			{
+				appendStringInfoString(es->str, " temp");
+				if (!INSTR_TIME_IS_ZERO(usage->temp_blk_read_time))
+					appendStringInfo(es->str, " read=%0.3f",
+									 INSTR_TIME_GET_MILLISEC(usage->temp_blk_read_time));
+				if (!INSTR_TIME_IS_ZERO(usage->temp_blk_write_time))
+					appendStringInfo(es->str, " write=%0.3f",
+									 INSTR_TIME_GET_MILLISEC(usage->temp_blk_write_time));
+			}
 			appendStringInfoChar(es->str, '\n');
 		}
 
@@ -3625,6 +3645,12 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
 			ExplainPropertyFloat("I/O Write Time", "ms",
 								 INSTR_TIME_GET_MILLISEC(usage->blk_write_time),
 								 3, es);
+			ExplainPropertyFloat("Temp I/O Read Time", "ms",
+								 INSTR_TIME_GET_MILLISEC(usage->temp_blk_read_time),
+								 3, es);
+			ExplainPropertyFloat("Temp I/O Write Time", "ms",
+								 INSTR_TIME_GET_MILLISEC(usage->temp_blk_write_time),
+								 3, es);
 		}
 	}
 }
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index c5ff02a842..ceff4727d4 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -237,6 +237,8 @@ BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
 	dst->temp_blks_written += add->temp_blks_written;
 	INSTR_TIME_ADD(dst->blk_read_time, add->blk_read_time);
 	INSTR_TIME_ADD(dst->blk_write_time, add->blk_write_time);
+	INSTR_TIME_ADD(dst->temp_blk_read_time, add->temp_blk_read_time);
+	INSTR_TIME_ADD(dst->temp_blk_write_time, add->temp_blk_write_time);
 }
 
 /* dst += add - sub */
@@ -259,6 +261,10 @@ BufferUsageAccumDiff(BufferUsage *dst,
 						  add->blk_read_time, sub->blk_read_time);
 	INSTR_TIME_ACCUM_DIFF(dst->blk_write_time,
 						  add->blk_write_time, sub->blk_write_time);
+	INSTR_TIME_ACCUM_DIFF(dst->temp_blk_read_time,
+						  add->temp_blk_read_time, sub->temp_blk_read_time);
+	INSTR_TIME_ACCUM_DIFF(dst->temp_blk_write_time,
+						  add->temp_blk_write_time, sub->temp_blk_write_time);
 }
 
 /* helper functions for WAL usage accumulation */
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 3108e96dc9..56b88594cc 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -429,6 +429,8 @@ static void
 BufFileLoadBuffer(BufFile *file)
 {
 	File		thisfile;
+	instr_time	io_start;
+	instr_time	io_time;
 
 	/*
 	 * Advance to next component file if necessary and possible.
@@ -440,10 +442,14 @@ BufFileLoadBuffer(BufFile *file)
 		file->curOffset = 0L;
 	}
 
+	thisfile = file->files[file->curFile];
+
+	if (track_io_timing)
+		INSTR_TIME_SET_CURRENT(io_start);
+
 	/*
 	 * Read whatever we can get, up to a full bufferload.
 	 */
-	thisfile = file->files[file->curFile];
 	file->nbytes = FileRead(thisfile,
 							file->buffer.data,
 							sizeof(file->buffer),
@@ -458,6 +464,13 @@ BufFileLoadBuffer(BufFile *file)
 						FilePathName(thisfile))));
 	}
 
+	if (track_io_timing)
+	{
+		INSTR_TIME_SET_CURRENT(io_time);
+		INSTR_TIME_SUBTRACT(io_time, io_start);
+		INSTR_TIME_ADD(pgBufferUsage.temp_blk_read_time, io_time);
+	}
+
 	/* we choose not to advance curOffset here */
 
 	if (file->nbytes > 0)
@@ -485,6 +498,8 @@ BufFileDumpBuffer(BufFile *file)
 	while (wpos < file->nbytes)
 	{
 		off_t		availbytes;
+		instr_time	io_start;
+		instr_time	io_time;
 
 		/*
 		 * Advance to next component file if necessary and possible.
@@ -507,6 +522,10 @@ BufFileDumpBuffer(BufFile *file)
 			bytestowrite = (int) availbytes;
 
 		thisfile = file->files[file->curFile];
+
+		if (track_io_timing)
+			INSTR_TIME_SET_CURRENT(io_start);
+
 		bytestowrite = FileWrite(thisfile,
 								 file->buffer.data + wpos,
 								 bytestowrite,
@@ -517,6 +536,14 @@ BufFileDumpBuffer(BufFile *file)
 					(errcode_for_file_access(),
 					 errmsg("could not write to file \"%s\": %m",
 							FilePathName(thisfile))));
+
+		if (track_io_timing)
+		{
+			INSTR_TIME_SET_CURRENT(io_time);
+			INSTR_TIME_SUBTRACT(io_time, io_start);
+			INSTR_TIME_ADD(pgBufferUsage.temp_blk_write_time, io_time);
+		}
+
 		file->curOffset += bytestowrite;
 		wpos += bytestowrite;
 
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 1b7157bdd1..6c080beb81 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -33,8 +33,10 @@ typedef struct BufferUsage
 	int64		local_blks_written; /* # of local disk blocks written */
 	int64		temp_blks_read; /* # of temp blocks read */
 	int64		temp_blks_written;	/* # of temp blocks written */
-	instr_time	blk_read_time;	/* time spent reading */
-	instr_time	blk_write_time; /* time spent writing */
+	instr_time	blk_read_time;	/* time spent reading blocks */
+	instr_time	blk_write_time; /* time spent writing blocks */
+	instr_time	temp_blk_read_time;	/* time spent reading temp blocks */
+	instr_time	temp_blk_write_time; /* time spent writing temp blocks */
 } BufferUsage;
 
 /*
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index bc36175921..49890625c8 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -85,56 +85,6 @@ select explain_filter('explain (analyze, buffers, format text) select * from int
  Execution Time: N.N ms
 (3 rows)
 
-select explain_filter('explain (analyze, buffers, format json) select * from int8_tbl i8');
-           explain_filter           
-------------------------------------
- [                                 +
-   {                               +
-     "Plan": {                     +
-       "Node Type": "Seq Scan",    +
-       "Parallel Aware": false,    +
-       "Async Capable": false,     +
-       "Relation Name": "int8_tbl",+
-       "Alias": "i8",              +
-       "Startup Cost": N.N,        +
-       "Total Cost": N.N,          +
-       "Plan Rows": N,             +
-       "Plan Width": N,            +
-       "Actual Startup Time": N.N, +
-       "Actual Total Time": N.N,   +
-       "Actual Rows": N,           +
-       "Actual Loops": N,          +
-       "Shared Hit Blocks": N,     +
-       "Shared Read Blocks": N,    +
-       "Shared Dirtied Blocks": N, +
-       "Shared Written Blocks": N, +
-       "Local Hit Blocks": N,      +
-       "Local Read Blocks": N,     +
-       "Local Dirtied Blocks": N,  +
-       "Local Written Blocks": N,  +
-       "Temp Read Blocks": N,      +
-       "Temp Written Blocks": N    +
-     },                            +
-     "Planning": {                 +
-       "Shared Hit Blocks": N,     +
-       "Shared Read Blocks": N,    +
-       "Shared Dirtied Blocks": N, +
-       "Shared Written Blocks": N, +
-       "Local Hit Blocks": N,      +
-       "Local Read Blocks": N,     +
-       "Local Dirtied Blocks": N,  +
-       "Local Written Blocks": N,  +
-       "Temp Read Blocks": N,      +
-       "Temp Written Blocks": N    +
-     },                            +
-     "Planning Time": N.N,         +
-     "Triggers": [                 +
-     ],                            +
-     "Execution Time": N.N         +
-   }                               +
- ]
-(1 row)
-
 select explain_filter('explain (analyze, buffers, format xml) select * from int8_tbl i8');
                      explain_filter                     
 --------------------------------------------------------
@@ -276,6 +226,68 @@ select explain_filter('explain (buffers, format json) select * from int8_tbl i8'
  ]
 (1 row)
 
+-- Check output including I/O timings. Since these fields are conditional
+-- we check them only in json format.
+set track_io_timing = on;
+select explain_filter('explain (analyze, buffers, format json) select * from int8_tbl i8');
+           explain_filter           
+------------------------------------
+ [                                 +
+   {                               +
+     "Plan": {                     +
+       "Node Type": "Seq Scan",    +
+       "Parallel Aware": false,    +
+       "Async Capable": false,     +
+       "Relation Name": "int8_tbl",+
+       "Alias": "i8",              +
+       "Startup Cost": N.N,        +
+       "Total Cost": N.N,          +
+       "Plan Rows": N,             +
+       "Plan Width": N,            +
+       "Actual Startup Time": N.N, +
+       "Actual Total Time": N.N,   +
+       "Actual Rows": N,           +
+       "Actual Loops": N,          +
+       "Shared Hit Blocks": N,     +
+       "Shared Read Blocks": N,    +
+       "Shared Dirtied Blocks": N, +
+       "Shared Written Blocks": N, +
+       "Local Hit Blocks": N,      +
+       "Local Read Blocks": N,     +
+       "Local Dirtied Blocks": N,  +
+       "Local Written Blocks": N,  +
+       "Temp Read Blocks": N,      +
+       "Temp Written Blocks": N,   +
+       "I/O Read Time": N.N,       +
+       "I/O Write Time": N.N,      +
+       "Temp I/O Read Time": N.N,  +
+       "Temp I/O Write Time": N.N  +
+     },                            +
+     "Planning": {                 +
+       "Shared Hit Blocks": N,     +
+       "Shared Read Blocks": N,    +
+       "Shared Dirtied Blocks": N, +
+       "Shared Written Blocks": N, +
+       "Local Hit Blocks": N,      +
+       "Local Read Blocks": N,     +
+       "Local Dirtied Blocks": N,  +
+       "Local Written Blocks": N,  +
+       "Temp Read Blocks": N,      +
+       "Temp Written Blocks": N,   +
+       "I/O Read Time": N.N,       +
+       "I/O Write Time": N.N,      +
+       "Temp I/O Read Time": N.N,  +
+       "Temp I/O Write Time": N.N  +
+     },                            +
+     "Planning Time": N.N,         +
+     "Triggers": [                 +
+     ],                            +
+     "Execution Time": N.N         +
+   }                               +
+ ]
+(1 row)
+
+set track_io_timing = off;
 -- SETTINGS option
 -- We have to ignore other settings that might be imposed by the environment,
 -- so printing the whole Settings field unfortunately won't do.
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index 5069fa8795..6c30e933ef 100644
--- a/src/test/regress/sql/explain.sql
+++ b/src/test/regress/sql/explain.sql
@@ -65,12 +65,18 @@ select explain_filter('explain select * from int8_tbl i8');
 select explain_filter('explain (analyze) select * from int8_tbl i8');
 select explain_filter('explain (analyze, verbose) select * from int8_tbl i8');
 select explain_filter('explain (analyze, buffers, format text) select * from int8_tbl i8');
-select explain_filter('explain (analyze, buffers, format json) select * from int8_tbl i8');
 select explain_filter('explain (analyze, buffers, format xml) select * from int8_tbl i8');
 select explain_filter('explain (analyze, buffers, format yaml) select * from int8_tbl i8');
 select explain_filter('explain (buffers, format text) select * from int8_tbl i8');
 select explain_filter('explain (buffers, format json) select * from int8_tbl i8');
 
+-- Check output including I/O timings. Since these fields are conditional
+-- we check them only in json format.
+
+set track_io_timing = on;
+select explain_filter('explain (analyze, buffers, format json) select * from int8_tbl i8');
+set track_io_timing = off;
+
 -- SETTINGS option
 -- We have to ignore other settings that might be imposed by the environment,
 -- so printing the whole Settings field unfortunately won't do.
-- 
2.24.3 (Apple Git-128)

v4-0002-pg_stat_statements-Track-I-O-timing-for-temp-bloc.patch (application/octet-stream)
From be47febf4d96d307f90cc5dbb5d5548ae39119c2 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 1 Mar 2022 10:42:59 +0900
Subject: [PATCH v4 2/2] pg_stat_statements: Track I/O timing for temp blocks.

This commit adds new columns temp_blk_read_time and
temp_blk_write_time to pg_stat_statements function, showing the time
spent on disk I/O for temp blocks.

Bumped pg_stat_statements version to 1.10 for newly added columns.

XXX: need to bump PGSS_FILE_HEADER.
---
 contrib/pg_stat_statements/Makefile           |  3 +-
 .../expected/oldextversions.out               | 94 +++++++++++++++++++
 .../pg_stat_statements--1.9--1.10.sql         | 59 ++++++++++++
 .../pg_stat_statements/pg_stat_statements.c   | 35 ++++++-
 .../pg_stat_statements.control                |  2 +-
 .../pg_stat_statements/sql/oldextversions.sql | 10 ++
 doc/src/sgml/pgstatstatements.sgml            | 24 ++++-
 7 files changed, 219 insertions(+), 8 deletions(-)
 create mode 100644 contrib/pg_stat_statements/pg_stat_statements--1.9--1.10.sql

diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile
index 7fabd96f38..edc40c8bbf 100644
--- a/contrib/pg_stat_statements/Makefile
+++ b/contrib/pg_stat_statements/Makefile
@@ -6,7 +6,8 @@ OBJS = \
 	pg_stat_statements.o
 
 EXTENSION = pg_stat_statements
-DATA = pg_stat_statements--1.4.sql pg_stat_statements--1.8--1.9.sql \
+DATA = pg_stat_statements--1.4.sql \
+	pg_stat_statements--1.9--1.10.sql pg_stat_statements--1.8--1.9.sql \
 	pg_stat_statements--1.7--1.8.sql pg_stat_statements--1.6--1.7.sql \
 	pg_stat_statements--1.5--1.6.sql pg_stat_statements--1.4--1.5.sql \
 	pg_stat_statements--1.3--1.4.sql pg_stat_statements--1.2--1.3.sql \
diff --git a/contrib/pg_stat_statements/expected/oldextversions.out b/contrib/pg_stat_statements/expected/oldextversions.out
index f18c08838f..fbd60380ee 100644
--- a/contrib/pg_stat_statements/expected/oldextversions.out
+++ b/contrib/pg_stat_statements/expected/oldextversions.out
@@ -136,4 +136,98 @@ SELECT pg_get_functiondef('pg_stat_statements_reset'::regproc);
  
 (1 row)
 
+-- New function pg_stat_statement_info, and new function
+-- and view for pg_stat_statements introduced in 1.9
+AlTER EXTENSION pg_stat_statements UPDATE TO '1.9';
+SELECT pg_get_functiondef('pg_stat_statements_info'::regproc);
+                                                   pg_get_functiondef                                                    
+-------------------------------------------------------------------------------------------------------------------------
+ CREATE OR REPLACE FUNCTION public.pg_stat_statements_info(OUT dealloc bigint, OUT stats_reset timestamp with time zone)+
+  RETURNS record                                                                                                        +
+  LANGUAGE c                                                                                                            +
+  PARALLEL SAFE STRICT                                                                                                  +
+ AS '$libdir/pg_stat_statements', $function$pg_stat_statements_info$function$                                           +
+ 
+(1 row)
+
+\d pg_stat_statements
+                    View "public.pg_stat_statements"
+       Column        |       Type       | Collation | Nullable | Default 
+---------------------+------------------+-----------+----------+---------
+ userid              | oid              |           |          | 
+ dbid                | oid              |           |          | 
+ toplevel            | boolean          |           |          | 
+ queryid             | bigint           |           |          | 
+ query               | text             |           |          | 
+ plans               | bigint           |           |          | 
+ total_plan_time     | double precision |           |          | 
+ min_plan_time       | double precision |           |          | 
+ max_plan_time       | double precision |           |          | 
+ mean_plan_time      | double precision |           |          | 
+ stddev_plan_time    | double precision |           |          | 
+ calls               | bigint           |           |          | 
+ total_exec_time     | double precision |           |          | 
+ min_exec_time       | double precision |           |          | 
+ max_exec_time       | double precision |           |          | 
+ mean_exec_time      | double precision |           |          | 
+ stddev_exec_time    | double precision |           |          | 
+ rows                | bigint           |           |          | 
+ shared_blks_hit     | bigint           |           |          | 
+ shared_blks_read    | bigint           |           |          | 
+ shared_blks_dirtied | bigint           |           |          | 
+ shared_blks_written | bigint           |           |          | 
+ local_blks_hit      | bigint           |           |          | 
+ local_blks_read     | bigint           |           |          | 
+ local_blks_dirtied  | bigint           |           |          | 
+ local_blks_written  | bigint           |           |          | 
+ temp_blks_read      | bigint           |           |          | 
+ temp_blks_written   | bigint           |           |          | 
+ blk_read_time       | double precision |           |          | 
+ blk_write_time      | double precision |           |          | 
+ wal_records         | bigint           |           |          | 
+ wal_fpi             | bigint           |           |          | 
+ wal_bytes           | numeric          |           |          | 
+
+-- New functions and views for pg_stat_statements in 1.10
+AlTER EXTENSION pg_stat_statements UPDATE TO '1.10';
+\d pg_stat_statements
+                    View "public.pg_stat_statements"
+       Column        |       Type       | Collation | Nullable | Default 
+---------------------+------------------+-----------+----------+---------
+ userid              | oid              |           |          | 
+ dbid                | oid              |           |          | 
+ toplevel            | boolean          |           |          | 
+ queryid             | bigint           |           |          | 
+ query               | text             |           |          | 
+ plans               | bigint           |           |          | 
+ total_plan_time     | double precision |           |          | 
+ min_plan_time       | double precision |           |          | 
+ max_plan_time       | double precision |           |          | 
+ mean_plan_time      | double precision |           |          | 
+ stddev_plan_time    | double precision |           |          | 
+ calls               | bigint           |           |          | 
+ total_exec_time     | double precision |           |          | 
+ min_exec_time       | double precision |           |          | 
+ max_exec_time       | double precision |           |          | 
+ mean_exec_time      | double precision |           |          | 
+ stddev_exec_time    | double precision |           |          | 
+ rows                | bigint           |           |          | 
+ shared_blks_hit     | bigint           |           |          | 
+ shared_blks_read    | bigint           |           |          | 
+ shared_blks_dirtied | bigint           |           |          | 
+ shared_blks_written | bigint           |           |          | 
+ local_blks_hit      | bigint           |           |          | 
+ local_blks_read     | bigint           |           |          | 
+ local_blks_dirtied  | bigint           |           |          | 
+ local_blks_written  | bigint           |           |          | 
+ temp_blks_read      | bigint           |           |          | 
+ temp_blks_written   | bigint           |           |          | 
+ blk_read_time       | double precision |           |          | 
+ blk_write_time      | double precision |           |          | 
+ temp_blk_read_time  | double precision |           |          | 
+ temp_blk_write_time | double precision |           |          | 
+ wal_records         | bigint           |           |          | 
+ wal_fpi             | bigint           |           |          | 
+ wal_bytes           | numeric          |           |          | 
+
 DROP EXTENSION pg_stat_statements;
diff --git a/contrib/pg_stat_statements/pg_stat_statements--1.9--1.10.sql b/contrib/pg_stat_statements/pg_stat_statements--1.9--1.10.sql
new file mode 100644
index 0000000000..ca777f14e8
--- /dev/null
+++ b/contrib/pg_stat_statements/pg_stat_statements--1.9--1.10.sql
@@ -0,0 +1,59 @@
+/* contrib/pg_stat_statements/pg_stat_statements--1.9--1.10.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION pg_stat_statements UPDATE TO '1.10'" to load this file. \quit
+
+/* First we have to remove them from the extension */
+ALTER EXTENSION pg_stat_statements DROP VIEW pg_stat_statements;
+ALTER EXTENSION pg_stat_statements DROP FUNCTION pg_stat_statements(boolean);
+
+/* Then we can drop them */
+DROP VIEW pg_stat_statements;
+DROP FUNCTION pg_stat_statements(boolean);
+
+/* Now redefine */
+CREATE FUNCTION pg_stat_statements(IN showtext boolean,
+    OUT userid oid,
+    OUT dbid oid,
+    OUT toplevel bool,
+    OUT queryid bigint,
+    OUT query text,
+    OUT plans int8,
+    OUT total_plan_time float8,
+    OUT min_plan_time float8,
+    OUT max_plan_time float8,
+    OUT mean_plan_time float8,
+    OUT stddev_plan_time float8,
+    OUT calls int8,
+    OUT total_exec_time float8,
+    OUT min_exec_time float8,
+    OUT max_exec_time float8,
+    OUT mean_exec_time float8,
+    OUT stddev_exec_time float8,
+    OUT rows int8,
+    OUT shared_blks_hit int8,
+    OUT shared_blks_read int8,
+    OUT shared_blks_dirtied int8,
+    OUT shared_blks_written int8,
+    OUT local_blks_hit int8,
+    OUT local_blks_read int8,
+    OUT local_blks_dirtied int8,
+    OUT local_blks_written int8,
+    OUT temp_blks_read int8,
+    OUT temp_blks_written int8,
+    OUT blk_read_time float8,
+    OUT blk_write_time float8,
+    OUT temp_blk_read_time float8,
+    OUT temp_blk_write_time float8,
+    OUT wal_records int8,
+    OUT wal_fpi int8,
+    OUT wal_bytes numeric
+)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'pg_stat_statements_1_10'
+LANGUAGE C STRICT VOLATILE PARALLEL SAFE;
+
+CREATE VIEW pg_stat_statements AS
+  SELECT * FROM pg_stat_statements(true);
+
+GRANT SELECT ON pg_stat_statements TO PUBLIC;
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 55786ae84f..f01aaca2cc 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -121,7 +121,8 @@ typedef enum pgssVersion
 	PGSS_V1_2,
 	PGSS_V1_3,
 	PGSS_V1_8,
-	PGSS_V1_9
+	PGSS_V1_9,
+	PGSS_V1_10
 } pgssVersion;
 
 typedef enum pgssStoreKind
@@ -183,8 +184,10 @@ typedef struct Counters
 	int64		local_blks_written; /* # of local disk blocks written */
 	int64		temp_blks_read; /* # of temp blocks read */
 	int64		temp_blks_written;	/* # of temp blocks written */
-	double		blk_read_time;	/* time spent reading, in msec */
-	double		blk_write_time; /* time spent writing, in msec */
+	double		blk_read_time;	/* time spent reading blocks, in msec */
+	double		blk_write_time; /* time spent writing blocks, in msec */
+	double		temp_blk_read_time;	/* time spent reading temp blocks, in msec */
+	double		temp_blk_write_time; /* time spent writing temp blocks, in msec */
 	double		usage;			/* usage factor */
 	int64		wal_records;	/* # of WAL records generated */
 	int64		wal_fpi;		/* # of WAL full page images generated */
@@ -302,6 +305,7 @@ PG_FUNCTION_INFO_V1(pg_stat_statements_1_2);
 PG_FUNCTION_INFO_V1(pg_stat_statements_1_3);
 PG_FUNCTION_INFO_V1(pg_stat_statements_1_8);
 PG_FUNCTION_INFO_V1(pg_stat_statements_1_9);
+PG_FUNCTION_INFO_V1(pg_stat_statements_1_10);
 PG_FUNCTION_INFO_V1(pg_stat_statements);
 PG_FUNCTION_INFO_V1(pg_stat_statements_info);
 
@@ -1369,6 +1373,8 @@ pgss_store(const char *query, uint64 queryId,
 		e->counters.temp_blks_written += bufusage->temp_blks_written;
 		e->counters.blk_read_time += INSTR_TIME_GET_MILLISEC(bufusage->blk_read_time);
 		e->counters.blk_write_time += INSTR_TIME_GET_MILLISEC(bufusage->blk_write_time);
+		e->counters.temp_blk_read_time += INSTR_TIME_GET_MILLISEC(bufusage->temp_blk_read_time);
+		e->counters.temp_blk_write_time += INSTR_TIME_GET_MILLISEC(bufusage->temp_blk_write_time);
 		e->counters.usage += USAGE_EXEC(total_time);
 		e->counters.wal_records += walusage->wal_records;
 		e->counters.wal_fpi += walusage->wal_fpi;
@@ -1422,7 +1428,8 @@ pg_stat_statements_reset(PG_FUNCTION_ARGS)
 #define PG_STAT_STATEMENTS_COLS_V1_3	23
 #define PG_STAT_STATEMENTS_COLS_V1_8	32
 #define PG_STAT_STATEMENTS_COLS_V1_9	33
-#define PG_STAT_STATEMENTS_COLS			33	/* maximum of above */
+#define PG_STAT_STATEMENTS_COLS_V1_10	35
+#define PG_STAT_STATEMENTS_COLS			35	/* maximum of above */
 
 /*
  * Retrieve statement statistics.
@@ -1434,6 +1441,16 @@ pg_stat_statements_reset(PG_FUNCTION_ARGS)
  * expected API version is identified by embedding it in the C name of the
  * function.  Unfortunately we weren't bright enough to do that for 1.1.
  */
+Datum
+pg_stat_statements_1_10(PG_FUNCTION_ARGS)
+{
+	bool		showtext = PG_GETARG_BOOL(0);
+
+	pg_stat_statements_internal(fcinfo, PGSS_V1_10, showtext);
+
+	return (Datum) 0;
+}
+
 Datum
 pg_stat_statements_1_9(PG_FUNCTION_ARGS)
 {
@@ -1547,6 +1564,10 @@ pg_stat_statements_internal(FunctionCallInfo fcinfo,
 			if (api_version != PGSS_V1_9)
 				elog(ERROR, "incorrect number of output arguments");
 			break;
+		case PG_STAT_STATEMENTS_COLS_V1_10:
+			if (api_version != PGSS_V1_10)
+				elog(ERROR, "incorrect number of output arguments");
+			break;
 		default:
 			elog(ERROR, "incorrect number of output arguments");
 	}
@@ -1747,6 +1768,11 @@ pg_stat_statements_internal(FunctionCallInfo fcinfo,
 			values[i++] = Float8GetDatumFast(tmp.blk_read_time);
 			values[i++] = Float8GetDatumFast(tmp.blk_write_time);
 		}
+		if (api_version >= PGSS_V1_10)
+		{
+			values[i++] = Float8GetDatumFast(tmp.temp_blk_read_time);
+			values[i++] = Float8GetDatumFast(tmp.temp_blk_write_time);
+		}
 		if (api_version >= PGSS_V1_8)
 		{
 			char		buf[256];
@@ -1771,6 +1797,7 @@ pg_stat_statements_internal(FunctionCallInfo fcinfo,
 					 api_version == PGSS_V1_3 ? PG_STAT_STATEMENTS_COLS_V1_3 :
 					 api_version == PGSS_V1_8 ? PG_STAT_STATEMENTS_COLS_V1_8 :
 					 api_version == PGSS_V1_9 ? PG_STAT_STATEMENTS_COLS_V1_9 :
+					 api_version == PGSS_V1_10 ? PG_STAT_STATEMENTS_COLS_V1_10 :
 					 -1 /* fail if you forget to update this assert */ ));
 
 		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
diff --git a/contrib/pg_stat_statements/pg_stat_statements.control b/contrib/pg_stat_statements/pg_stat_statements.control
index 2f1ce6ed50..0747e48138 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.control
+++ b/contrib/pg_stat_statements/pg_stat_statements.control
@@ -1,5 +1,5 @@
 # pg_stat_statements extension
 comment = 'track planning and execution statistics of all SQL statements executed'
-default_version = '1.9'
+default_version = '1.10'
 module_pathname = '$libdir/pg_stat_statements'
 relocatable = true
diff --git a/contrib/pg_stat_statements/sql/oldextversions.sql b/contrib/pg_stat_statements/sql/oldextversions.sql
index f2e822acd3..ea0f7c143d 100644
--- a/contrib/pg_stat_statements/sql/oldextversions.sql
+++ b/contrib/pg_stat_statements/sql/oldextversions.sql
@@ -36,4 +36,14 @@ AlTER EXTENSION pg_stat_statements UPDATE TO '1.8';
 \d pg_stat_statements
 SELECT pg_get_functiondef('pg_stat_statements_reset'::regproc);
 
+-- New function pg_stat_statement_info, and new function
+-- and view for pg_stat_statements introduced in 1.9
+AlTER EXTENSION pg_stat_statements UPDATE TO '1.9';
+SELECT pg_get_functiondef('pg_stat_statements_info'::regproc);
+\d pg_stat_statements
+
+-- New functions and views for pg_stat_statements in 1.10
+AlTER EXTENSION pg_stat_statements UPDATE TO '1.10';
+\d pg_stat_statements
+
 DROP EXTENSION pg_stat_statements;
diff --git a/doc/src/sgml/pgstatstatements.sgml b/doc/src/sgml/pgstatstatements.sgml
index 3a7e36bd13..62b15dd19e 100644
--- a/doc/src/sgml/pgstatstatements.sgml
+++ b/doc/src/sgml/pgstatstatements.sgml
@@ -338,7 +338,7 @@
        <structfield>blk_read_time</structfield> <type>double precision</type>
       </para>
       <para>
-       Total time the statement spent reading blocks, in milliseconds
+       Total time the statement spent reading data file blocks, in milliseconds
        (if <xref linkend="guc-track-io-timing"/> is enabled, otherwise zero)
       </para></entry>
      </row>
@@ -348,7 +348,27 @@
        <structfield>blk_write_time</structfield> <type>double precision</type>
       </para>
       <para>
-       Total time the statement spent writing blocks, in milliseconds
+       Total time the statement spent writing data file blocks, in milliseconds
+       (if <xref linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>temp_blk_read_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total time the statement spent reading temp file blocks, in milliseconds
+       (if <xref linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>temp_blk_write_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total time the statement spent writing temp file blocks, in milliseconds
        (if <xref linkend="guc-track-io-timing"/> is enabled, otherwise zero)
       </para></entry>
      </row>
-- 
2.24.3 (Apple Git-128)

#16Julien Rouhaud
rjuju123@gmail.com
In reply to: Masahiko Sawada (#15)
Re: Showing I/O timings spent reading/writing temp buffers in EXPLAIN

On Tue, Apr 05, 2022 at 10:40:04AM +0900, Masahiko Sawada wrote:

On Tue, Apr 5, 2022 at 1:31 AM Julien Rouhaud <rjuju123@gmail.com> wrote:

Yes. In normal circumstances it shouldn't need a lot of time to do that, but
I'm not so sure with e.g. network filesystems. I'm not strongly in favor of
counting it, especially since smgrextend doesn't either.

Good point. I think that adding a new place to track I/O timing can be
a separate patch, so we can probably work on it for PG16 or later.

Agreed.

I've attached updated patches; please review them.

It looks good to me, just one minor thing in 002:

@@ -183,8 +184,10 @@ typedef struct Counters
 	int64		local_blks_written; /* # of local disk blocks written */
 	int64		temp_blks_read; /* # of temp blocks read */
 	int64		temp_blks_written;	/* # of temp blocks written */
-	double		blk_read_time;	/* time spent reading, in msec */
-	double		blk_write_time; /* time spent writing, in msec */
+	double		blk_read_time;	/* time spent reading blocks, in msec */
+	double		blk_write_time; /* time spent writing blocks, in msec */
+	double		temp_blk_read_time;	/* time spent reading temp blocks, in msec */
+	double		temp_blk_write_time; /* time spent writing temp blocks, in msec */

maybe the comments should respectively be data file blocks and temp file
blocks.

This is a minor detail and the rest of the patch looks good to me, so I'm
marking the patch as Ready for Committer!

#17Michael Paquier
michael@paquier.xyz
In reply to: Julien Rouhaud (#16)
Re: Showing I/O timings spent reading/writing temp buffers in EXPLAIN

On Tue, Apr 05, 2022 at 11:57:14AM +0800, Julien Rouhaud wrote:

This is a minor detail and the rest of the patch looks good to me, so I'm
marking the patch as Ready for Committer!

@@ -440,10 +442,14 @@ BufFileLoadBuffer(BufFile *file)
+   if (track_io_timing)
+       INSTR_TIME_SET_CURRENT(io_start);

In places where we don't have clock_gettime(), this means using
gettimeofday(). I would not underestimate the performance impact of
such a change, even if track_io_timing is already known to have a
certain overhead on some platforms.
--
Michael

#18Julien Rouhaud
rjuju123@gmail.com
In reply to: Michael Paquier (#17)
Re: Showing I/O timings spent reading/writing temp buffers in EXPLAIN

On Thu, Apr 07, 2022 at 03:58:46PM +0900, Michael Paquier wrote:

@@ -440,10 +442,14 @@ BufFileLoadBuffer(BufFile *file)
+   if (track_io_timing)
+       INSTR_TIME_SET_CURRENT(io_start);

In places where we don't have clock_gettime(), this means using
gettimeofday(). I would not underestimate the performance impact of
such a change, even if track_io_timing is already known to have a
certain overhead on some platforms.

Sure, but gettimeofday() has been implemented in vDSO for quite some time on
most platforms, so it shouldn't hurt that much on mainstream platforms
especially compared to the cost of whatever operation is actually using that
temporary file.

I don't think that having an extra GUC for temp IO is sensible, if that's what
you're suggesting? Or are you just asking to do some benchmarking on some
platform where getting the time is known to be slow (Windows?).

#19Michael Paquier
michael@paquier.xyz
In reply to: Julien Rouhaud (#18)
Re: Showing I/O timings spent reading/writing temp buffers in EXPLAIN

On Thu, Apr 07, 2022 at 03:14:01PM +0800, Julien Rouhaud wrote:

Sure, but gettimeofday() has been implemented in vDSO for quite some time on
most platforms, so it shouldn't hurt that much on mainstream platforms
especially compared to the cost of whatever operation is actually using that
temporary file.

I don't think that having an extra GUC for temp IO is sensible, if that's what
you're suggesting? Or are you just asking to do some benchmarking on some
platform where getting the time is known to be slow (Windows?).

I am asking about the latter, but the former could be one solution if
the latter proves to be a problem, and this has not been discussed on
the thread yet. So, with some kind of worst-case scenario, how much
worse the performance gets once you add those extra calls when
compared to HEAD? I think that we'd better be careful with any
additions of INSTR_TIME_SET_CURRENT().
--
Michael

#20Julien Rouhaud
rjuju123@gmail.com
In reply to: Michael Paquier (#19)
Re: Showing I/O timings spent reading/writing temp buffers in EXPLAIN

On Thu, Apr 07, 2022 at 04:24:54PM +0900, Michael Paquier wrote:

I am asking about the latter, but the former could be one solution if
the latter proves to be a problem, and this has not been discussed on
the thread yet. So, with some kind of worst-case scenario, how much
worse the performance gets once you add those extra calls when
compared to HEAD? I think that we'd better be careful with any
additions of INSTR_TIME_SET_CURRENT().

I just did a quick test on my linux box, using this data:
CREATE TABLE tt AS select generate_series(1, 10000) id;
VACUUM ANALYZE tt;

and this scenario:
SET work_mem TO '64kB';
SELECT count(*) FROM (SELECT id FROM tt ORDER BY id) s;

which yields this plan:

QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Aggregate (cost=1349.39..1349.40 rows=1 width=8) (actual time=5.417..5.417 rows=1 loops=1)
-> Sort (cost=1199.39..1224.39 rows=10000 width=4) (actual time=2.910..4.422 rows=10000 loops=1)
Sort Key: tt.id
Sort Method: external merge Disk: 144kB
-> Seq Scan on tt (cost=0.00..145.00 rows=10000 width=4) (actual time=0.008..1.239 rows=10000 loops=1)
Planning Time: 0.405 ms
Execution Time: 5.524 ms

So maybe not the worst that could be tested, but probably bad enough for this
patch.

I ran that with pgbench, 4 clients (I have 4 cores) for 30 seconds, 3 times.

Comparing master and this patch with track_io_timing activated, I see a 0.95%
overhead, with a 2.6% noise level.
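
For anyone wanting to reproduce this, the run boils down to the scenario
above driven by pgbench with a custom script, roughly like the following
(the file name and the exact invocation below are illustrative, not
copied verbatim from the run):

-- temp_sort.sql (hypothetical file name)
-- driven with something like: pgbench -n -c 4 -T 30 -f temp_sort.sql
SET work_mem TO '64kB';
SELECT count(*) FROM (SELECT id FROM tt ORDER BY id) s;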

#21Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Julien Rouhaud (#20)
Re: Showing I/O timings spent reading/writing temp buffers in EXPLAIN

On Thu, Apr 7, 2022 at 4:55 PM Julien Rouhaud <rjuju123@gmail.com> wrote:

I just did a quick test on my linux box, using this data:
CREATE TABLE tt AS select generate_series(1, 10000) id;
VACUUM ANALYZE tt;

and this scenario:
SET work_mem TO '64kB';
SELECT count(*) FROM (SELECT id FROM tt ORDER BY id) s;

which yields this plan:

QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Aggregate (cost=1349.39..1349.40 rows=1 width=8) (actual time=5.417..5.417 rows=1 loops=1)
-> Sort (cost=1199.39..1224.39 rows=10000 width=4) (actual time=2.910..4.422 rows=10000 loops=1)
Sort Key: tt.id
Sort Method: external merge Disk: 144kB
-> Seq Scan on tt (cost=0.00..145.00 rows=10000 width=4) (actual time=0.008..1.239 rows=10000 loops=1)
Planning Time: 0.405 ms
Execution Time: 5.524 ms

So maybe not the worst that could be tested, but probably bad enough for this
patch.

I ran that with pgbench, 4 clients (I have 4 cores) for 30 seconds, 3 times.

Comparing master and this patch with track_io_timing activated, I see a 0.95%
overhead, with a 2.6% noise level.

I've done the same test with a larger data set (10M tuples) on my
machine (MacOS):

HEAD: 5418.869 ms
Patched: 5367.234 ms

I can see about 1% overhead.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#22Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiko Sawada (#21)
Re: Showing I/O timings spent reading/writing temp buffers in EXPLAIN

On Thu, Apr 7, 2022 at 5:52 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've done the same test with a larger data set (10M tuples) on my
machine (MacOS):

HEAD: 5418.869 ms
Patched: 5367.234 ms

Oops, the results are opposite:

HEAD: 5367.234 ms
Patched: 5418.869 ms

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#23Michael Paquier
michael@paquier.xyz
In reply to: Masahiko Sawada (#22)
Re: Showing I/O timings spent reading/writing temp buffers in EXPLAIN

On Thu, Apr 07, 2022 at 09:48:02PM +0900, Masahiko Sawada wrote:

Oops, the results are opposite:

HEAD: 5367.234 ms
Patched: 5418.869 ms

I have been playing with external sorts & friends after running an
instance on scissors (fsync=off, PGDATA on tmpfs, etc.), even forcing
a compilation of the code with gettimeofday(), and I can see a
tendency of a ~1% impact with the patch in this configuration. So
there is a tendency, though it is also rather close to the usual
noise range. Anyway, this is going to be helpful for debugging when
temp file I/O is a bottleneck, so.. Applied 0001 with some tweaks to
the docs and some indentation fixes.
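
For a quick look at the new output, something along these lines should
make the temp read/write times show up under I/O Timings (reusing the
scenario from upthread; actual timings depend on the machine, so no
sample output is shown here):

SET track_io_timing = on;
SET work_mem TO '64kB';
EXPLAIN (ANALYZE, BUFFERS)
  SELECT count(*) FROM (SELECT id FROM tt ORDER BY id) s;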

Now, 0002 is straight-forward but I need more coffee and lunch..
--
Michael

#24Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#23)
Re: Showing I/O timings spent reading/writing temp buffers in EXPLAIN

On Fri, Apr 08, 2022 at 11:34:17AM +0900, Michael Paquier wrote:

Now, 0002 is straight-forward but I need more coffee and lunch..

Done this one as well, as of 76cbf7e with a few tweaks. 1.9 and 1.10
changed the definition of pg_stat_statements, so I have added two
extra queries for those upgrade paths in oldextversions.sql.
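
Once an installation is on 1.10, the new counters can be checked with
something like this (the ordering and LIMIT are only illustrative):

ALTER EXTENSION pg_stat_statements UPDATE TO '1.10';
SELECT query, temp_blks_read, temp_blks_written,
       temp_blk_read_time, temp_blk_write_time
  FROM pg_stat_statements
 ORDER BY temp_blk_read_time + temp_blk_write_time DESC
 LIMIT 5;
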
--
Michael

#25Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Michael Paquier (#24)
Re: Showing I/O timings spent reading/writing temp buffers in EXPLAIN

On Fri, Apr 8, 2022 at 1:37 PM Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Apr 08, 2022 at 11:34:17AM +0900, Michael Paquier wrote:

Now, 0002 is straight-forward but I need more coffee and lunch..

Done this one as well, as of 76cbf7e with few tweaks. 1.9 and 1.10
changed the definition of pg_stat_statements, so I have added two
extra queries for those upgrade paths in oldextversions.sql.

Thank you for committing both patches. Agreed with these changes.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/